Paris-Saclay University

Introduction to Machine Learning

Support Vector Machine

Introduction to SVMs

Support Vector Machines (SVMs)

  • Definition: A supervised learning algorithm for classification and regression.
  • Key Idea: Find the optimal hyperplane that separates data into distinct classes.
  • Applications: Image recognition, text classification, bioinformatics.

Core Concepts of SVMs

  • Decision boundary: Separates data points of different classes.
  • Margin: the distance between the decision boundary and the closest data points of each class.
  • Support vectors: Data points that determine the position of the hyperplane.

Visualization: SVM Decision Boundary

  • Data points: Two classes represented by different markers.
  • Optimal hyperplane: Maximizes the margin between classes.
[Figure: decision boundary and support vectors]

SVM with Linear Kernels

Linear SVMs

  • Applicable when data is linearly separable.
  • Objective: Maximize the margin while minimizing classification errors.
  • The optimization is formulated as a convex quadratic program (detailed in the optimization slide below).
[Figure: a 2D dataset with an optimal hyperplane]

Linear SVMs: Example

  • 2D classification problem with linearly separable data.
  • Hyperplane equation: $w\cdot x + b = 0$.
[Figure: a 2D dataset with an optimal hyperplane]
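A minimal sketch of this example in scikit-learn (the toy data points are purely illustrative): fit a linear SVC on separable 2D data and read back the hyperplane parameters $w$ and $b$.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical example data).
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='linear', C=1.0).fit(X, y)

# The separating hyperplane w . x + b = 0:
w = model.coef_[0]
b = model.intercept_[0]
print("w =", w, "b =", b)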

Optimization Techniques in SVMs

  • SVM optimization aims to maximize the margin while minimizing classification errors.
  • Formulated as a convex optimization problem:
    • Minimize: $\frac{1}{2} \|w\|^2$
    • Subject to: $y_i (w \cdot x_i + b) \geq 1$ for all $i$
  • Uses Lagrange multipliers to solve the constrained optimization.
  • Support vectors are the only points influencing the optimization.
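This is directly visible after fitting a model: a minimal sketch (toy data assumed) showing that only the support vectors carry non-zero Lagrange multipliers and determine the hyperplane.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='linear', C=1.0).fit(X, y)
print(model.support_)          # indices of the support vectors
print(model.support_vectors_)  # their coordinates
print(model.dual_coef_)        # Lagrange multipliers (signed by the class labels)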

Non-Linear SVMs

Challenges with Linear SVMs

  • Not all datasets are linearly separable.
  • Overlapping data points make linear separation impossible.
  • Example: Spiral or concentric circle datasets.
[Figure: spiral and concentric circle datasets]

Introducing the Kernel Trick

  • Purpose: Transform data into a higher-dimensional space where it becomes linearly separable.
  • Technique: Use kernel functions to compute dot products in the transformed space.
  • Common kernels: Polynomial, Radial Basis Function (RBF), Sigmoid.

Visualization: Kernel Trick

  • Example: Data that is inseparable in 2D becomes separable in 3D.
[Figure: data transformation using the kernel trick]
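A minimal sketch of this effect on a concentric-circles dataset (synthetic data from make_circles; exact scores will vary): a linear SVM stays near chance level while an RBF SVM separates the classes.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))  # linear ≈ chance, rbf ≈ 1.0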

Popular Kernel Functions

  • Linear Kernel: Keeps data in the original space.
  • Polynomial Kernel: Maps data to higher-degree polynomial space.
  • RBF Kernel: Creates a radial influence around each data point:
    • $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
[Figure: linear, polynomial, and RBF kernel transformations]
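The RBF formula above can be checked directly against scikit-learn's implementation; a minimal sketch with toy vectors and an arbitrary γ:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x  = np.array([[1.0, 2.0]])
xp = np.array([[2.0, 0.0]])
gamma = 0.5

# K(x, x') = exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x - xp) ** 2))
print(manual, rbf_kernel(x, xp, gamma=gamma)[0, 0])  # both ≈ 0.082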

Tuning and Practical Considerations

Hyperparameter Tuning

  • $C$ (Regularization parameter): Controls trade-off between maximizing margin and minimizing misclassification.
  • $\gamma$ (Gamma): Defines the influence of individual training examples in RBF kernels.
  • Importance of cross-validation in hyperparameter tuning.
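One way to see this trade-off is a validation curve over $\gamma$; a minimal sketch on a synthetic dataset (the parameter range is illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_range = np.logspace(-3, 2, 6)
train_scores, val_scores = validation_curve(
    SVC(kernel='rbf', C=1.0), X, y,
    param_name='gamma', param_range=param_range, cv=5)

# A large gap between training and validation scores signals overfitting
# (gamma too large); low scores on both signal underfitting (gamma too small).
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))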

Model Evaluation for SVMs

  • Performance Metrics:
    • Accuracy: Fraction of correctly classified instances.
    • Precision and Recall: Balance between false positives and false negatives.
    • F1 Score: Harmonic mean of precision and recall.
  • Cross-Validation:
    • Divide dataset into $k$ subsets (folds).
    • Train on $k-1$ folds and validate on the remaining fold.
    • Avoids overfitting by testing on unseen data.
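A minimal sketch of k-fold cross-validation for an SVM (synthetic data, k = 5):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(SVC(kernel='rbf', C=1, gamma='scale'), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average estimate of generalization performance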

Common Mistakes and Best Practices

  • Feature Scaling: Always standardize features for SVMs.
  • Kernel Selection: Experiment with different kernels for non-linear data.
  • Overfitting: Use cross-validation and regularization to avoid overfitting.
  • Hyperparameter Tuning: Optimize $C$ and $\gamma$ for the best results.
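These practices combine naturally in a single scikit-learn pipeline, so that scaling is fitted only on the training folds during tuning; a minimal sketch (dataset and parameter grid are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)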

Advanced Topics and Applications

Multi-Class SVMs

  • SVMs are inherently binary classifiers.
  • Strategies for multi-class classification:
    • One-vs-One (OvO): Train $\frac{n(n-1)}{2}$ classifiers for $n$ classes.
    • One-vs-Rest (OvR): Train $n$ classifiers, each separating one class from the rest.
  • Practical applications: Handwritten digit classification, object recognition.
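A minimal sketch of both strategies on a synthetic 4-class dataset (scikit-learn's SVC already applies OvO internally; the wrappers make the strategy explicit):

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)

ovo = OneVsOneClassifier(SVC(kernel='rbf')).fit(X, y)   # n(n-1)/2 classifiers
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)  # n classifiers
print(len(ovo.estimators_), len(ovr.estimators_))       # 6 and 4 for n = 4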

Visualization: Multi-Class SVMs

  • Example: Handwritten digit classification with OvO and OvR strategies.
[Figure: decision boundaries for multi-class SVMs]

Support Vector Regression (SVR)

  • Extends SVM principles to regression tasks.
  • Objective: Fit a function while ignoring errors inside an ε-insensitive margin and penalizing deviations beyond it.
  • Key differences from classification:
    • Focuses on fitting a line/curve rather than separating classes.
    • Penalizes deviations from the ε-margin.
  • Applications: Predicting trends, stock prices, and more.
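A minimal sketch of SVR on synthetic 1-D data (the values of C and ε are illustrative):

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# Deviations smaller than epsilon are not penalized.
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict(X[:5]))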

Practical Tips for SVM Implementation

  • Scale your data:
    • SVMs are sensitive to feature scaling.
    • Normalize or standardize features to improve performance.
  • Choose the kernel wisely:
    • Start with linear kernel for linearly separable data.
    • Use RBF kernel for non-linear problems.
  • Optimize hyperparameters using grid search or randomized search.

Practical Implementation in Scikit-learn

  • Define an SVM model:
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', C=1, gamma='scale')
  • Train the model:
    model.fit(X_train, y_train)
  • Make predictions and evaluate:
    y_pred = model.predict(X_test)
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))

Grid Search for Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 
    'gamma': [1, 0.1, 0.01], 'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)
  • Best parameters: grid.best_params_
  • Best estimator: grid.best_estimator_
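A typical follow-up (sketch, assuming the held-out X_test, y_test from the previous slide): inspect the selected parameters and evaluate the refitted estimator.

print(grid.best_params_)            # best parameter combination found
print(grid.score(X_test, y_test))   # refitted best estimator on held-out data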

Advantages of SVMs

  • Effective in high-dimensional spaces.
  • Works well when there is a clear margin of separation between classes.
  • Robust to overfitting, especially in high-dimensional spaces.
  • Versatile due to the kernel trick.

Limitations of SVMs

  • Computational Complexity:
    • Training time scales approximately as $O(n^2)$ to $O(n^3)$, where $n$ is the number of training samples.
    • Can be impractical for large datasets with thousands or millions of samples.
  • Memory Usage:
    • SVMs store support vectors, which can be numerous for complex datasets.
    • High-dimensional datasets further increase memory requirements.
  • Kernel and Hyperparameter Selection:
    • Performance is sensitive to the choice of kernel (e.g., linear, RBF, polynomial).
    • Requires careful tuning of hyperparameters like $C$ and $\gamma$, which can be computationally expensive.
  • Interpretability:
    • Decision boundaries are less intuitive compared to simpler models like decision trees.
    • Difficult to explain results to non-technical stakeholders.
  • Scalability:
    • Not suitable for datasets with millions of samples without specialized implementations (e.g., linear SVMs in Scikit-learn).

Comparison of SVMs with Other Algorithms

  • Decision Trees: Easier to interpret but less robust in high-dimensional spaces.
  • Linear Models: Faster for large datasets but less effective for non-linear problems.
  • Ensemble Methods: Often achieve better accuracy but are computationally expensive.
  • SVMs balance complexity and accuracy, making them versatile.

SVM Applications

  • Image classification:
    • Detecting handwritten digits (e.g., MNIST dataset).
    • Facial recognition systems.
  • Text classification:
    • Spam email detection.
    • Sentiment analysis for customer reviews.
  • Bioinformatics:
    • Gene classification and cancer diagnosis.

Real-World Example: SVMs for Spam Detection

  • Features: Presence of keywords, email length, frequency of links.
  • Model: Train an SVM classifier to differentiate spam from non-spam.
                            flowchart LR
                            subgraph Prediction["Production Deployment"]
                                direction TB
                                I[New Email] --> J["Preprocessing"]
                                J --> K["Feature Extraction"]
                                K --> L["SVM Prediction"]
                                
                                L --> M{Classification}
                                M -->|"Score > 0.9"| N["High Confidence Spam"]
                                M -->|"0.5 < Score < 0.9"| O["Potential Spam"]
                                M -->|"Score < 0.5"| P["Not Spam"]
                                
                                N --> Q["Spam Folder"]
                                O --> R["Quarantine for Review"]
                                P --> S["Inbox"]
                            end
                        
                            subgraph Training["Training Phase"]
                                direction TB
                                A[Email Training Dataset] --> AA["Data Preprocessing"]
                                
                                subgraph preprocess["Preprocessing Steps"]
                                    direction TB
                                    AA --> |Clean Data| AA1["• Remove HTML tags
                                    • Convert to lowercase
                                    • Remove special characters
                                    • Handle missing values
                                    • Remove duplicate emails"]
                                    AA1 --> AA2["Text Normalization
                                    • Tokenization
                                    • Stop word removal
                                    • Lemmatization
                                    • Spell checking"]
                                end
                        
                                subgraph features["Feature Engineering"]
                                    direction TB
                                    B["Feature Extraction"] --> C["Content-Based Features"]
                                    B --> D["Metadata Features"]
                                    B --> E["Behavioral Features"]
                                    
                                    C --> |Extract| C1["• Word frequency (TF-IDF)
                                    • Keywords presence
                                    • Character n-grams
                                    • Word embeddings
                                    • Spam word ratio"]
                                    
                                    D --> |Extract| D1["• Email headers
                                    • Sender information
                                    • Time sent
                                    • Number of recipients
                                    • Mail client used"]
                                    
                                    E --> |Extract| E1["• Link frequency
                                    • Image ratio
                                    • Email size
                                    • Text-to-HTML ratio
                                    • Recipient patterns"]
                                end
                        
                                subgraph model["Model Training"]
                                    direction TB
                                    F["SVM Configuration"] --> F1["Kernel Selection
                                    • Linear for high-dimensional data
                                    • RBF for complex patterns
                                    • Polynomial for non-linear data"]
                                    
                                    F1 --> F2["Hyperparameter Tuning
                                    • C: Control overfitting
                                    • Gamma: Kernel coefficient
                                    • Degree: Polynomial kernel
                                    • Cross-validation splits"]
                        
                                    F2 --> G["Model Training"]
                                    
                                    G --> H["Model Evaluation"]
                                    H --> H1["Performance Metrics
                                    • Accuracy
                                    • Precision
                                    • Recall
                                    • F1-Score
                                    • ROC-AUC"]
                                end
                            end
                        
                            H --> L
                        
                            style Training fill:#e1f5fe,stroke:#01579b
                            style Prediction fill:#f3e5f5,stroke:#4a148c
                            style preprocess fill:#fff,stroke:#666
                            style features fill:#fff,stroke:#666
                            style model fill:#fff,stroke:#666
                            
                            classDef processNode fill:#fff,stroke:#333,stroke-width:2px
                            classDef dataNode fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
                            classDef decisionNode fill:#f5f5f5,stroke:#424242,stroke-width:2px
                            classDef metricsNode fill:#ffebee,stroke:#c62828,stroke-width:2px
                            
                            class A,I dataNode
                            class F,G processNode
                            class M decisionNode
                            class H1 metricsNode
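A minimal sketch of the core training step in the pipeline above, using TF-IDF features and a linear SVM (the tiny email corpus and labels are purely illustrative placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = spam, 0 = not spam.
emails = ["win a free prize now, click the link",
          "meeting rescheduled to Monday at 10am",
          "cheap loans, limited offer, act now",
          "please review the attached project report"]
labels = [1, 0, 1, 0]

spam_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
spam_clf.fit(emails, labels)
print(spam_clf.predict(["free offer, click now"]))  # likely [1] on this toy corpus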
                        

Conclusion and Summary

  • SVMs are powerful tools for both classification and regression tasks.
  • The kernel trick enables solving non-linear problems.
  • Hyperparameter tuning is critical for achieving optimal performance.
  • Applications span diverse domains, making SVMs a versatile choice.