Paris-Saclay University

Introduction to Machine Learning

Supervised Learning - Ensemble Methods - Random Forests and Boosting

Introduction

Introduction to Ensemble Methods

  • Definition: Ensemble methods combine multiple models to produce a more robust and accurate prediction.
  • Key Principle: Aggregating diverse models reduces errors by compensating for individual weaknesses.
  • Types of Ensembles:
    • Bagging: Parallel approach using independent models (e.g., Random Forests).
    • Boosting: Sequential approach improving model performance iteratively (e.g., AdaBoost, Gradient Boosting).

Advantages of Ensemble Methods

  • Increased Accuracy: Combines the strengths of individual models to reduce overall error.
  • Robustness: More resistant to overfitting compared to standalone models.
  • Flexibility: Can combine diverse models (e.g., decision trees, linear models).
  • Feature Importance Insights: Random Forests highlight key drivers of predictions.
Figure: a single decision tree compared with a random forest ensemble.

Bagging - Random Forests

Random Forests: An Overview

  • Definition: A Random Forest is an ensemble of decision trees trained using the bagging method.
  • How It Works (see the sketch after this list):
    1. Generate multiple random samples (bootstrapping) from the original dataset.
    2. Train a decision tree on each bootstrap sample.
    3. Combine predictions:
      • Classification: Majority vote.
      • Regression: Average prediction.
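The workflow above can be sketched with scikit-learn. The dataset (the built-in breast-cancer data) and the number of trees are illustrative choices, not part of the course material:

```python
# Minimal Random Forest sketch with scikit-learn (illustrative dataset and settings).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training set;
# class predictions are combined by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```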

Feature Importance in Random Forests

  • Why Measure Feature Importance?
    • Identify key drivers of the model's predictions.
    • Improve interpretability of complex models.
    • Guide feature engineering efforts.
  • How It's Measured: Based on the total decrease in Gini impurity (or information gain) a feature contributes across all decision-tree splits, averaged over the trees of the forest (see the sketch below).
Figure: feature importance bar chart and feature correlation heatmap.
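A minimal sketch of reading impurity-based importances from a fitted forest; the dataset and hyperparameters are again illustrative:

```python
# Impurity-based feature importances of a fitted Random Forest (illustrative data).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ aggregates each feature's impurity decrease over all
# splits, averaged across the trees of the forest (values sum to 1).
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```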

Random Forests vs. Single Decision Trees

Aspect           | Single Decision Trees      | Random Forests
Accuracy         | Prone to overfitting       | Higher accuracy due to aggregation
Robustness       | Highly sensitive to noise  | More robust to noise
Interpretability | Simple and explainable     | Complex, harder to interpret
Scalability      | Faster for small datasets  | Handles high-dimensional data well
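A quick way to see the accuracy gap in practice is to cross-validate both models on the same data; the dataset and settings below are illustrative:

```python
# Comparing a single decision tree with a Random Forest by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for a single tree vs. an ensemble of trees.
print("Single tree  :", cross_val_score(tree, X, y, cv=5).mean().round(3))
print("Random forest:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```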

Boosting

Boosting: An Introduction

  • Definition: Boosting is an ensemble technique that builds models sequentially, with each new model correcting errors of the previous ones.
  • Key Principle: Focus on "hard-to-learn" examples by adjusting their weights for subsequent models.
  • Comparison with Bagging:
    • Bagging: Trains models independently in parallel.
    • Boosting: Trains models sequentially, leveraging previous results.

How Boosting Works

  1. Start with an initial weak model (e.g., decision stump).
  2. Evaluate the model's performance and assign weights to examples:
    • Increase weights for misclassified examples.
    • Decrease weights for correctly classified examples.
  3. Train the next model using updated weights to focus on harder examples.
  4. Combine all models' predictions through weighted voting or averaging (illustrated in the sketch below).
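A hand-rolled sketch of this loop, using an AdaBoost-style weight update with decision stumps as the weak learners. It is a simplified illustration with an artificial dataset, not a production implementation:

```python
# Hand-rolled boosting loop illustrating the weight-update idea.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)            # work with labels in {-1, +1}

n_rounds = 10
weights = np.full(len(X), 1 / len(X))         # step 1: equal example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)        # weak learner (stump)
    stump.fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)
    err = np.sum(weights[pred != y_signed])            # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))    # weight of this model
    # steps 2-3: raise weights of misclassified examples, lower the others
    weights *= np.exp(-alpha * y_signed * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# step 4: weighted vote of all weak models
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("Training accuracy:", np.mean(ensemble_pred == y_signed))
```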

AdaBoost (Adaptive Boosting)

  • Concept: Builds a strong classifier by combining many weak classifiers (e.g., decision stumps).
  • Steps:
    1. Initialize weights for all examples equally.
    2. Train a weak model and compute its error rate.
    3. Update example weights:
      • Increase weights for misclassified examples.
      • Decrease weights for correctly classified examples.
    4. Repeat for a predefined number of iterations.
  • Final Prediction: Weighted sum of all weak classifiers' outputs (see the example below).
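In practice AdaBoost is usually used through a library. A minimal scikit-learn sketch, with illustrative data and settings, relying on the default depth-1 decision stump as the weak learner:

```python
# AdaBoost with decision stumps via scikit-learn (illustrative settings).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the weak learner is a depth-1 decision tree (a stump);
# n_estimators is the number of boosting rounds, and learning_rate shrinks
# each classifier's contribution to the final weighted vote.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```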

Gradient Boosting

  • Concept: Builds models iteratively, minimizing residual errors (differences between true and predicted values).
  • Steps (sketched in code below):
    1. Train an initial model (e.g., decision tree).
    2. Compute the residual errors of the model.
    3. Train a new model to predict these residuals.
    4. Combine the original predictions with residual predictions.
    5. Repeat until residuals are sufficiently small.
  • Applications:
    • Regression tasks
    • Classification tasks
Figure: gradient boosting applied to a regression task.
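The residual-fitting loop can be written out by hand for the squared-loss regression case. This is a simplified sketch with illustrative settings; in practice one would use scikit-learn's GradientBoostingRegressor or a library such as XGBoost or LightGBM:

```python
# Hand-rolled gradient boosting for regression: each round fits a small tree
# to the current residuals and adds a damped version of its predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())      # step 1: initial constant model
trees = []

for _ in range(100):
    residuals = y - prediction              # step 2: current residual errors
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                  # step 3: fit a tree to the residuals
    prediction += learning_rate * tree.predict(X)   # step 4: update predictions
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```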

Comparing AdaBoost and Gradient Boosting

  • Core Difference:
    • AdaBoost focuses on reweighting examples based on classification errors.
    • Gradient Boosting minimizes residual errors directly.
  • Training Process:
    • AdaBoost: Sequentially updates example weights.
    • Gradient Boosting: Fits residuals as a new prediction target.
  • Flexibility: Gradient Boosting is more versatile and widely used.
  • Examples:
    • AdaBoost: Effective with simple base models.
    • Gradient Boosting: Commonly used in frameworks like XGBoost and LightGBM.

Limitations and Practical Recommendations

Limitations of Bagging (Random Forests)

  • Computational Complexity:
    • Training multiple decision trees requires significant time and resources.
    • Predictions require aggregating outputs from all trees, which can be slow for large forests.
  • Overfitting Risks:
    • If hyperparameters (e.g., max depth, number of trees) are poorly tuned, overfitting can occur.
  • Interpretability:
    • While individual trees are interpretable, the ensemble as a whole is harder to explain.
  • Data Sensitivity:
    • Performance degrades when the training data contain substantial noise.

Limitations of Boosting (AdaBoost and Gradient Boosting)

  • Computational Complexity:
    • Sequential training is slower compared to parallel methods like bagging.
    • Boosting methods require more fine-tuning of hyperparameters (e.g., learning rate, number of iterations).
  • Overfitting Risks:
    • Boosting can overfit noisy data by focusing excessively on hard-to-learn examples.
  • Interpretability:
    • Boosted models, especially Gradient Boosting, are complex and difficult to interpret.
  • Imbalanced Datasets:
    • AdaBoost can struggle with imbalanced datasets, since the reweighting process may overemphasize a handful of hard (often minority-class) examples.

Practical Recommendations for Ensemble Methods

  • Start Simple: Use Random Forests as a baseline before exploring boosting techniques.
  • Optimize Hyperparameters: Leverage grid search or random search (see the example below) to fine-tune:
    • Number of trees
    • Learning rate (for boosting)
    • Max depth
  • Evaluate Properly: Use cross-validation or separate test sets to validate performance and avoid overfitting.
  • Combine with Feature Engineering: Ensemble methods perform better with well-engineered data.
  • Balance Trade-offs: Consider computational costs versus accuracy gains when choosing methods.
Figure: choosing between Random Forests and boosting methods.
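A minimal hyperparameter-tuning sketch with scikit-learn's GridSearchCV; the parameter grid and dataset are illustrative:

```python
# Grid search with cross-validation for a Random Forest (illustrative grid).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```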

Conclusion

Bagging vs. Boosting

Aspect              | Bagging                                    | Boosting
Core Approach       | Trains models independently, in parallel   | Trains models sequentially, leveraging errors
Error Reduction     | Reduces variance                           | Reduces bias
Model Combination   | Average or majority voting                 | Weighted sum or weighted voting
Risk of Overfitting | Lower with proper hyperparameters          | Higher if poorly tuned
Interpretability    | More interpretable (e.g., Random Forests)  | Less interpretable (e.g., Gradient Boosting)
Popular Algorithms  | Random Forests                             | AdaBoost, Gradient Boosting, XGBoost