Paris-Saclay University

Introduction to Machine Learning

Supervised Learning - Decision Trees & Evaluation

Introduction

Introduction to Decision Trees

  • Definition: A decision tree is a tree-like structure used for decision-making and predictive modeling.
  • Applications:
    • Classification tasks: Predicting discrete labels.
    • Regression tasks: Predicting continuous values.
  • Key Advantage: Intuitive and interpretable for humans.
  • Connection to Previous Topics:
    • Builds upon concepts of supervised learning from Session 2.
    • Introduces an alternative to linear models for non-linear problems.

Understanding Decision Tree Structure

  • Core Components:
    • Nodes: Decision points that test specific features (e.g., "Age > 30").
    • Edges: Possible outcomes from each decision, connecting nodes.
    • Root Node: The starting point, using the most informative feature.
    • Leaf Nodes: Final predictions or outcomes (e.g., class labels).
  • Visual Representation:
    • Hierarchical structure flowing from top to bottom.
    • Each level represents a different feature test.
    • Path from root to leaf shows decision sequence.
  • Practical Aspects:
    • Tools like Scikit-learn and Graphviz enable easy visualization.
    • Tree depth indicates model complexity.
    • Branch patterns reveal feature importance.
Figure: Decision tree structure.
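A minimal sketch of such a visualization with scikit-learn's plot_tree, using the Iris dataset purely for illustration (the dataset and depth limit are our choices, not part of the slides):

```python
# Fit a small tree and visualize it with scikit-learn's plot_tree.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

print("Tree depth:", clf.get_depth())                     # tree depth indicates model complexity
print("Feature importances:", clf.feature_importances_)   # which features drive the splits
```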

Splitting Criteria and Tree Construction

Key Metrics for Splitting

  • Entropy:
    • Measures the randomness or disorder in the data.
    • Formula: \( H = -\sum p_i \log_2 p_i \), where \( p_i \) is the proportion of class \( i \).
    • Low Entropy: Indicates homogeneous subsets (e.g., all samples in one class).
    • High Entropy: Indicates diverse subsets (e.g., equal distribution of classes).
  • Gini Impurity:
    • Measures the likelihood of incorrect classification for a randomly chosen element.
    • Formula: \( G = 1 - \sum p_i^2 \).
    • Low Gini Impurity: Indicates better separation of classes.
  • Information Gain:
    • Reduction in impurity after splitting.
    • Formula: \( IG = H(\text{parent}) - \sum_k \frac{N_k}{N} H(\text{child}_k) \), i.e., the parent's entropy minus the weighted average entropy of the children.
    • Goal: Maximize information gain to choose the best feature for splitting.
  • Example:
    • For Entropy:
      • Suppose two classes: 50% each → \( H = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \).
      • One class dominates: \( p = 1 \), \( H = 0 \) (pure subset).

    Gini is computed similarly but avoids logarithms.
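A small sketch of these three metrics in Python; the helper names entropy, gini, and information_gain are our own for illustration, not a library API:

```python
import numpy as np

def entropy(labels):
    """H = -sum_i p_i * log2(p_i), over the class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """G = 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent_labels, child_label_groups, impurity=entropy):
    """IG = impurity(parent) - weighted average impurity of the child subsets."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * impurity(c) for c in child_label_groups)
    return impurity(parent_labels) - weighted

# Two balanced classes: H = 1, G = 0.5.  A pure subset: H = 0, G = 0.
print(entropy(["A", "A", "B", "B"]), gini(["A", "A", "B", "B"]))
print(entropy(["A", "A", "A"]), gini(["A", "A", "A"]))
```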

Building a Decision Tree

  • Objective: Construct a tree that optimally separates data by:
    • Maximizing information gain at each split
    • Creating pure (homogeneous) subsets
    • Maintaining model simplicity
  • Construction Process:
    1. Initialize: Begin with complete dataset at root node
      • Calculate initial impurity (entropy or Gini)
      • Evaluate all features as potential first splits
    2. Split Selection:
      • For each feature, calculate impurity metrics for possible splits
      • Compute information gain for each potential split
      • Select feature with highest information gain
    3. Tree Growth:
      • Create child nodes based on best split
      • Recursively apply process to each child node
      • Continue until stopping conditions are met
  • Stopping Criteria:
    • Maximum depth reached
    • Minimum samples per node threshold
    • Pure subset achieved (all samples belong to same class)
    • No remaining valid splits
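An illustrative sketch of this recursive construction loop for categorical features, in the spirit of ID3. The function build_tree is our own illustration, reusing the entropy() and information_gain() helpers sketched above; real libraries implement more refined and much faster variants.

```python
# rows: list of dicts mapping feature name -> value; labels: list of class labels.
from collections import Counter

def build_tree(rows, labels, features, max_depth=5, min_samples=2, depth=0):
    majority = Counter(labels).most_common(1)[0][0]

    # Stopping criteria: pure subset, depth limit, too few samples, or no features left.
    if len(set(labels)) == 1 or depth >= max_depth or len(labels) < min_samples or not features:
        return majority  # leaf node: predict the majority class

    # Split selection: group labels by each feature's values and score the split.
    def gain(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return information_gain(labels, groups.values())

    best = max(features, key=gain)
    if gain(best) <= 0:
        return majority  # no remaining valid split

    # Tree growth: one child per value of the chosen feature, built recursively.
    node = {"feature": best, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [f for f in features if f != best],
            max_depth, min_samples, depth + 1)
    return node
```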

Comparing Entropy and Gini Impurity

  • Similarities:
    • Both measure impurity and guide the splitting process.
    • Lower values indicate better splits.
  • Differences:
    • Entropy uses logarithms and is more sensitive to class imbalance.
    • Gini is computationally simpler and often preferred in practice.
  • Practical Note: Both metrics often lead to similar splits in real-world data.

Visualizing Tree Construction

  • Root Node: Represents the entire dataset.
  • Branches: Represent splits based on feature values.
  • Leaf Nodes: Represent predictions or outcomes.
Figure: Weather decision tree.

Practical Comparison: Entropy vs. Gini Impurity

  • Key Differences:
    • Entropy: More sensitive to class imbalance, but computationally intensive (uses logarithms).
    • Gini Impurity: Faster to compute, often preferred for efficiency with large datasets.
  • Practical Implications:
    • Choose Gini Impurity for speed, especially on large datasets.
    • Consider Entropy when class proportions matter (e.g., highly imbalanced datasets).
  • Real-World Note:
    • In practice, both metrics often lead to similar splits.
    • The choice between them depends on the dataset size and computational constraints.
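In scikit-learn the choice between the two is a one-line change through the criterion parameter; a quick sketch comparing both on an illustrative dataset (dataset and depth are our choices):

```python
# Compare the two splitting criteria via DecisionTreeClassifier's `criterion` parameter.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=4, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{criterion:>7}: mean cross-validated accuracy = {scores.mean():.3f}")
```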

Example

Example: Weather Dataset

Dataset: a single feature, Weather (Sunny, Rainy, Overcast), and a binary label, Play (Yes or No).

  • Step 1: Calculate entropy of the root node.
  • Step 2: Split on "Weather" and calculate entropy for each subset.
  • Step 3: Compute information gain for the split.
  • Step 4: Choose the best split and continue recursively.

Outcome: Demonstrate the best split based on calculated information gain.

Example: Weather Dataset

We use a small dataset to demonstrate decision tree construction.

Weather    Play
Sunny      No
Sunny      No
Overcast   Yes
Rainy      Yes
Rainy      Yes
Rainy      No
Overcast   Yes
Sunny      No
Sunny      Yes
Rainy      Yes
Sunny      Yes
Overcast   Yes
Overcast   Yes
Rainy      No

Step 1: Root Node Entropy

Class distribution at the root node:

  • Yes: 9
  • No: 5

Entropy formula:

$H = -\sum_{i=1}^c p_i \log_2(p_i)$

Root node entropy:

$p_{\text{Yes}} = \frac{9}{14}, \quad p_{\text{No}} = \frac{5}{14}$

$H_{\text{root}} = -\left( \frac{9}{14} \log_2 \left( \frac{9}{14} \right) + \frac{5}{14} \log_2 \left( \frac{5}{14} \right) \right) \approx 0.940$

Step 2: Entropy for Subsets

Subset: Sunny

Weather    Play
Sunny      No
Sunny      No
Sunny      No
Sunny      Yes
Sunny      Yes

Entropy:

$p_{\text{Yes}} = \frac{2}{5}, \quad p_{\text{No}} = \frac{3}{5}$

$H_{\text{Sunny}} = -\left( \frac{3}{5} \log_2\left(\frac{3}{5}\right) + \frac{2}{5} \log_2\left(\frac{2}{5}\right) \right) \approx 0.971$

Subset: Overcast

Weather    Play
Overcast   Yes
Overcast   Yes
Overcast   Yes
Overcast   Yes

Entropy:

$p_{\text{Yes}} = 1, \quad p_{\text{No}} = 0$

$H_{\text{Overcast}} = 0$

Subset: Rainy

Weather    Play
Rainy      Yes
Rainy      Yes
Rainy      No
Rainy      Yes
Rainy      No

Entropy:

$p_{\text{Yes}} = \frac{3}{5}, \quad p_{\text{No}} = \frac{2}{5}$

$H_{\text{Rainy}} = -\left( \frac{3}{5} \log_2\left(\frac{3}{5}\right) + \frac{2}{5} \log_2\left(\frac{2}{5}\right) \right) \approx 0.971$

Step 3: Information Gain

Information Gain formula:

$IG = H_{\text{root}} - \sum_{k=1}^n \frac{N_k}{N} H_k$

For "Weather":

$IG_{\text{Weather}} = 0.940 - \left( \frac{5}{14} \cdot 0.971 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.971 \right)$

$IG_{\text{Weather}} = 0.940 - 0.694 \approx 0.246$

Step 4: Choosing the Best Split

Information gain for "Weather" is 0.246.

"Weather" is chosen as the root split.

Next Steps:

  • Repeat the process recursively for each subset (Sunny, Overcast, Rainy).
  • Stop when a stopping criterion is met (e.g., pure subset).

Conclusion

This example demonstrates:

  • Calculating entropy for each subset.
  • Computing information gain.
  • Choosing the best split based on information gain.
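A short sketch that reproduces these numbers with the entropy and information_gain helpers sketched earlier in this session:

```python
# Weather dataset from the table above (14 rows, 9 Yes / 5 No).
weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

h_root = entropy(play)                                    # ~0.940
subsets = {v: [p for w, p in zip(weather, play) if w == v]
           for v in ("Sunny", "Overcast", "Rainy")}
ig = information_gain(play, subsets.values())             # ~0.246

print(f"H(root) = {h_root:.3f}")
for value, labels in subsets.items():
    print(f"H({value}) = {entropy(labels):.3f}")
print(f"IG(Weather) = {ig:.3f}")
```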

Model Optimization and Tuning

Handling Numerical and Categorical Features

  • Categorical Features: Split by unique values or grouped categories.
  • Numerical Features: Split based on threshold values (e.g., \( x > 50 \)).
  • Decision trees can handle mixed data types, making them versatile.

Example: Splitting a dataset on "Age > 30" or "Job Type" for classification.
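Note that scikit-learn's tree implementation expects numeric inputs, so categorical columns are typically encoded first. A minimal sketch on a made-up toy dataset (the column names Age, JobType, and Approved are illustrative):

```python
# Encode a categorical column, pass the numerical one through, then fit a tree.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Age":      [22, 35, 47, 51, 29, 63],
    "JobType":  ["Student", "Engineer", "Engineer", "Manager", "Student", "Manager"],
    "Approved": ["No", "Yes", "Yes", "Yes", "No", "Yes"],
})

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["JobType"])],
    remainder="passthrough",  # "Age" passes through; thresholds like Age > 30 are learned by the tree
)

model = Pipeline([("prep", pre), ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))])
model.fit(df[["Age", "JobType"]], df["Approved"])
print(model.predict(pd.DataFrame({"Age": [40], "JobType": ["Engineer"]})))
```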

Hyperparameters in Decision Trees

  • Key Hyperparameters:
    • Max Depth: Maximum number of levels in the tree.
    • Min Samples Split: Minimum samples required to split a node.
    • Min Samples Leaf: Minimum samples required in a leaf node.
    • Max Features: Maximum number of features considered for a split.
  • Impact: Tuning these parameters affects the model's complexity and performance.
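These hyperparameters correspond directly to constructor arguments of scikit-learn's DecisionTreeClassifier; a minimal sketch (the specific values are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,            # maximum number of levels in the tree
    min_samples_split=10,   # minimum samples required to split a node
    min_samples_leaf=5,     # minimum samples required in a leaf node
    max_features="sqrt",    # number of features considered at each split
    random_state=0,
)
```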

Balancing Tree Depth and Minimum Samples

  • Impact of Depth:
    • Shallow Trees (Underfitting): Fail to capture important patterns in data, leading to low training and validation performance.
    • Deep Trees (Overfitting): Capture noise in the training data, reducing generalization to unseen data.
  • Impact of Minimum Samples:
    • Low Minimum Samples (Overfitting): Excessive splits, resulting in overly complex trees.
    • High Minimum Samples (Underfitting): Fewer splits, potentially missing critical details.
  • Optimization Strategies:
    • Start with default values and adjust incrementally based on performance.
    • Use cross-validation to test various combinations of tree depth and minimum samples.
    • Monitor training and validation metrics to ensure a good balance.
  • Key Insight: Optimal hyperparameters minimize the gap between training and validation performance.
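A sketch of how the training/validation gap can be monitored with scikit-learn's validation_curve while sweeping max_depth (the dataset is illustrative):

```python
# Sweep max_depth and compare mean training vs. cross-validated scores.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = list(range(1, 11))

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```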

Grid Search and Cross-Validation

  • Grid Search:
    • A systematic method for hyperparameter tuning.
    • Explores all combinations of predefined hyperparameter values.
    • Example: Testing combinations of maximum depth and minimum samples per leaf.
  • Cross-Validation:
    • Splits the dataset into training and validation subsets multiple times.
    • Averages performance across splits to reduce the risk of overfitting to a single validation set.
    • K-Fold Cross-Validation: Divides data into K subsets; each subset is used as a validation set once.
  • Combined Approach:
    • Grid search evaluates different hyperparameter combinations using cross-validation.
    • Ensures robust selection of hyperparameters based on consistent performance.
  • Tools:
    • Libraries like Scikit-learn offer built-in support for grid search with cross-validation.
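A minimal sketch of grid search combined with 5-fold cross-validation using scikit-learn's GridSearchCV (the parameter grid and dataset are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
}

# Every combination in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```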

Balancing Complexity and Performance

  • Trade-Off:
    • Simple models are easier to interpret but may underfit the data.
    • Complex models may capture more patterns but risk overfitting.
  • Techniques for Balance:
    • Pruning: Reduces unnecessary complexity by trimming branches.
    • Regularization: Sets limits on parameters such as maximum depth or minimum samples per leaf.
    • Cross-Validation: Ensures performance generalizes to unseen data.
  • Evaluating Balance:
    • Compare training and validation performance:
      • Overfitting: High training accuracy, low validation accuracy.
      • Underfitting: Low accuracy on both training and validation data.
  • Key Insight:
    • Optimal complexity minimizes the gap between training and validation performance.
    • Combining hyperparameter tuning with validation data ensures better trade-offs.

Overfitting and Pruning

Overfitting in Decision Trees

  • Definition: A model fits the training data too well, capturing noise and losing generalization.
  • Indicators:
    • Perfect accuracy on training data but poor performance on validation/test data.
    • Excessively deep trees with many splits.
  • Solutions:
    • Tree pruning (pre-pruning or post-pruning).
    • Limiting the depth of the tree.
    • Setting a minimum number of samples per leaf.

Tree Pruning Strategies

  • Pre-Pruning: Stop splitting early based on criteria like:
    • Minimum number of samples in a node.
    • Minimum information gain for a split.
  • Post-Pruning: Remove branches from a fully grown tree based on:
    • Validation set performance.
    • Simplifying the tree structure.
  • Objective: Improve generalization and prevent overfitting.
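One concrete form of post-pruning available in scikit-learn is minimal cost-complexity pruning. A sketch that grows a full tree, extracts candidate ccp_alpha values, and selects one on a held-out validation set (dataset and split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"Best ccp_alpha = {best_alpha:.5f}, validation accuracy = {best_score:.3f}")
```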

Visualizing Overfitting and Underfitting

  • Overfitting: Deep tree capturing noise, complex decision boundaries.
  • Underfitting: Shallow tree with simplistic decision boundaries.
  • Well-tuned Tree: Balanced tree with generalizable patterns.
Figure: Decision boundaries for overfitted, underfitted, and well-tuned decision trees.

Validating Pruning Results

  • Purpose of Validation:
    • Ensure the pruned tree generalizes well to unseen data.
    • Prevent overfitting caused by unnecessary complexity in the tree.
  • Steps to Validate Pruning:
    1. Split Data:
      • Use a validation set distinct from the training set.
    2. Evaluate Pre-Pruned Tree:
      • Measure performance on the validation set (e.g., accuracy, MSE).
    3. Apply Pruning:
      • Trim branches or set stopping criteria (e.g., maximum depth).
    4. Re-Evaluate Pruned Tree:
      • Compare performance before and after pruning using the validation set.
    5. Select the Best Model:
      • Choose the pruned tree that balances simplicity and validation performance.
  • Metrics for Validation:
    • Classification: Accuracy, Precision, Recall, F1-score.
    • Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE).
  • Cross-Validation:
    • Use cross-validation to validate pruning on multiple splits of the data.
    • Ensures pruning decisions are robust across different subsets.
  • Visualization:
    • Compare tree structures before and after pruning to highlight simplified branches.
    • Use diagrams to demonstrate reduced complexity.
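A sketch of this workflow: evaluate an unpruned tree, apply pruning constraints (here via max_depth and min_samples_leaf), and compare both on the same validation split (dataset and thresholds are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)

# Compare complexity and validation performance before and after pruning.
for name, tree in [("full tree", full), ("pruned tree", pruned)]:
    acc = accuracy_score(y_val, tree.predict(X_val))
    print(f"{name:>11}: depth={tree.get_depth():2d}, leaves={tree.get_n_leaves():3d}, "
          f"validation accuracy={acc:.3f}")
```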

Conclusion

Key Takeaways

  • Decision trees are a versatile and interpretable model for both classification and regression tasks.
  • Splitting Criteria: Metrics like entropy, information gain, and Gini impurity guide tree construction by optimizing splits.
  • Tree Structure: Nodes and branches capture decision rules, while leaf nodes represent outcomes.
  • Overfitting and Pruning: Techniques like depth control and pruning are essential to balance complexity and generalization.
  • Hyperparameter Tuning: Parameters such as maximum depth and minimum samples significantly impact model performance and interpretability.
  • Evaluation: Tailor metrics (e.g., accuracy for classification, MSE for regression) to the task for robust model assessment.

When to Use Decision Trees

  • Advantages:
    • Highly interpretable, making them suitable for applications requiring transparency (e.g., healthcare).
    • Handles both numerical and categorical features without extensive preprocessing.
    • Automatic feature selection highlights important variables for decision-making.
  • Limitations:
    • Sensitivity to small changes in data can lead to instability in the tree structure.
    • Prone to overfitting without proper pruning or depth control.
    • Non-smooth, axis-aligned decision boundaries may struggle with complex patterns in data.
  • Best Use Cases:
    • Small-to-medium-sized datasets where interpretability and transparency are key.
    • Applications requiring clear decision rules (e.g., compliance, customer segmentation).