Gini is computed similarly but avoids logarithms.
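As a minimal sketch of that remark, assuming the standard definition $G = 1 - \sum_{i=1}^c p_i^2$ (the formula itself is not spelled out here), Gini impurity can be computed from class counts without any logarithm. The function name `gini` is illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

# A pure node has impurity 0; a balanced two-class node has impurity 0.5.
print(gini(["Yes", "Yes", "Yes"]))       # 0.0
print(gini(["Yes", "No", "Yes", "No"]))  # 0.5
```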
Dataset feature: Weather (Sunny, Rainy, Overcast). Label: Play (Yes or No).
Outcome: Demonstrate the best split based on calculated information gain.
We use a small dataset to demonstrate decision tree construction.
Weather | Play |
---|---|
Sunny | No |
Sunny | No |
Overcast | Yes |
Rainy | Yes |
Rainy | Yes |
Rainy | No |
Overcast | Yes |
Sunny | No |
Sunny | Yes |
Rainy | Yes |
Sunny | Yes |
Overcast | Yes |
Overcast | Yes |
Rainy | No |
Class distribution at the root node: 9 Yes, 5 No (14 samples in total).
Entropy formula:
$H = -\sum_{i=1}^c p_i \log_2(p_i)$
where $c$ is the number of classes and $p_i$ is the proportion of samples belonging to class $i$.
Root node entropy:
$p_{\text{Yes}} = \frac{9}{14}, \, p_{\text{No}} = \frac{5}{14}$ $H_{\text{root}} = -\left( \frac{9}{14} \log_2 \left( \frac{9}{14} \right) + \frac{5}{14} \log_2 \left( \frac{5}{14} \right) \right)$ $H_{\text{root}} \approx 0.940$
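The root entropy can be checked numerically. The sketch below copies the Play column from the table above; the helper name `entropy` is an assumption made for illustration, and it uses the equivalent form $\sum_i p_i \log_2(1/p_i)$.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; p * log2(1/p) is equivalent to -p * log2(p)."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

# Play column of the 14-row dataset, in table order.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

h_root = entropy(play)
print(f"{h_root:.3f}")  # 0.940
```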
Weather | Play |
---|---|
Sunny | No |
Sunny | No |
Sunny | No |
Sunny | Yes |
Sunny | Yes |
Entropy of the Sunny subset:
$p_{\text{Yes}} = \frac{2}{5}, \, p_{\text{No}} = \frac{3}{5}$ $H_{\text{Sunny}} = -\left( \frac{2}{5} \log_2\left(\frac{2}{5}\right) + \frac{3}{5} \log_2\left(\frac{3}{5}\right) \right) \approx 0.971$
Weather | Play |
---|---|
Overcast | Yes |
Overcast | Yes |
Overcast | Yes |
Overcast | Yes |
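Entropy of the Overcast subset:
$p_{\text{Yes}} = 1, \, p_{\text{No}} = 0$ $H_{\text{Overcast}} = -\left( 1 \cdot \log_2(1) \right) = 0$ (using the convention $0 \log_2 0 = 0$; a pure node has zero entropy)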
Weather | Play |
---|---|
Rainy | Yes |
Rainy | Yes |
Rainy | No |
Rainy | Yes |
Rainy | No |
Entropy of the Rainy subset:
$p_{\text{Yes}} = \frac{3}{5}, \, p_{\text{No}} = \frac{2}{5}$ $H_{\text{Rainy}} = -\left( \frac{3}{5} \log_2\left(\frac{3}{5}\right) + \frac{2}{5} \log_2\left(\frac{2}{5}\right) \right) \approx 0.971$
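The three subset entropies can be reproduced by grouping the rows by Weather value. This is a sketch under the same assumptions as the formulas above; `rows`, `subsets`, and `entropy` are illustrative names, not part of any library.

```python
import math
from collections import Counter, defaultdict

# (Weather, Play) pairs copied from the 14-row table, in order.
rows = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rainy", "No"),
]

def entropy(labels):
    """Shannon entropy in bits; classes with zero count never appear in the sum."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

# Group the Play labels by Weather value.
subsets = defaultdict(list)
for weather, play in rows:
    subsets[weather].append(play)

subset_entropy = {weather: entropy(labels) for weather, labels in subsets.items()}
for weather, h in subset_entropy.items():
    print(f"{weather}: {h:.3f}")
```

Sunny and Rainy come out at roughly 0.971 bits each, while the pure Overcast subset is exactly 0.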
Information Gain formula:
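$IG = H_{\text{root}} - \sum_{k=1}^n \frac{N_k}{N} H_k$
where $n$ is the number of subsets produced by the split, $N_k$ is the number of samples in subset $k$, $N$ is the total number of samples, and $H_k$ is the entropy of subset $k$.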
For "Weather":
$IG_{\text{Weather}} = 0.940 - \left( \frac{5}{14} \cdot 0.971 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.971 \right)$ $IG_{\text{Weather}} = 0.940 - 0.694 \approx 0.246$
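The full calculation can be sketched end to end as follows. The function `information_gain` is a hypothetical helper written for this example; it works at full floating-point precision rather than with the rounded intermediate values used in the hand calculation.

```python
import math
from collections import Counter, defaultdict

# (Weather, Play) pairs copied from the 14-row table, in order.
rows = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rainy", "No"),
]

def entropy(labels):
    """Shannon entropy in bits."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows):
    """Root entropy minus the size-weighted entropy of each feature-value subset."""
    labels = [play for _, play in rows]
    subsets = defaultdict(list)
    for weather, play in rows:
        subsets[weather].append(play)
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

ig_weather = information_gain(rows)
print(f"{ig_weather:.3f}")  # 0.247
```

The printed value is 0.247 because the unrounded gain is about 0.2467; subtracting the rounded intermediates 0.940 and 0.694 gives 0.246.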
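Information gain for "Weather" is approximately 0.246 (0.247 if the intermediate values are not rounded).
"Weather" is the only candidate feature in this dataset and its gain is positive, so it is chosen as the root split.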
Next Steps:
This example demonstrates:
Example: Splitting a dataset on "Age > 30" or "Job Type" for classification.