ML learning notes (cmu10601) Lec3

6.0~7.7 min

lecture 3

pre-reading: https://www.cs.cmu.edu/~10315-s25/notes/10315_S25_Notes_Decision_Trees.pdf
lec ppt: https://www.cs.cmu.edu/~10315-s25/lectures/10315_S25_Lecture_3_Decision_Trees_inked.pdf

pre-reading content

Note: the PDF gives some examples of decision trees, but does not explain their training pipeline

This document introduces Decision Trees as a supervised machine learning model that maps input feature vectors to predicted outputs by traversing from the root node to a leaf node. It covers inference (prediction), evaluation via the classification error rate, and tree construction by recursive splitting, starting from "decision stumps" and using majority voting at the leaves.

Lec content

The lecture introduces Decision Trees as an interpretable classification method that splits data recursively based on input features. It highlights that using Classification Error is often insufficient for choosing splits and instead establishes Information Theory concepts—specifically Entropy and Mutual Information—as the standard metrics for constructing trees (ID3 algorithm).

1. Core Concepts

  • Decision Tree: A hierarchical model that predicts class labels by recursively splitting data based on input features.
  • Decision Boundary: Partitions in the input space separating regions classified as different classes.
  • Decision Stump: A one-level decision tree (splits on a single attribute).
  • Prediction Method: Leaves predict the class via Majority Vote (most frequent class in that leaf).
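
A minimal sketch of inference on a decision stump, assuming a hypothetical nested-dict tree representation (not from the lecture): internal nodes store the feature to split on and one child per feature value; each leaf stores the majority-vote label of the training points that reached it.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label (ties broken arbitrarily by Counter)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical decision stump on a single categorical feature "outlook".
# Leaf labels are fixed at training time by majority vote over the
# training examples routed to that leaf.
stump = {
    "feature": "outlook",
    "children": {
        "sunny": {"leaf": majority_vote(["no", "no", "yes"])},   # -> "no"
        "rain":  {"leaf": majority_vote(["yes", "yes", "no"])},  # -> "yes"
    },
}

def predict(tree, x):
    """Walk from the root to a leaf, following the branch for x's feature value."""
    node = tree
    while "leaf" not in node:
        node = node["children"][x[node["feature"]]]
    return node["leaf"]

print(predict(stump, {"outlook": "sunny"}))  # -> "no"
```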

2. Splitting Criteria

The algorithm must decide which feature to split on.

  • Classification Error Rate: \text{Error} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(y^{(i)} \neq \hat{y}^{(i)}).
    • Limitation: Often fails to distinguish between splits (ties) and does not capture "purity" improvements effectively.
  • Entropy (H(Y)): Measures uncertainty or impurity. Higher entropy = more uncertain (mixed classes).
    • Formula: H(Y) = -\sum_{y} P(Y=y) \log_2 P(Y=y)
  • Conditional Entropy (H(Y|X)): The expected entropy of Y after splitting on feature X (weighted average of children's entropy).
    • Formula: H(Y|X) = \sum_{x} P(X=x) H(Y|X=x)
  • Mutual Information (I(Y;X)): (Also known as Information Gain). The reduction in uncertainty about Y provided by knowing X.
    • Formula: I(Y;X) = H(Y) - H(Y|X)
    • Goal: Maximize Mutual Information at every split.
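
A minimal sketch of these three quantities on a made-up binary example, computed from empirical frequencies (the helper names entropy, conditional_entropy, and mutual_information are my own, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) log2 P(Y=y), using empirical frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y|X) = sum_x P(X=x) H(Y|X=x): each child's entropy, weighted by its size."""
    n = len(labels)
    by_value = {}
    for x, y in zip(feature_values, labels):
        by_value.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def mutual_information(feature_values, labels):
    """I(Y;X) = H(Y) - H(Y|X): reduction in uncertainty about Y from knowing X."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Toy data (made up): a perfectly informative feature gives I(Y;X) = H(Y) = 1 bit.
y = ["+", "+", "-", "-"]
x = ["a", "a", "b", "b"]
print(entropy(y), mutual_information(x, y))  # 1.0 and 1.0
```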

3. Tree Building Algorithm (ID3)

  • Approach: Greedy algorithm. It makes the best local decision (split) at each step without considering future steps.
  • Recursive Process (BuildTree):
    1. Check Base Case (Stopping Criteria).
    2. Find attribute X that maximizes Mutual Information.
    3. Split data D into subsets based on X.
    4. Recursively call BuildTree on subsets.
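
A compact sketch of this greedy recursion, reusing the majority_vote and mutual_information helpers from the blocks above and assuming each datapoint is a dict of categorical features. Only the pure-leaf base case is shown here (depth and leaf-size limits appear in the next section); this is an illustration, not the lecture's exact pseudocode.

```python
def build_tree(data, labels, features):
    """ID3-style recursion: pick the feature with maximal mutual information,
    split the data on it, and recurse on each child subset."""
    # Base case: pure leaf (or no features left) -> majority-vote label.
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority_vote(labels)}

    # Greedy step: feature that maximizes I(Y; X) on this subset.
    best = max(features,
               key=lambda f: mutual_information([x[f] for x in data], labels))

    children = {}
    for value in set(x[best] for x in data):
        idx = [i for i, x in enumerate(data) if x[best] == value]
        children[value] = build_tree([data[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != best])
    return {"feature": best, "children": children}
```

The returned nested dict has the same shape as the stump in Section 1, so the predict sketch there can be used for inference on it.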

4. Stopping Criteria & Model Selection

To prevent overfitting (leaves becoming too specific to training data), stop when:

  • Leaves are "pure" (all labels in the leaf are the same).
  • Tree reaches a maximum depth.
  • Leaf contains fewer than a minimum number of datapoints.
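
In practice these stopping criteria are exposed as hyperparameters. A quick sketch with scikit-learn's DecisionTreeClassifier, which also supports the entropy criterion but uses CART-style splitting rather than ID3 (toy data below is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: two integer features, binary label.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 0, 0, 1, 1, 1]

# Stopping criteria map to hyperparameters: maximum depth and minimum leaf size.
clf = DecisionTreeClassifier(criterion="entropy",  # split by information gain
                             max_depth=2,          # cap on tree depth
                             min_samples_leaf=1)   # smallest allowed leaf
clf.fit(X, y)
print(clf.predict([[2, 1]]))  # -> [1]
```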

5. Pros & Cons

  • Pros: Interpretable (human-readable), efficient, handles both categorical and real-valued features.
  • Cons: Greedy (not guaranteed to find the globally optimal tree), prone to overfitting without stopping criteria or pruning.