Decision Trees
Strictly speaking, a Decision Tree (DT) is not a machine learning algorithm itself, but rather a decision tool; various machine learning algorithms exist for constructing one.
How a Decision Tree Works¶
In the testing phase, the tree is used to classify an input. A series of questions is arranged in a tree structure, and the process starts at the root: each question is asked in turn, its answer is drawn from the input, and that answer determines which question comes next. Here is an example tree:
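The traversal described above can be sketched as follows. This is a minimal illustration, not code from the original text; the node format (internal nodes holding a question and children keyed by answer, leaves holding a label) is a hypothetical representation.

```python
def classify(node, example):
    """Walk from the root to a leaf, following the example's answers."""
    while isinstance(node, dict):           # internal node: ask its question
        answer = example[node["question"]]
        node = node["children"][answer]     # follow the matching branch
    return node                             # leaf: the predicted label

# Tiny example tree: first ask about price; if it is high, ask about reliability.
tree = {
    "question": "price",
    "children": {
        "low": "yes",
        "high": {
            "question": "reliable",
            "children": {"yes": "yes", "no": "no"},
        },
    },
}

print(classify(tree, {"price": "high", "reliable": "no"}))  # -> no
```

The dictionary representation keeps the sketch short; a real implementation would typically use a node class.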

Each path through the tree represents a logical statement in which every question serves as a predicate in a boolean formula.
These trees are highly expressive (capable of representing any boolean statement), compact (usually smaller than a truth table), and intuitive for human users.
Building the Tree¶
To generate a tree, you start with data, labels (e.g., yes/no or 0/1), and a set of possible predicates. Every example in the training set has a value for every feature, e.g. reliability, price, color, speed, and RAM.
The Minimal Tree Concept¶
While a trivial method would be to build a path for every individual data item, this results in a massive tree that fails to generalize to new data. Instead, the goal is to find the minimal tree that still characterizes the data.
This is achieved by placing the “best” predicate at the root. A perfect predicate is one that divides the data set into pure groups of just “yes” or just “no”.
Measuring Purity with Entropy¶
To measure the purity of a set, we use a concept from information theory called Entropy. For a set $S$ it is calculated as:

$$H(S) = -\sum_{c \in C} p_c \log_2 p_c$$

Where:

- $C$ is the set of tree outputs (e.g., yes and no)
- $p_c$ is the proportion of the data labeled $c$

If a set is evenly divided, the entropy is 1; if the set is perfectly pure, the entropy is 0.
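The entropy computation can be sketched in a few lines. This is an illustrative implementation, assuming labels are given as a plain list; it works for any number of classes, not just binary yes/no.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes c of p_c * log2(p_c)."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )

print(entropy(["yes", "no", "yes", "no"]))  # evenly divided -> 1.0
```

A perfectly pure set (all labels identical) yields an entropy of 0, matching the statement above.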
Information Gain¶
Once purity is measurable, we can calculate the Information Gain ($IG$) of splitting a set $S$ based on a predicate $P$:

$$IG(S, P) = H(S) - \sum_{i} \frac{|S_i|}{|S|} H(S_i)$$

where the $S_i$ are the subsets into which $P$ splits $S$. This represents the entropy of $S$ minus the weighted sum of the entropies of the resulting subsets. We select the predicate that yields the highest information gain and repeat the process for each split set until a branch is pure. This ultimately determines the order of questions and the shape of the tree.
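The greedy selection step can be sketched as follows. This is an illustrative implementation under simple assumptions: examples are dicts, predicates are plain functions, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, predicate):
    """Entropy of the whole set minus the weighted entropies of the subsets."""
    subsets = defaultdict(list)
    for example, label in zip(examples, labels):
        subsets[predicate(example)].append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

def best_predicate(examples, labels, predicates):
    """Pick the predicate with the highest information gain."""
    return max(predicates, key=lambda p: information_gain(examples, labels, p))

# A perfect split drives both subsets to entropy 0, so the gain equals H(S).
data = [{"price": "low"}, {"price": "low"}, {"price": "high"}, {"price": "high"}]
y = ["yes", "yes", "no", "no"]
is_cheap = lambda e: e["price"] == "low"
print(information_gain(data, y, is_cheap))  # -> 1.0
```

Calling `best_predicate` at each node, then recursing on the resulting subsets, is the core of classic tree-growing algorithms such as ID3.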
Potential Challenges¶
Specific Predicates: Attributes like “date” may split data perfectly but offer no generalization power.
Noise: Growing a perfect tree can be detrimental if the data contains noise.
Continuous Values: Continuous predicates require additional processing to determine split points.
Missing Values: Handling data points with missing values requires specific strategies.
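For the continuous-values challenge, one common strategy (not spelled out in the text above, so this is a sketch of a standard approach) is to try a threshold at each midpoint between adjacent sorted values and keep the one with the highest information gain; all names here are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_at(values, labels, threshold):
    """Information gain of splitting at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    weighted = sum(len(s) / len(labels) * entropy(s) for s in (left, right) if s)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Candidate thresholds: midpoints between consecutive distinct values."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(candidates, key=lambda t: gain_at(values, labels, t))

print(best_threshold([1.0, 2.0, 8.0, 9.0], ["no", "no", "yes", "yes"]))  # -> 5.0
```

This turns a continuous attribute into a family of binary predicates (one per candidate threshold), which then compete for information gain like any other predicate.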