Quick recap: Entropy
In this dataset, you want to classify a given animal into one of the three classes. Which feature would you use?
Height < 90cm or
Weight < 120kg?
You choose the feature that minimizes the weighted entropy of the child nodes relative to the root node, so you go with the option on the left in the figure below.
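As a quick reminder, the entropy of a node with class proportions p₁, …, pₖ is H = -(p₁ log₂ p₁ + … + pₖ log₂ pₖ), and a candidate split is scored by the size-weighted average of its children's entropies; the split with the lowest weighted entropy (equivalently, the highest information gain) wins.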
Gini impurity introduction
Gini impurity measures how often a randomly chosen element from the dataset would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the node.
If all data points in a node belong to the same class, the Gini impurity is 0. Such nodes are called "pure leaves".
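Formally, for a node whose class proportions are p₁, …, pₖ, the Gini impurity is 1 - (p₁² + p₂² + … + pₖ²), i.e., one minus the sum of squared class proportions; this is exactly what the from-scratch code at the end of this post computes.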
Why is Gini impurity used?
It’s a measure of node impurity used in decision trees to evaluate candidate splits.
Lower Gini impurity means a more homogeneous (purer) node.
Comparison with other metrics:
Contrast Gini with Entropy (used in information gain).
For example, Gini is computationally cheaper than entropy because it avoids computing logarithms.
Properties:
Perfectly pure node: Gini = 0.
Maximally mixed node (classes in equal proportion): Gini is at its maximum, 0.5 for two classes (see the quick check below).
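A quick check for a two-class node with class proportions p and 1 - p: Gini(p) = 1 - p² - (1 - p)². A pure node (p = 0 or p = 1) gives Gini = 0, while a 50/50 node gives Gini = 1 - 0.25 - 0.25 = 0.5, the maximum for two classes (entropy peaks at 1 bit for the same node).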
Numerical example
Consider the dataset below, with one feature that takes two values (red and blue) and two classes (yes or no): three red rows (two yes, one no) and three blue rows (one yes, two no), the same data used in the code example later in this post. Calculate the Gini impurity of the parent node and the weighted impurity of the child nodes.
Step 1: Initial node impurity
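The parent node holds all six rows: three yes and three no, so p(yes) = p(no) = 0.5 and Gini(parent) = 1 - 0.5² - 0.5² = 0.5.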
Step 2: Split the data
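Splitting on color gives a red child with two yes and one no, and a blue child with one yes and two no:
Gini(red) = 1 - (2/3)² - (1/3)² = 4/9 ≈ 0.444
Gini(blue) = 1 - (1/3)² - (2/3)² = 4/9 ≈ 0.444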
Step 3: Weighted Gini impurity
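Each child holds 3 of the 6 rows, so the weighted Gini impurity of the split is (3/6) × 0.444 + (3/6) × 0.444 ≈ 0.444.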
Step 4: Interpret results
After splitting, the weighted Gini impurity is reduced from 0.5 to roughly 0.444, so the split on color produces purer child nodes than the parent.
Gini impurity vs. entropy: when to use which?
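There is no single right answer, but a rough, hand-rolled comparison makes the trade-off concrete. The sketch below is not part of the original example (gini_two_class and entropy_two_class are hypothetical helper names); it evaluates both metrics for a two-class node as the class proportion varies. Both are zero for a pure node and peak at a 50/50 mix, they rank candidate splits almost identically in practice, and Gini avoids the logarithm, which makes it slightly cheaper to compute.

# Hypothetical helpers: impurity of a two-class node with proportions p and 1 - p
gini_two_class(p) = 1 - p^2 - (1 - p)^2
entropy_two_class(p) = p in (0.0, 1.0) ? 0.0 : -(p * log2(p) + (1 - p) * log2(1 - p))

# Compare the two measures across a range of class proportions
for p in 0.0:0.25:1.0
    println("p = ", p,
            "  Gini = ", round(gini_two_class(p), digits = 3),
            "  Entropy = ", round(entropy_two_class(p), digits = 3))
end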
Coding a node split and Gini impurity from scratch
Function for Gini Impurity
# Define a function to calculate Gini Impurity
function gini_impurity(data)
    total = length(data)
    if total == 0
        return 0.0
    end
    # Count occurrences of each label
    label_counts = Dict{String, Int}()
    for row in data
        label = row[:label]
        label_counts[label] = get(label_counts, label, 0) + 1
    end
    # Calculate Gini Impurity
    gini = 1.0
    for count in values(label_counts)
        prob = count / total
        gini -= prob^2
    end
    return gini
end
Function for Splitting the Dataset
# Function to split data based on a feature value
function split_data(data, feature, value)
    left = [row for row in data if row[feature] == value]
    right = [row for row in data if row[feature] != value]
    return left, right
end
Weighted Gini Impurity Calculation
# Function to calculate weighted Gini Impurity for a split
function weighted_gini(data, splits)
    total = length(data)
    weighted_sum = 0.0
    for split in splits
        split_gini = gini_impurity(split)
        weighted_sum += (length(split) / total) * split_gini
    end
    return weighted_sum
end
Example Usage
# Sample dataset
data = [
    Dict(:color => "Red", :label => "Yes"),
    Dict(:color => "Red", :label => "Yes"),
    Dict(:color => "Red", :label => "No"),
    Dict(:color => "Blue", :label => "No"),
    Dict(:color => "Blue", :label => "No"),
    Dict(:color => "Blue", :label => "Yes")
]

# Calculate initial Gini Impurity
println("Initial Gini Impurity: ", gini_impurity(data))

# Split the data by the feature `color` with value `Red`
red_split, blue_split = split_data(data, :color, "Red")
println("Red Split: ", red_split)
println("Blue Split: ", blue_split)

# Calculate Gini Impurity after the split
split_gini = weighted_gini(data, [red_split, blue_split])
println("Gini Impurity after split: ", split_gini)