Using Genetic Algorithms to Build Decision Trees


CS 478 Final Project
Paras Shelawala
Steven Yam

Introduction

Our motivation for this project was to explore a different method of building decision trees superior to the ID3 algorithm. ID3 has three major shortcomings which we can avoid using a genetic algorithm.

ID3 greedily builds a tree by maximizing the local information gain and recursively building a tree. This algorithm has been shown to work quite well but we would like to create something better. We have looked at several articles which discuss the possibility of using a genetic algorithm to build trees and they all explain how it is possible but none of them explicitly explained how to accomplish the blending of the trees nor claimed to have actually implemented such an algorithm.

Problem Definition

Given a set of examples, specifically binary properties about some real world system, and their respective outcomes, we must break the examples into a training set and a validation set, and build a tree based on the training set which performs well at classifying the examples in the validation set. The output shall be in the form of the finished tree and its average fitness.

The Learning System

The problem we are addressing is determining if a given image is an advertisement. We have a large set of binary attributes collected about each sample image, including whether or not it is an ad. We want to build a tree which can examine the attributes of an image and determine whether or not it is an advertisement. A tree system is ideal for this problem because of its binary properties. Once the tree is generated, we can determine whether or not an image is an ad by simply executing the example on the tree which requires only as many steps as the tree depth.

Target Function

We have implemented a genetic algorithm which takes data collected about randomly selected images from the web and classifies them as either an ad or non-ad. The data set contains 3279 examples about the randomly selected images, and each example consists of 1555 different binary attributes and an element indicating whether the image is an ad.

Given the input data, we randomly divide the examples into a training set and a validation set. We then build a pool of random trees and randomly blend them to create a new set of trees until the pool size doubles. Using the training data we compute the fitness of each tree in the gene pool and dispose of the weaker trees. With some probability, we give some of the trees a ‘second chance’ by not deleting them even though their fitness is low, as their genes might be very good.

Representation

Our input is merely a large set of examples; we store these examples in a vector of arrays, each array the size of the number of attributes. The results are stored in a separate vector such that their indices are aligned with those of their respective examples.

More importantly, our trees are represented as nodes with two children unless they are leaves. We chose this representation because of its simplicity over dependency matrices when blending sub-trees. The gene pool is merely a vector of pointers to the root of each tree in the pool.

System Structure

To choose the best trees, we compute the fitness of each tree in the system and delete the weaker ones with the exception of those which were given a ‘second chance’. To compute the fitness of a tree, we merely count the number of correctly classified examples when executed on the decision tree.

The Learning Algorithm

Our system learns by finding good decision trees and blending them together in hopes of generating even better trees. Initially we start with a gene pool of n random trees. We then pair off the trees in the pool and create a new generation of trees by blending the trees and adding them to the pool. We blend trees by finding a common node in the two trees we which to blend and swap the sub-trees under each respective node this always results in a valid (though not always efficient) tree, as the tree may contain the same node twice along one path between the and a leaf. Then with probability p, we make selected trees immune to the pruning algorithm. Once a specified number of the trees have been made immune to the pruning algorithm, we delete the weakest non-immune trees. We repeat this process until only one tree remains or we hit a desired pool size such that we can pick the tree which best fits the training and validation sets.

Randomly create n trees

while(numTrees>1 and numTrees>desiredSize)

{

- Keep a weaker tree

Else,

- Delete the weaker trees as

determined by the training

examples

}

Using the validation set, select the tree that best classifies the training and validation sets.

Improvements / Modifications

In the descriptions of the genetic algorithm to generate decision trees we did not see any concept of a shrinking gene pool. We decided to add this feature to provide an indicator of when the algorithm is done and also a means to dispose of inherently bad genetic data, which we completely eliminate instead of spreading throughout the entire gene pool. The shrinking gene pool also allows us to limit the expansion of blended trees. This means we can add more initial trees (more starting genetic material) and only increase the processing time by n log n, where n is the number of initial trees.

Experiments

Our experiments are based on the performance of a tree generated by our learning system. We varied the parameters of our system and collected data about how well the resulting tree performed on our training and validation sets.

Data Sets

The initial set of data is from data collected about randomly selected images from the web and attributes about each image. As we parsed though the data file, we insert each example either into a training or validation set based on a user-preset probability.

Learning

We ran our experiments until the number of trees was reduced to one or it was reduced to a fixed, preset constant. If we stopped at the constant, we would use both the training and validation sets to select the end tree.

Testing / Evaluation

Using the same data, Kushmerick was able to get 97% accuracy using his AdEater (http://www.cs.ucd.ie/staff/nick/research/research/ae/) which uses the used the C4.5rules machine learning algorithm. Using the whole dataset as the training data, we achieved an accuracy of 93%. (Kuchmerick actually had three more non-binary attributes that our system could not represent). Our results are comparable to AdEater, and probably builds its set of rules faster.

Results

We experimented by varying the survival rate, the initial number of trees, pruning cutoff and size of the validation set and seeing their effect on performance.

Higher survival rates seemed to slightly increase the performance but significantly slowed down the computation

Increasing the initial number of trees increased performance nominally.

Stopping the pruning algorithm earlier hurts training data performance slightly but significantly improves validation performance and overall performance.

Large validation sets result in poor performance on the training data and large training sets results in poor performance in on the validation sets. This is logical because the more training examples used, the better the end decision tree is going to fit the training data and the worse it fits the validation set.

Data Analysis

From the results we gathered, we find the resulting tree classifies the training data correctly 87% to 93% of the time and classifies the validation set correctly 88% to 92% of the time. We have such a good result on the training data because we cut off the pruning while there are many trees, and select the tree which best fits the validation set.

Summary

We have found that using a genetic algorithm to generate decision trees is an extremely promising alternative to ID3. We believe that further research in building context sensitive decision trees will provide us with a new powerful, and computationally tractable means of generating a decision tree.

Future work

The current method is still quite greedy but this is only for the sake of computational tractability. As better hardware becomes available, we can create trees with greater depth, larger gene pools and even permit an expanding gene pool.

Because our trees are very limited in depth, (number of nodes doubles for every time the depth is increased by one) each classification of an example is limited to only considering only the depth number of attributes to making a decision. This can be overcome by creating more complex and expressive nodes. For example, we can allow each node to have more than two sub-trees or have them contain Boolean expressions instead of a single decision (it is sunny outside and it is not raining).

Works Cited

Sample Data for Experiments:

Kushmerick, Nicholas, "This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occuring in the URL, the image's URL and alt text, the anchor text, and words occuring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad")." http://www.ics.uci.edu/~mlearn/MLSummary.html, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/internet_ads/

N. Kushmerick (1999). "Learning to remove Internet advertisements", 3rd Int Conf Autonomous Agents. http://www.cs.ucd.ie/staff/nick/research/download/kushmerick-aa99.ps.gz