CS 540 HW #1: Learning Decision Trees

ECS170

Homework Assignment

Winter 2003

ECS 170 HW #4b:

Decision Tree Learning

Assigned: 18 February 2003

Due: 25 February 2003

Do only Problem 1. This problem can be done using paper and pencil. If you would rather write a program to do it, that is OK with me, but this is NOT a programming assignment.

Problem 1 - Decision Trees (25 points)

Decision tree learning is a form of supervised inductive learning. A set of training examples with their correct classifications is used to generate a decision tree that, hopefully, classifies each example in the test set correctly. To get started, consider the problem of learning the concept of whether or not to purchase a music CD. To keep things simple enough to work through this problem by hand, we use a very small number of examples from which we want to learn the concept. In Problem 2 you'll use a larger dataset.

Assume you are using the following attributes to describe the examples:

     TYPE         possible values:     Rock, Jazz, HipHop

     PRICE        possible values:     Cheap, Expensive

(Since each attribute's value starts with a different letter, for shorthand we'll just use that initial letter, e.g., 'J' for Jazz.)

Our output decision is binary-valued, so we'll use '+' and '-' as our concept labels, indicating a "buy" recommendation or not, respectively.

Here is our TRAIN set:

     TYPE = H    PRICE = E    CATEGORY = +

     TYPE = R    PRICE = C    CATEGORY = +

     TYPE = R    PRICE = E    CATEGORY = +

     TYPE = H    PRICE = C    CATEGORY = +

     TYPE = J    PRICE = C    CATEGORY = +

     TYPE = R    PRICE = E    CATEGORY = -

     TYPE = J    PRICE = E    CATEGORY = -

     TYPE = J    PRICE = C    CATEGORY = -

     TYPE = H    PRICE = E    CATEGORY = +

     TYPE = J    PRICE = E    CATEGORY = -

     TYPE = R    PRICE = E    CATEGORY = -

     TYPE = J    PRICE = C    CATEGORY = +

     TYPE = R    PRICE = E    CATEGORY = -

And our TEST set:

     TYPE = R    PRICE = C    CATEGORY = +

     TYPE = J    PRICE = C    CATEGORY = -

     TYPE = J    PRICE = E    CATEGORY = -

     TYPE = R    PRICE = E    CATEGORY = +

     TYPE = H    PRICE = E    CATEGORY = +

(a) Constructing the Initial Decision Tree

Apply the decision tree algorithm given in Fig 18.7 of the text (we'll call this algorithm C5.0 from now on) to the TRAIN set. Show all your work.

If multiple features tie for the best one, choose the one whose name appears earliest in alphabetical order (e.g., AGE before PRICE before TYPE). If there is a tie in computing MajorityValue, choose '-'.
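The two tie-breaking rules above can be sketched in Java as follows. This is only an illustration, not part of the assignment: the class name TieBreakSketch, its method names, and the scores in main are hypothetical, and the scores are made-up numbers rather than values computed from the TRAIN set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class TieBreakSketch {
    // Returns the attribute with the highest score; ties go to the name
    // that comes first alphabetically.
    static String chooseAttribute(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        // A TreeMap iterates its keys in ascending order, so with a strict
        // '>' comparison the alphabetically earliest name keeps a tie.
        for (Map.Entry<String, Double> e : new TreeMap<>(scores).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    // MajorityValue for a binary concept: a tie is resolved as '-'.
    static char majorityValue(int pos, int neg) {
        return pos > neg ? '+' : '-';
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("TYPE", 0.5);   // made-up score
        scores.put("PRICE", 0.5);  // made-up score, tied with TYPE
        System.out.println(chooseAttribute(scores));
        System.out.println(majorityValue(3, 3));
    }
}
```

Note that scores computed in floating point may differ in the last digits even when they tie on paper, so a real program might compare them within a small tolerance.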

Some useful logarithmic formulas here are:

log_a (x/y) = log_a x - log_a y
log_a x = ( log_b x ) / ( log_b a )
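If you do choose to check your hand calculations with a program, the change-of-base formula above is all that is needed to get base-2 logarithms in Java. The following is a minimal sketch, assuming the attribute-selection measure is information gain computed from entropy in bits (consult Fig 18.7 and the text for the actual algorithm). The class and method names are hypothetical, and the counts used in main are made up, not taken from the TRAIN set.

```java
class EntropySketch {
    // log base 2 via the change-of-base formula: log_2 x = ln x / ln 2
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Entropy, in bits, of a set with pos positive and neg negative examples.
    static double entropy(int pos, int neg) {
        int total = pos + neg;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int count : new int[] {pos, neg}) {
            if (count > 0) {  // skip empty classes: 0 * log 0 is taken as 0
                double p = (double) count / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // Information gain of a split, where counts[i] = {pos, neg} for the
    // i-th value of the attribute being tested.
    static double gain(int[][] counts) {
        int pos = 0;
        int neg = 0;
        for (int[] c : counts) {
            pos += c[0];
            neg += c[1];
        }
        int total = pos + neg;
        if (total == 0) {
            return 0.0;
        }
        double remainder = 0.0;
        for (int[] c : counts) {
            remainder += ((double) (c[0] + c[1]) / total) * entropy(c[0], c[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    public static void main(String[] args) {
        // Made-up counts: an evenly mixed set, and a split that separates it perfectly.
        System.out.println(entropy(1, 1));
        System.out.println(gain(new int[][] {{2, 0}, {0, 2}}));
    }
}
```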

(b) Plotting the Results

Plot each example in the TRAIN set as a point on a two-dimensional graph. (You can use any plotting software you like or do it neatly by hand.) Use two different colors or symbols to distinguish positive from negative examples. The musical TYPE should be graphed on one axis and the PRICE should be graphed on the other axis. Next, add a decision boundary for each leaf node in your decision tree from (a). Indicate the correspondence between the boundaries you add and the associated decision tree leaf node.

(c) Estimating Future Accuracy

Use the decision tree produced in part (a) to classify the TEST examples. Report the accuracy (i.e., percent correct classification) on these examples. Briefly discuss your results.
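Accuracy here is simply the number of test examples whose predicted label matches the given CATEGORY, divided by the number of test examples, expressed as a percentage. A minimal Java sketch of that arithmetic, with a hypothetical class name and made-up labels that are not the answer to this problem:

```java
class AccuracySketch {
    // Percent of positions at which the predicted label equals the actual label.
    static double accuracyPercent(char[] predicted, char[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return 100.0 * correct / actual.length;
    }

    public static void main(String[] args) {
        // Made-up labels for illustration: three of the four predictions match.
        char[] predicted = {'+', '-', '+', '-'};
        char[] actual    = {'+', '-', '-', '-'};
        System.out.println(accuracyPercent(predicted, actual) + "% correct");
    }
}
```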

Since Problem 1 is a paper and pencil exercise, make sure that your answers can be read by the TA (in other words, type your answers or print neatly).

(This problem is NOT assigned) Problem 2 - Building Decision Trees in Java (This is still under construction. DO NOT FOLLOW THIS for now)

In this problem you are to implement in Java the decision-tree induction algorithm given in Fig 18.7.

You should create a class whose calling convention is as follows:

  java hw1 <attributes.dat> <train.dat> <test.dat>

Use the TRAINING SET of examples to build a decision tree.
Print out the decision tree.
Classify the examples in the TESTING SET using the decision tree, reporting which examples were INCORRECTLY classified. Just print out the labels of the examples incorrectly classified.

You can test your program with the example files we have provided.

Run the Decision-Tree-Learning program using the training set to build two trees for your domain, once using RANDOM and once using MAX-GAIN for the Choose-Attribute function. (Hard-code the parameters RANDOM and MAX-GAIN and document this in your lab report, i.e., label which function generated each tree that you print and hand in.)
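One possible way to hard-code the two Choose-Attribute variants is a small enum that selects the behavior. This is a sketch only: the names ChooseAttributeSketch, Strategy, and choose are hypothetical, and the gain function is passed in as a parameter so that the sketch does not assume how your program stores examples.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

class ChooseAttributeSketch {
    enum Strategy { RANDOM, MAX_GAIN }

    // Picks the attribute to test next from the attributes still available.
    static String choose(Strategy strategy, List<String> attributes,
                         ToDoubleFunction<String> gain, Random rng) {
        if (attributes.isEmpty()) {
            throw new IllegalArgumentException("no attributes left to choose from");
        }
        if (strategy == Strategy.RANDOM) {
            // Uniformly random pick among the remaining attributes.
            return attributes.get(rng.nextInt(attributes.size()));
        }
        // MAX_GAIN: keep the attribute with the largest gain; on a tie the
        // one listed first is kept.
        String best = attributes.get(0);
        double bestGain = gain.applyAsDouble(best);
        for (String a : attributes) {
            double g = gain.applyAsDouble(a);
            if (g > bestGain) {
                best = a;
                bestGain = g;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> attrs = Arrays.asList("PRICE", "TYPE");
        // Made-up gains for illustration only.
        ToDoubleFunction<String> gain = a -> a.equals("TYPE") ? 0.4 : 0.1;
        System.out.println(choose(Strategy.MAX_GAIN, attrs, gain, new Random()));
        System.out.println(choose(Strategy.RANDOM, attrs, gain, new Random()));
    }
}
```

Passing the Random object in, rather than creating it inside choose, makes it possible to fix a seed so that the RANDOM tree in your report can be reproduced.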

As you know, you can compute "log base 2" as follows:
log2(x) = log(x) / log(2) where "log2" means "log base 2"

Calculate the classification accuracy of the trees for the test set.

(Update) Clarifying Russell and Norvig pg. 535: when classifying at a leaf node, use the majority vote at that leaf or, if the leaf has no majority, the majority vote at the closest ancestor that has one.

(Update) In the case that the root node does not have a majority, use the first value of the classification in the attributes.names file.
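Taken together, the two updates describe a fallback chain: the leaf's own majority, then the nearest ancestor with a majority, then a default label. A minimal Java sketch, assuming a binary '+'/'-' concept and a hypothetical node class that keeps a link to its parent and the counts of training examples that reached it; the defaultLabel argument stands for the first classification value listed in the attributes file.

```java
class VoteNode {
    final VoteNode parent;  // null for the root
    final int pos;          // positive training examples that reached this node
    final int neg;          // negative training examples that reached this node

    VoteNode(VoteNode parent, int pos, int neg) {
        this.parent = parent;
        this.pos = pos;
        this.neg = neg;
    }

    // Walks from the leaf toward the root and returns the majority label of
    // the first node that has one; falls back to defaultLabel if even the
    // root is tied.
    static char label(VoteNode leaf, char defaultLabel) {
        for (VoteNode n = leaf; n != null; n = n.parent) {
            if (n.pos != n.neg) {
                return n.pos > n.neg ? '+' : '-';
            }
        }
        return defaultLabel;
    }

    public static void main(String[] args) {
        // Made-up counts: the leaf is tied, its parent has a '+' majority.
        VoteNode root = new VoteNode(null, 2, 2);
        VoteNode inner = new VoteNode(root, 3, 1);
        VoteNode leaf = new VoteNode(inner, 1, 1);
        System.out.println(label(leaf, '-'));
    }
}
```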

The Java classes BufferedReader and StringTokenizer should prove useful for parsing the data files.
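As a rough illustration of how these two classes fit together, the sketch below reads a data source line by line and splits each line into tokens. The file format is not described here, so the sketch assumes one example per line with whitespace-separated values; the class name DataFileSketch and the method readRows are hypothetical. It accepts any Reader, so in a real program you would pass a FileReader for train.dat or test.dat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DataFileSketch {
    // Reads every non-blank line and returns its whitespace-separated tokens.
    static List<String[]> readRows(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // With no delimiter argument, StringTokenizer splits on whitespace.
                StringTokenizer tok = new StringTokenizer(line);
                if (!tok.hasMoreTokens()) {
                    continue;  // skip blank lines
                }
                // countTokens() shrinks as tokens are consumed, so size the
                // array before calling nextToken().
                String[] row = new String[tok.countTokens()];
                for (int i = 0; i < row.length; i++) {
                    row[i] = tok.nextToken();
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in for a data file, in the assumed format.
        List<String[]> rows = readRows(new StringReader("R C +\n\nJ E -\n"));
        for (String[] row : rows) {
            System.out.println(String.join(",", row));
        }
    }
}
```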

Make sure you write a general solution for all decision trees using the specified file format because we'll test your program with different data sets. Collecting the data and writing the file reading procedures might take some time, so start early!

What to Turn In

Put all of your files in your own private handin directory, say called hw4-handin. Don't forget to put all your classes and data into that directory (including your train.dat, test.dat and attributes.dat files). If your program does not compile, we're unable to grade it. Next, add the following line to your .cshrc.local file:

set path = ($path /s/handin/bin)

Finally, execute the following command from the directory containing your handin directory:

handin -c cs170-1 -a hw4 -d hw4-handin


Also turn in a printout of your well-documented source code and the results of the runs. Finally, don't forget to hand in your solution to Problem 1; either type your answers or neatly print them. For Problem 2, write a short lab report (1-2 pages) that describes your domain, explains how you acquired the data, and discusses the results of the runs. Staple all pages together and put your name, your login, and the date on top of the front page.