Data File Format for HW1

Comments

Comments start with // and extend to the next end-of-line character.
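
As a minimal sketch of this rule, the helper below drops everything from the first // onward and trims what is left. The class and method names (CommentStripper, stripComment) are illustrative assumptions, not part of the assignment.

```java
// Hypothetical helper; the names are illustrative and not required by the assignment.
final class CommentStripper {
    /** Returns the line without its trailing "//" comment, trimmed of surrounding whitespace. */
    static String stripComment(String line) {
        int start = line.indexOf("//");   // position of the first comment marker, or -1 if none
        String content = (start >= 0) ? line.substring(0, start) : line;
        return content.trim();
    }
}
```

A line that holds only a comment comes back as an empty string, so callers can skip it.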

The Attributes file: attributes.dat

The purpose of this file is to enumerate every possible attribute and, for each attribute, enumerate its possible values. Each line will contain a single attribute and its possible values. The attribute name will be the first token on a line. The name will be followed by at least one of the following delimiters: space, new line, carriage return, tab, and comma. Using StringTokenizer in Java, the delimiter string is " \n\r\t,". Each possible value of the attribute will then be listed on the same line, with each value separated by one or more delimiters. The last line will contain the name of the concept followed by one or more delimiters, and then a list of the possible values, again separated by one or more delimiters. Thus your code should work not only for Yes/No concepts, but also for any classification problem where the number of classes is equal to the number of possible values listed with the concept name. Each line may have a comment at the end of the line, designated by '//'.

For example, an attributes file might look like:

        handicapped_infants              y, n                 //  Bill# 7031
        water_project_cost_sharing       y, n                 //  Bill# 5396
        party_of_this_congress_person    republican, democrat // Is this person a Republican or a Democrat?
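
To illustrate how one such line breaks into tokens, here is a small sketch that uses java.util.StringTokenizer with the delimiter string given above. The class name AttributeLineDemo and the method tokenize are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; only StringTokenizer and the delimiter string come from the spec.
final class AttributeLineDemo {
    static final String DELIMITERS = " \n\r\t,";

    /** Splits one attributes.dat line into tokens: index 0 is the name, the rest are values. */
    static List<String> tokenize(String line) {
        int comment = line.indexOf("//");
        if (comment >= 0) {
            line = line.substring(0, comment);   // discard the trailing comment
        }
        StringTokenizer st = new StringTokenizer(line, DELIMITERS);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

Because StringTokenizer treats a run of consecutive delimiters as a single separator, "y, n" and "y,n" produce the same two value tokens.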

One way to implement reading the attributes file is:

  1. Create an attribute object, with a String for the name of the attribute and an ArrayList of values as data members.
  2. Create an attributespace object, with an ArrayList of attributes as members.
  3. The constructor will:
    1. Read in a line of the attributes file, remove any comments, and trim the line.
    2. Tokenize the line.
    3. Create an attribute.
    4. Store the name of the attribute.
    5. Insert each value into the ArrayList of the attribute.
    6. Insert the attribute into the ArrayList of the attributespace.
    7. Repeat until every line has been read.
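
The steps above could be sketched roughly as follows. This is one possible layout, not a required design: the class names Attribute and AttributeSpace, the concept() accessor, the choice to read from a BufferedReader, and the decision to skip lines with no tokens are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** One attribute: its name plus the values it may take. */
class Attribute {
    final String name;
    final List<String> values = new ArrayList<>();

    Attribute(String name) {
        this.name = name;
    }
}

/** All attributes in file order; by the file format, the last one is the concept. */
class AttributeSpace {
    static final String DELIMITERS = " \n\r\t,";
    final List<Attribute> attributes = new ArrayList<>();

    AttributeSpace(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {          // repeat until every line is read
            int comment = line.indexOf("//");
            if (comment >= 0) {
                line = line.substring(0, comment);        // remove any comment
            }
            StringTokenizer st = new StringTokenizer(line.trim(), DELIMITERS);
            if (!st.hasMoreTokens()) {
                continue;                                 // assumption: skip blank or comment-only lines
            }
            Attribute attribute = new Attribute(st.nextToken());   // first token is the name
            while (st.hasMoreTokens()) {
                attribute.values.add(st.nextToken());     // remaining tokens are the values
            }
            attributes.add(attribute);
        }
    }

    /** Hypothetical convenience accessor for the concept (the last line of the file). */
    Attribute concept() {
        return attributes.get(attributes.size() - 1);
    }
}
```

A caller might construct it with new AttributeSpace(new BufferedReader(new FileReader("attributes.dat"))). Taking a reader rather than a file name also lets the parser be exercised on an in-memory string.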

The Example files: train.dat and test.dat

For each application domain there will be two data files, one containing the training set of examples, called train.dat, and the other containing the testing set of examples, called test.dat. The format for the contents of each file is the same. Each line will define one example. Within each line, a value for each of the attributes listed in the attributes.dat file is given, with values separated by one or more of the delimiters given above. The values will be in the same order as the attributes in the attributes.dat file. Following the last value will be one or more delimiters and then the desired classification for this example. The rest of the line is ignored, including any comments that occur after '//'. For example, the following is a possible list of three examples:

        n,n, n,y,y,n,n,n,n ,y,n,y,y,y,n,n,republican
        n,n,n,y,y,y,n,n,n,y,,n,y,y,y,n,y,republican
        n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n,  republican
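
As a sketch of how one example line might be split into attribute values and a classification, consider the class below. The name Example, its fields, and the numAttributes parameter (the number of attributes, not counting the concept) are assumptions for illustration. It also assumes StringTokenizer semantics, under which a doubled delimiter such as ",," is one separator rather than an empty value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** One parsed example: attribute values in attributes.dat order, plus the classification. */
final class Example {
    static final String DELIMITERS = " \n\r\t,";
    final List<String> values = new ArrayList<>();
    final String label;

    /** numAttributes is the number of attributes, excluding the concept (hypothetical parameter). */
    Example(String line, int numAttributes) {
        int comment = line.indexOf("//");
        if (comment >= 0) {
            line = line.substring(0, comment);            // drop any trailing comment
        }
        StringTokenizer st = new StringTokenizer(line, DELIMITERS);
        if (st.countTokens() < numAttributes + 1) {
            throw new IllegalArgumentException("Too few tokens in example: " + line);
        }
        for (int i = 0; i < numAttributes; i++) {
            values.add(st.nextToken());                   // one value per attribute, in order
        }
        label = st.nextToken();                           // the desired classification
        // Any tokens after the classification are ignored, as the format allows.
    }
}
```

For the second sample line above, new Example(line, 16) would hold 16 values and the label "republican".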

**IMPORTANT**
Be sure that the files you create for your own concept and your file-reading code conform accurately to these specifications. Also, the code you submit must compile and run on the Linux machines so that the TA can test additional domains with your code.

One way to implement reading a data file is as follows:

  1. Read in a line of the data file, remove any comments, and trim the line.
  2. Store each line in a list to be used by the method that builds the tree.
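
The two steps above might look like the following sketch. The class name DataFileReader and the method readLines are illustrative, and skipping lines that are empty after cleaning is an assumption rather than something the format requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
final class DataFileReader {
    /** Returns the cleaned, non-empty lines of a data file in their original order. */
    static List<String> readLines(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            int comment = line.indexOf("//");
            if (comment >= 0) {
                line = line.substring(0, comment);   // remove any comment
            }
            line = line.trim();
            if (!line.isEmpty()) {                   // assumption: blank lines carry no example
                lines.add(line);
            }
        }
        return lines;
    }
}
```

The resulting list can then be handed to whatever method builds the tree, for example after calling readLines(new BufferedReader(new FileReader("train.dat"))).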

A Sample Pair of Files

To help you develop your decision-tree software, you may use the following domain and small sets of training and testing examples. The concept in this domain is "Is this congressman a Republican or a Democrat?" given the person's voting record on 16 bills in the U.S. House of Representatives in the early 1980s. Each of the 16 attributes is binary valued, indicating whether the person voted for ("y") or against ("n") the bill. Of course, your code should not assume that all attributes are binary valued. Instead of indicating the classification as Yes/No or +/-, the desired output is specified as Republican/Democrat in this dataset.