Comments start with // and extend to the next end-of-line character.
The purpose of this file is to enumerate every possible attribute and, for each attribute, its possible values. Each line contains a single attribute and its possible values. The attribute name is the first token on the line. The name is followed by one or more of the following delimiters: space, newline, carriage return, tab, and comma. Using StringTokenizer in Java, the delimiter string is " \n\r\t,". Each possible value of the attribute is then listed on the same line, with values separated by one or more delimiters. The last line contains the name of the concept followed by one or more delimiters, and then a list of its possible values, again separated by one or more delimiters. Thus your code should work not only for Yes/No concepts but also for any classification problem, where the number of classes equals the number of possible values listed with the concept name. Each line may end with a comment, introduced by '//'.
For example, an attributes file might look like:
handicapped_infants y, n // Bill# 7031
water_project_cost_sharing y, n // Bill# 5396
party_of_this_congress_person republican, democrat // Is this person a Republican or a Democrat?
One way to implement reading the attributes file is:
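A minimal sketch in Java, assuming the file's lines have already been loaded (for example with Files.readAllLines). The names AttributesFileReader, stripComment, and parse are illustrative and not part of the assignment. The sketch also assumes that attribute names are unique and that blank or comment-only lines should be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class AttributesFileReader {
    // Delimiter string given in the file-format description.
    static final String DELIMITERS = " \n\r\t,";

    /** Removes a trailing "//" comment, if the line has one. */
    static String stripComment(String line) {
        int idx = line.indexOf("//");
        return idx >= 0 ? line.substring(0, idx) : line;
    }

    /**
     * Maps each name to its list of possible values, preserving file order.
     * The last entry is the concept and its class labels.
     */
    static Map<String, List<String>> parse(Iterable<String> lines) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        for (String line : lines) {
            StringTokenizer tokens =
                    new StringTokenizer(stripComment(line), DELIMITERS);
            if (!tokens.hasMoreTokens()) {
                continue; // assumption: skip blank or comment-only lines
            }
            String name = tokens.nextToken(); // first token is the name
            List<String> values = new ArrayList<>();
            while (tokens.hasMoreTokens()) {
                values.add(tokens.nextToken());
            }
            attributes.put(name, values);
        }
        return attributes;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "handicapped_infants y, n // Bill# 7031",
                "water_project_cost_sharing y, n // Bill# 5396",
                "party_of_this_congress_person republican, democrat // concept");
        for (Map.Entry<String, List<String>> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A LinkedHashMap is used here so that the attribute order in the file is preserved, since the data files list values in that same order. To read from disk, the lines could be obtained with Files.readAllLines(Paths.get("attributes.dat")) and passed to parse.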
For each application domain there will be two data files, one containing the training
set of examples, called train.dat, and the other containing the testing
set of examples, called test.dat. The format of both files is the same.
Each line defines one example. Each line lists one value for each of the
attributes given in the attributes.dat file, separated by one or more of the
delimiters given above. The values appear in the same order as the attributes
in the attributes.dat file. Following the last value will be a delimiter and
then the desired classification for this example. The rest of the line is
ignored, including any comments that occur after '//'. For example, the
following is a possible list of three examples:
n,n, n,y,y,n,n,n,n ,y,n,y,y,y,n,n,republican
n,n,n,y,y,y,n,n,n,y,,n,y,y,y,n,y,republican
n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,n, republican
**IMPORTANT**
Be sure that the files you create for your own concept and your file-reading
code conform exactly to these specifications. Also, the code you submit must
compile and run on the Linux machines so that the TA can test additional
domains with your code.
One way to implement reading a data file is as follows:
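A minimal sketch in Java, assuming the lines of train.dat or test.dat have already been loaded and that the number of attributes (every line of attributes.dat except the last) is known. The names DataFileReader, Example, and parse are illustrative and not part of the assignment. Rejecting lines with too few tokens is also an assumption, since the format description does not say how malformed lines should be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class DataFileReader {
    // Same delimiter string as for the attributes file.
    static final String DELIMITERS = " \n\r\t,";

    /** One example: values in attributes.dat order, plus its classification. */
    static final class Example {
        final List<String> values;
        final String label;

        Example(List<String> values, String label) {
            this.values = values;
            this.label = label;
        }
    }

    /** Parses one example per non-blank line; extra trailing tokens are ignored. */
    static List<Example> parse(Iterable<String> lines, int numAttributes) {
        List<Example> examples = new ArrayList<>();
        for (String raw : lines) {
            int comment = raw.indexOf("//");
            String line = comment >= 0 ? raw.substring(0, comment) : raw;
            StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
            if (!tokens.hasMoreTokens()) {
                continue; // assumption: skip blank or comment-only lines
            }
            if (tokens.countTokens() < numAttributes + 1) {
                // assumption: treat a short line as an error
                throw new IllegalArgumentException("Too few tokens: " + raw);
            }
            List<String> values = new ArrayList<>(numAttributes);
            for (int i = 0; i < numAttributes; i++) {
                values.add(tokens.nextToken());
            }
            String label = tokens.nextToken(); // token after the last value
            examples.add(new Example(values, label));
        }
        return examples;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "n,n, n,y,y,n,n,n,n ,y,n,y,y,y,n,n,republican",
                "n,n,n,y,y,y,n,n,n,y,,n,y,y,y,n,y,republican");
        for (Example e : parse(lines, 16)) {
            System.out.println(e.values + " -> " + e.label);
        }
    }
}
```

Because StringTokenizer treats a run of consecutive delimiters as a single separator, inputs such as ",," or ", " do not produce empty values. This also means the format cannot represent a missing value as an empty field.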
To help you develop your decision-tree software, you may use the following domain and small sets of training and testing examples. The concept in this domain is "Is this congressman a Republican or a Democrat?" given the person's voting record on 16 bills in the U.S. House of Representatives in the early 1980s. Each of the 16 attributes is binary-valued, indicating whether the person voted for ("y") or against ("n") the bill. Of course, your code should not assume that all attributes are binary-valued. Instead of indicating the classification as Yes/No or +/-, the desired output is specified as Republican/Democrat in this dataset.