## CS170 Program #2: Handwritten Digit Recognition using Neural Networks

### Due: 4 March 2003

For this assignment you may work either alone or with one other person. Your team will implement a feedforward neural network with one hidden layer that learns how to recognize handwritten digits. This problem is important in applications such as automatic zip code reading on letters, which is in current use by the U.S. Postal Service.

You are not allowed to collaborate with anyone other than your team partner on this assignment. Your team must write all of the code you use. In particular, you are not to look at other code related to back-propagation that you might find on the Web or elsewhere.

## Implement the Back-Propagation Algorithm

Implement the back-propagation algorithm given in Figure 19.14 on page 581 of the textbook for a two-layer network, i.e., one input layer, one hidden layer, and one output layer, to be used for recognizing the five handwritten digits 5, 6, 7, 8, and 9.

Each input is an 8 x 8 image given as a row-major ordered sequence of pixel brightness values, which are integers in the range [0..16]. (Each 8 x 8 image is actually a pre-processed version of a 32 x 32 binary input image.) Thus the input layer will have 64 units. The output layer will have five units, one for each of the possible classifications of the input. The goal is to train the network so that if the input image contains a "5," for example, then the first output unit's activation is near 1 and all the other output units' activations are near 0. Similarly, if the input image contains a "6" then the second output unit's activation should be near 1, and so on.

The hidden layer should have 10 units. Each unit in the input layer is connected to every unit in the hidden layer, and each unit in the hidden layer is connected to every unit in the output layer. So the entire network contains ((64 + 1) * 10) + ((10 + 1) * 5) = 705 weights. The extra "+1"s in this formula arise because each unit in the hidden layer and in the output layer also has an associated bias, which is treated as an extra weight with constant input value -1.

Each unit in the hidden layer and the output layer should compute the sigmoid function, defined in Figure 19.5. This function returns a real value strictly between 0 and 1, not a binary value as the linear threshold unit does. Initialize all weights in the network to random values in the range [-1.0, 1.0]. Experimentally pick your own value for the learning rate parameter, alpha; initially try a value of 0.2. Implement all real-valued objects, including unit activation values and weights, as doubles.
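As a rough illustration of the unit model described above (not the back-propagation algorithm itself), the following C++ sketch shows a sigmoid unit whose bias is stored as a final weight with constant input -1, together with random weight initialization. The names `sigmoid`, `unitOutput`, and `randomLayer` are hypothetical, and storing each unit's weights as one vector with the bias last is an assumption rather than a requirement of the assignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Logistic sigmoid: maps any real input to a value between 0 and 1.
double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Output of one unit. `weights` holds one weight per input followed by the
// bias weight, whose input is fixed at -1 (assumed layout: bias stored last).
double unitOutput(const std::vector<double>& weights,
                  const std::vector<double>& inputs) {
    assert(weights.size() == inputs.size() + 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += weights[i] * inputs[i];
    }
    sum += weights.back() * -1.0;  // bias term with constant input -1
    return sigmoid(sum);
}

// Hypothetical helper: builds numUnits weight vectors, each with
// numInputs + 1 entries drawn uniformly from the range -1.0 to 1.0.
std::vector<std::vector<double>> randomLayer(std::size_t numInputs,
                                             std::size_t numUnits,
                                             std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<std::vector<double>> layer(
        numUnits, std::vector<double>(numInputs + 1));
    for (auto& unit : layer) {
        for (double& w : unit) {
            w = dist(rng);
        }
    }
    return layer;
}
```

With this layout, `randomLayer(64, 10, rng)` and `randomLayer(10, 5, rng)` together hold the 705 weights counted above.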

If you use Java: Put all of your classes and methods in one Java file called program2.java. You should call the code as follows:

    java program2 <examples.dat>

You can implement this any way you like, though you may want to implement routines called ForwardProp, BackProp, Train, and Test, among others.

### The Data

The data file, called digits5-9.dat, is an ASCII file containing 100 examples, one per line. Each example is a comma-separated (without any white space) list of 65 integer values, the first 64 specifying the input and the last value specifying the digit which is the desired output. The input values are integers in the range [0..16]. You should first normalize the input values by converting them to reals in the range [0..1] by dividing every value by 16.0. This is useful because large input values tend to produce large weighted sums, which push the sigmoid into its flat regions where its derivative is very close to 0; that can cause the network to converge very slowly to a good set of weights.
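One possible way to read and normalize a single line of the data file is sketched below in C++. The `Example` struct and the `parseExample` function are hypothetical names introduced only for illustration; the sketch assumes each line holds exactly 65 comma-separated integers, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One example: 64 normalized pixel values plus the digit label.
struct Example {
    std::vector<double> inputs;
    int digit;
};

// Parses one comma-separated line of 65 integers and divides the first 64
// by 16.0 so that every input lies in the range 0 to 1.
Example parseExample(const std::string& line) {
    std::vector<int> values;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        values.push_back(std::stoi(field));
    }
    assert(values.size() == 65);

    Example example;
    example.digit = values.back();  // last value is the desired digit
    example.inputs.reserve(64);
    for (std::size_t i = 0; i < 64; ++i) {
        example.inputs.push_back(values[i] / 16.0);
    }
    return example;
}
```

Reading the whole file then amounts to calling `parseExample` once per line, for instance inside a `std::getline` loop over a `std::ifstream`.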

You'll have to convert the desired output digit to a target output vector for the five output units. For example, if the digit is a "5" then create the target vector [0.9 0.1 0.1 0.1 0.1]. Using these teacher output values is preferred because the sigmoid function cannot produce the exact output values 0 and 1 with finite weights, so with targets of 0 and 1 the weights may grow very large, causing overflow problems.
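The digit-to-target conversion can be sketched in C++ as follows. The function name `makeTarget` is hypothetical; the mapping of digit 5 to the first output unit, digit 6 to the second, and so on follows the description above.

```cpp
#include <array>
#include <cassert>

// Builds the five-element target vector for a digit in the range 5..9:
// 0.9 for the unit that corresponds to the digit, 0.1 for all others.
std::array<double, 5> makeTarget(int digit) {
    assert(digit >= 5 && digit <= 9);
    std::array<double, 5> target;
    target.fill(0.1);
    target[digit - 5] = 0.9;
    return target;
}
```

For example, `makeTarget(7)` yields 0.9 in the third position and 0.1 everywhere else.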

### Training

You are to implement a 5-fold cross-validation experiment to evaluate the performance of your network. To do this, you should divide the input file into five parts, each containing 100/5 = 20 examples. For each of the five runs, 4 of the parts will be used as the training set and the 5th part will be the test set. Train your network for at least 200 epochs of the training set. (Note: Use the sum-of-MSE curve described below to decide experimentally how many epochs to train.) After every 10 epochs compute the sum of the mean squared error (MSE) on the training set as follows:

$$\mathrm{MSE} = \sum_{j=1}^{m} \left( \frac{1}{2} \sum_{i=1}^{5} (T_{ij} - O_{ij})^2 \right)$$

where $m$ is the number of examples in the training set, $T_{ij}$ is the target output value (either 0.1 or 0.9) of the $i$th output unit for the $j$th training example, and $O_{ij}$ is the actual real-valued output of the $i$th output unit for the $j$th training example. You should compute this sum-of-MSE value after every 10 epochs by stopping backprop and running through all of the training examples to compute this error value. You should not compute this "on the fly" after each training example is used to update the weights because then the error for each example would be based on a different network (i.e., set of weights).
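The error measure defined above can be sketched in C++ as shown below. The function name `sumSquaredError` is hypothetical, and the sketch assumes that the caller has already run a forward pass over every training example with the weights held fixed and collected the outputs, so that `targets[j][i]` and `outputs[j][i]` are the target and actual values of output unit i for example j.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum over all examples of one half the squared error over the output units.
// targets[j][i] and outputs[j][i] refer to output unit i of example j.
double sumSquaredError(const std::vector<std::vector<double>>& targets,
                       const std::vector<std::vector<double>>& outputs) {
    assert(targets.size() == outputs.size());
    double total = 0.0;
    for (std::size_t j = 0; j < targets.size(); ++j) {
        assert(targets[j].size() == outputs[j].size());
        double exampleError = 0.0;
        for (std::size_t i = 0; i < targets[j].size(); ++i) {
            const double diff = targets[j][i] - outputs[j][i];
            exampleError += diff * diff;
        }
        total += 0.5 * exampleError;
    }
    return total;
}
```

Keeping this computation separate from the weight-update loop makes it harder to accidentally accumulate the error while the weights are still changing.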

Plot these sum-of-MSE values as a training curve as shown in Figure 19.15(a). That is, the x-axis will show the epoch number and the y-axis will show the sum-of-MSE value. You can create this plot manually or by entering the data into any plotting program (e.g., in Matlab or Gnuplot).
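The bookkeeping for the 5-fold experiment described in this section might look like the following C++ sketch. The `Split` struct and `makeSplit` function are hypothetical, and the sketch assumes that folds are contiguous blocks of examples taken in file order; the handout does not say whether the examples should be shuffled first, so treat that choice as an open assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the examples used for training and testing in one run.
struct Split {
    std::vector<std::size_t> trainIndices;
    std::vector<std::size_t> testIndices;
};

// Divides numExamples into numFolds contiguous folds (assumption: no
// shuffling) and returns the split in which fold testFold is the test set.
Split makeSplit(std::size_t numExamples, std::size_t numFolds,
                std::size_t testFold) {
    assert(numFolds > 0 && testFold < numFolds);
    assert(numExamples % numFolds == 0);
    const std::size_t foldSize = numExamples / numFolds;
    Split split;
    for (std::size_t i = 0; i < numExamples; ++i) {
        if (i / foldSize == testFold) {
            split.testIndices.push_back(i);
        } else {
            split.trainIndices.push_back(i);
        }
    }
    return split;
}
```

Calling `makeSplit(100, 5, k)` for k = 0 through 4 gives the five runs, each with 80 training examples and 20 test examples.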

### Testing

Test your network using the examples in the test set created for each of the five cross-validation runs. Report the percentage of test examples classified correctly for each of the five runs, plus the average over the 5 runs. Define the output digit computed by the network as the digit corresponding to the output unit with the maximum activation (i.e., output) value.

In case you are interested, the complete dataset that includes examples of all ten digits and includes over 5,000 examples is available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/. Another dataset of handwritten digits, called MNIST, is also available, containing about 60,000 digits.
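The classification rule and the accuracy figure described in this section can be sketched in C++ as follows. The names `predictDigit` and `percentCorrect` are hypothetical, and breaking ties in favor of the lowest-numbered output unit is an assumption, since the handout does not say how ties should be handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns the digit (5..9) whose output unit has the largest activation.
// If several units tie, the first one wins (assumed tie-breaking rule).
int predictDigit(const std::vector<double>& outputs) {
    assert(outputs.size() == 5);
    const auto best = std::max_element(outputs.begin(), outputs.end());
    return 5 + static_cast<int>(std::distance(outputs.begin(), best));
}

// Percentage of examples whose predicted digit matches the actual digit.
double percentCorrect(const std::vector<int>& predicted,
                      const std::vector<int>& actual) {
    assert(predicted.size() == actual.size() && !actual.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (predicted[i] == actual[i]) {
            ++correct;
        }
    }
    return 100.0 * correct / actual.size();
}
```

Averaging the five per-run values returned by `percentCorrect` gives the overall cross-validation figure requested above.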

### Experiments with Varying Numbers of Hidden Units

Repeat the training and testing steps given above after varying the number of hidden units in your network. Use values of 5 and 50. Show the training plot and the results of 5-fold cross-validation for each of these networks using the same training and testing sets. Use a separate graph for each network. Comment on how performance changes with the number of hidden units, both in terms of the percentage correctly classified on the training set and the test set, and also in terms of the number of iterations that seem necessary for the network to "converge." Specify what you consider to be the "best" set of parameters (alpha, number of hidden units, number of iterations) for this problem based on your experiments.

## What to Turn In

I expect you to write this program in C or C++. If you wish to do this in Java, check with the TA before you spend any time on it.

If you write your code in Java, the file name and invocation to use were described in an earlier paragraph. But please ask the TA how he prefers to receive your submission.

In addition to making an electronic submission of the code, you need to do a bit of documentation. This documentation includes (a) any instructions needed to run the code, and (b) a report that indicates the value of alpha that you used, the training curve plot, and the final percentage correct results for the training and test sets. Put this all together, stapled, with a cover sheet showing all team members' names.

## Extra Credit

For extra credit, implement your network so as to recognize all ten digits, 0, ..., 9. Or, experiment with varying the number of training examples, once you've selected the "best" number of hidden units and number of iterations.

Another possible extension is to take into account the knowledge you have about the problem domain. That is, some pixels should likely be weighted more than others because they are in important positions in the input image for distinguishing between the different digits. For more ideas on this, see the following papers, both of which are available online.

1. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems 2 (NIPS 89), D. Touretzky, ed., Morgan Kaufmann, 1990.
2. Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik, Learning algorithms for classification: A comparison on handwritten digit recognition, in Neural Networks: The Statistical Mechanics Perspective, J. Oh, C. Kwon, and S. Cho, eds., World Scientific, 1995.

Finally, you can implement other extensions as you like. Write a separate report on the extra credit work you did.