For this assignment you may work either alone or with one other person. Your team will implement a feedforward neural network with one hidden layer that learns how to recognize handwritten digits. This problem is important in applications such as automatic zip code reading on letters, which is in current use by the U.S. Postal Service.

You are not allowed to collaborate with anyone other than your team partner on this assignment. Your team must write all of the code you use. In particular, you are not to look at other code related to back-propagation that you might find on the Web or elsewhere.

Implement the back-propagation algorithm given in Figure
19.14 on page 581 of the textbook for a two-layer network, i.e., one input
layer, one hidden layer, and one output layer, to be used for recognizing the
five handwritten digits 5, 6, 7, 8, and 9. Each input is an 8 x 8 image given
as a row-major ordered sequence of pixel brightness values, which are integers
in the range [0..16]. (Actually, each 8 x 8 image is a pre-processed version of
a 32 x 32 binary input image.) Thus the input layer will have 64 units. The
output layer will have five units, one for each of the possible classifications
of the input. The goal is to train the network so that if the input image
contains a "5," for example, then the *first* output unit's
activation should be near 1 and *all the other* output units' activations
should be close to 0. Similarly, if the input image contains a "6,"
then the *second* output unit's activation should be near 1, and so on. The hidden layer
should have 10 units. Each unit in the input layer is connected to every unit
in the hidden layer, and each unit in the hidden layer is connected to every
unit in the output layer. So the entire network contains ((64 + 1) * 10) + ((10
+ 1) * 5) = 705 weights. The extra "+1"s in this formula are because
each unit in the hidden layer and in the output layer also has an associated
bias, which is treated as an extra weight with constant input value -1. Each
unit in the hidden layer and the output layer should compute the sigmoid
function, defined in Figure 19.5. This function will return a real value in the
range [0..1], not a binary value as the linear threshold unit does. Initialize
all weights in the network to random values in the range [-1.0, 1.0]. Experimentally
pick your own value for the learning rate parameter, alpha; initially try a
value of 0.2. *Implement all real-valued objects including unit activation
values and weights as* `double`*s*.
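As a sketch only (the class and method names below are illustrative, not part of the assignment), the sigmoid activation and the random weight initialization might look like this, with each unit's bias stored as one extra weight whose input is the constant -1:

```java
import java.util.Random;

public class NetInit {
    // Sigmoid squashes any real input into the open interval (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Allocate a units x (inputs + 1) weight matrix; the extra column is
    // each unit's bias weight (its input is the constant -1).
    static double[][] randomWeights(int units, int inputs, Random rng) {
        double[][] w = new double[units][inputs + 1];
        for (int u = 0; u < units; u++)
            for (int i = 0; i <= inputs; i++)
                w[u][i] = 2.0 * rng.nextDouble() - 1.0;  // uniform in [-1, 1)
        return w;
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        double[][] hidden = randomWeights(10, 64, rng);  // 10 x 65
        double[][] output = randomWeights(5, 10, rng);   // 5 x 11
        int total = hidden.length * hidden[0].length
                  + output.length * output[0].length;
        System.out.println(total);  // 705, matching the count above
    }
}
```

Note that the 65 and 11 columns per unit account for the "+1" bias weights in the 705-weight count.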

**If you use Java**: Put all of your classes and methods in one Java file
called `program2.java`.
You should call the code as follows:

`java program2 <examples.dat>`

You can implement this any way you like, though you may want to implement
routines called `ForwardProp`,
`BackProp`,
`Train`,
and `Test`,
among others.
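One possible shape for a `ForwardProp`-style routine (this is only a sketch under the assumption that each unit's weights are stored with its bias last, fed the constant input -1 as the assignment specifies):

```java
public class Forward {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Compute one layer's activations from the previous layer's outputs.
    // w is units x (in.length + 1); the last weight per unit is the bias.
    static double[] layer(double[][] w, double[] in) {
        double[] out = new double[w.length];
        for (int u = 0; u < w.length; u++) {
            double net = w[u][in.length] * -1.0;   // bias term, input -1
            for (int i = 0; i < in.length; i++)
                net += w[u][i] * in[i];
            out[u] = sigmoid(net);
        }
        return out;
    }
}
```

A full forward pass would call `layer` twice: once to map the 64 inputs to the 10 hidden activations, and once to map those to the 5 outputs.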

The data file, called `digits5-9.dat`,
is an ASCII file containing 100 examples, one per line. Each example is a
comma-separated (without any white space) list of 65 integer values, the first
64 specifying the input and the last value specifying the digit which is the
desired output. The input values are integers in the range [0..16]. You should
first normalize the input values by converting them to reals in the range
[0..1] by dividing every value by 16.0. This is useful because the derivative
of the sigmoid function is often very close to 0, which can cause the network
to converge very slowly to a good set of weights.
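One way (illustrative only, not prescribed) to read a line of the data file is to split on commas, scale the 64 pixel values into [0, 1], and keep the final value as the label:

```java
public class ParseExample {
    double[] input = new double[64];
    int label;

    // Parse one comma-separated line of 65 integers: 64 pixels, then the digit.
    static ParseExample fromLine(String line) {
        String[] tok = line.split(",");
        ParseExample ex = new ParseExample();
        for (int i = 0; i < 64; i++)
            ex.input[i] = Integer.parseInt(tok[i]) / 16.0;  // normalize to [0, 1]
        ex.label = Integer.parseInt(tok[64]);               // digit 5..9
        return ex;
    }
}
```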

You'll have to convert the desired output digit to a target output vector for the five output units. For example, if the digit is a "5," then create the target vector [0.9 0.1 0.1 0.1 0.1]. This set of teacher output values is preferred because the sigmoid function cannot produce the exact output values 0 and 1 with finite weights, so the weight values may grow very large, causing overflow problems.
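A hypothetical helper for this encoding (the name is ours, not part of the assignment) could be:

```java
public class TargetVec {
    // Map a digit label in 5..9 to the five-unit target vector:
    // 0.9 at the correct unit, 0.1 everywhere else.
    static double[] target(int digit) {
        double[] t = new double[5];
        java.util.Arrays.fill(t, 0.1);
        t[digit - 5] = 0.9;   // unit 0 for "5", unit 1 for "6", ...
        return t;
    }
}
```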

You are to implement a 5-fold cross-validation experiment to
evaluate the performance of your network. To do this, you should divide the
input file into five parts, each containing 100/5 = 20 examples. For each of
the five runs, 4 of the parts will be used as the training set and the 5th part
will be the test set. Train your network for *at least* 200 epochs of the
training set. (Note: You should experiment with the number of epochs to train
based on the sum-of-MSE curve (see below) your results produce.) After every 10
epochs compute the sum of the mean squared error (MSE) on the training set as
follows:

MSE = Σ_{j=1..m} (1/2) Σ_{i=1..5} (T_ij − O_ij)^2

where *m* is the number of examples in the training set, *Tij* is the
target output value (either 0.1 or 0.9) of the *i*th output unit for the *j*th
training example, and *Oij* is the actual real-valued output of the *i*th
output unit for the *j*th training example. You should compute this
sum-of-MSE value after every 10 epochs by stopping backprop and running through
all of the training examples to compute this error value. You should *not*
compute this "on the fly" after each training example is used to
update the weights because then the error for each example would be based on a
different network (i.e., set of weights).
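This error computation can be sketched as follows, assuming `targets[j]` and `outputs[j]` hold the five target and actual output values for the *j*th training example, with the outputs produced by a full pass over the training set with the weights frozen (array names are illustrative):

```java
public class SumMse {
    // Sum over all examples of (1/2) * squared error over the 5 output units.
    static double sumMse(double[][] targets, double[][] outputs) {
        double sum = 0.0;
        for (int j = 0; j < targets.length; j++) {
            double e = 0.0;
            for (int i = 0; i < 5; i++) {
                double d = targets[j][i] - outputs[j][i];
                e += d * d;
            }
            sum += 0.5 * e;   // 1/2 * squared error for example j
        }
        return sum;
    }
}
```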

Plot these sum-of-MSE values as a training curve as shown in Figure
19.15(a). That is, the x-axis will show the epoch number and the y-axis will
show the sum-of-MSE value. You can create this plot manually or by entering the
data into any plotting program (e.g., in `Matlab` or `Gnuplot`).

Test your network using the examples in the test set created
for each of the five cross-validation runs. Report the percentage correct
classification on the test set for each of the five runs, plus the average of
the 5 runs. Define the output digit computed by the network as the
corresponding output unit with *maximum* activation (i.e., output) value.
In case you are interested, the complete dataset that includes examples of all
ten digits and includes over 5,000 examples is available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/.
Another dataset of handwritten digits, called MNIST, is also available,
containing about 60,000 digits.
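The maximum-activation decoding rule above might be sketched like this (the class and method names are ours):

```java
public class Decode {
    // The predicted digit is 5 plus the index of the output unit
    // with the largest activation.
    static int predict(double[] out) {
        int best = 0;
        for (int i = 1; i < out.length; i++)
            if (out[i] > out[best]) best = i;
        return 5 + best;   // unit 0 -> "5", ..., unit 4 -> "9"
    }
}
```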

Repeat the training and testing steps given above after varying the number of hidden units in your network. Use values of 5 and 50. Show the training plot and the results of 5-fold cross-validation for each of these networks using the same training and testing sets. Use a separate graph for each network. Comment on how performance changes with the number of hidden units, both in terms of the percentage correctly classified on the training set and the test set, and also in terms of the number of iterations that seem necessary for the network to "converge." Specify what you consider to be the "best" set of parameters (alpha, number of hidden units, number of iterations) for this problem based on your experiments.

I expect you to write this program in C or C++. If you wish to do this in Java, first check with the TA before you spend any time.


**If you write your code in Java, a
possible way of turning in your results was described in an earlier
paragraph, but please ask the TA what he prefers.**

In addition to making an electronic submission of the code, you need to do a bit of documentation. This documentation includes (a) any instructions needed to run the code, and (b) a report that gives the value of alpha that you used, the training curve plot, and the final percentage-correct results for the training and test sets. Put this all together, stapled, with a cover sheet showing all team members' names.

For extra credit, implement your network so as to recognize all ten digits, 0, ..., 9. Or, experiment with varying the number of training examples, once you've selected the "best" number of hidden units and number of iterations.

Another possible extension is to take into account the knowledge you have about the problem domain. That is, some pixels should likely be weighted more than others because they are in important positions in the input image for distinguishing between the different digits. For more ideas on this, see the following papers, both of which are available online.

- Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, Handwritten digit recognition with a back-propagation network, in *Advances in Neural Information Processing Systems 2* (NIPS 89), D. Touretzky, ed., Morgan Kaufmann, 1990.
- Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik, Learning algorithms for classification: A comparison on handwritten digit recognition, in *Neural Networks: The Statistical Mechanics Perspective*, J. Oh, C. Kwon, and S. Cho, eds., World Scientific, 1995.

Finally, you can implement other extensions as you like. Write a separate report on the extra credit work you did.