For this assignment you'll write a utility, freq, which identifies the most frequently occurring words in a document. It will also count the total number of words and the number of distinct words. We want you to choose good data structures and algorithms so that your program will run fast.
We are not going to tell you how to do this well; you need to think about it yourself and make some good choices. But do read Sections 5.1-5.4 (hashing) and 6.3 (binary heaps) in your text, and explain your choices in your design document.
For purposes of this assignment, a word is a maximal-length string of upper-case or lower-case letters. Words are case-insensitive: hello, Hello and HELLO are all the same word. A file consists of zero or more words, separated by filler. Filler is anything that isn't an upper-case or lower-case letter. Your program will ignore filler and pay attention only to words.
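To make the definition concrete, here is a minimal Python sketch of one way to split text into words. The names WORD_RE and words are hypothetical, chosen only for illustration, and a regular expression is just one possible approach; a hand-written character scan would do the same job.

```python
import re

# A word is a maximal run of upper- or lower-case letters.
# Everything else (digits, punctuation, whitespace) is filler.
WORD_RE = re.compile(r"[A-Za-z]+")

def words(text):
    """Yield each word in text, folded to lower case."""
    for match in WORD_RE.finditer(text):
        yield match.group().lower()

print(list(words("Hello, HELLO world-wide 42nd")))
# prints: ['hello', 'hello', 'world', 'wide', 'nd']
```

Note that under this definition "world-wide" is two words, and the "nd" in "42nd" counts as a word because the digits are filler.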
First, your program should report the number of words and the number of distinct words in a file. For example, the first line of your output might look like: "561 words, 325 distinct words".
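As a rough illustration of how the two counts relate, the following Python sketch tallies words in a hash table (a Python dict). The function name count_words is hypothetical, and it assumes the words it receives have already been extracted and lower-cased. It is a sketch of one reasonable choice, not the required design.

```python
def count_words(word_iter):
    """Return (total, counts) where counts maps each distinct word to its tally."""
    counts = {}
    total = 0
    for w in word_iter:
        # Expected constant-time lookup and update in the hash table.
        counts[w] = counts.get(w, 0) + 1
        total += 1
    return total, counts

total, counts = count_words(["the", "cat", "saw", "the", "dog"])
print(f"{total} words, {len(counts)} distinct words")
# prints: 5 words, 4 distinct words
```

The number of distinct words is simply the number of keys in the table, so no second pass over the input is needed.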
Allow your program to take a number k from the command line, specified with the "-k" flag. Give k a default value (i.e., the value of k when no number is explicitly specified) of 100. Print out the k most common words, identifying for each the percentage of the total words that this word comprises. For example, one line of your output might say:
Print this list of popular words and their percentages in order of most-popular-word, second-most-popular-word, ..., k-th-most-popular-word. (Of course if there are fewer than k distinct words, just list however many there are, in order of popularity.)
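One way to pick out the k most popular words without sorting every distinct word is a heap-based selection. The Python sketch below uses the standard heapq.nlargest for this; the function name top_k and its parameters are hypothetical. It assumes counts maps each distinct word to its tally and total is the overall word count. How ties are ordered is not specified by the assignment, so this sketch simply takes whatever order nlargest produces.

```python
import heapq

def top_k(counts, total, k=100):
    """Return up to k (word, percentage) pairs, most frequent first."""
    # nlargest avoids a full sort when k is much smaller than
    # the number of distinct words.
    best = heapq.nlargest(k, counts.items(), key=lambda item: item[1])
    return [(word, 100.0 * n / total) for word, n in best]

counts = {"the": 3, "cat": 2, "dog": 1}
for word, pct in top_k(counts, 6, k=2):
    print(word, round(pct, 2))
# prints:
# the 50.0
# cat 33.33
```

If there are fewer than k distinct words, the returned list is just shorter, which matches the rule above.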
Have your program report what percentage of the total words the k most popular words comprise. For example, your program might say "The 100 most popular words make up 7.45% of the total."
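The arithmetic here is the sum of the listed words' counts divided by the total word count. A small Python sketch follows, with a hypothetical function name coverage_line. It makes three assumptions the assignment does not settle: the count shown is the number of words actually listed (which may be fewer than k), the percentage is printed to two decimal places as in the example, and an empty file reports 0.00%.

```python
def coverage_line(top_counts, total):
    """Format the summary line for the listed words' combined share."""
    shown = len(top_counts)
    # Guard against an empty file, where total is zero.
    share = 100.0 * sum(top_counts) / total if total else 0.0
    return f"The {shown} most popular words make up {share:.2f}% of the total."

print(coverage_line([3, 2], 6))
# prints: The 2 most popular words make up 83.33% of the total.
```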
Use the CPUTimer (or AutoCPUTimer) to output how many CPU seconds your program takes from start to finish. Start timing before reading in any data, and stop timing after you have written all of the required statistics.
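The placement of the timing calls can be sketched as follows. CPUTimer and AutoCPUTimer are course-provided and their interfaces are not shown here, so this Python sketch uses the standard time.process_time as a stand-in; the helper name timed is hypothetical. The point is only where the start and stop readings go: around all of the work, including input and output.

```python
import time

def timed(work):
    """Run work() and return (result, CPU seconds used)."""
    start = time.process_time()   # start before reading any data
    result = work()               # read, count, select, and print here
    elapsed = time.process_time() - start   # stop after all output
    return result, elapsed

result, seconds = timed(lambda: sum(range(100000)))
print(f"{seconds:.3f} CPU seconds")
```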
Call your executable freq. It should read from a named file, if you give it one, or from standard input, if you don't. So to run your program on a file named myfile you'd either say freq myfile or freq < myfile. To specify a non-default value of k you might say freq -k 25 myfile.
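The three invocations above can be handled with a small argument parser. This is a Python sketch using the standard argparse module; the function names parse_args and read_input are hypothetical, and decoding the file as Latin-1 is an assumption made so that any byte sequence can be read without error.

```python
import argparse
import sys

def parse_args(argv):
    """Parse freq's arguments: an optional -k and an optional file name."""
    parser = argparse.ArgumentParser(prog="freq")
    parser.add_argument("-k", type=int, default=100,
                        help="how many of the most common words to list")
    parser.add_argument("file", nargs="?", default=None,
                        help="input file; standard input is read if omitted")
    return parser.parse_args(argv)

def read_input(path):
    """Return the whole input as text, from path or from standard input."""
    if path is None:
        return sys.stdin.read()
    with open(path, encoding="latin-1") as f:
        return f.read()

args = parse_args(["-k", "25", "myfile"])
print(args.k, args.file)
# prints: 25 myfile
```

In a real program the argument list would be sys.argv[1:] rather than a literal list.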
Make your program efficient; it should run fast on large files. Achieve efficiency by making good choices for the data structures and their algorithms. (E.g., don't sort all of the words, and don't maintain a linked list of distinct ones.) Do not try to make your program speedy at the expense of making it clear and well-structured; you can (and should) have both. Concentrate on performing well when k is small (100 or less).
Show us how fast your program runs by running it on the test case ~cs110/hw3/big, which we are providing. This file (consisting of GNU documentation) is 2 MBytes.
As always, submit softcopy and hardcopy. Softcopy includes a high-quality writeup (including efficiency analysis), plus executions and a listing of your code. Test your program on a few small files and then the big one, big, that we provide.
Same rules as before - best time wins! Run your program on the DECs and bring in your executions on file big. We'll hold the contest on May 20. Who knows; maybe the prize will be even BETTER than last time!