ECS 271, Machine Learning: Project Guidelines
Due: Last day of classes


Instructor: Prof. Rao Vemuri, rvemuri@ucdavis.edu

The project is the MOST important part of this course, and its grading is largely subjective.

If I feel that you did not take this seriously, that belief will be reflected in the grade. If I feel YOU did not grade your fellow student's paper conscientiously, that too will be reflected in my grading.

To get an A in this course, you should do this project so well and write it up so well that I should feel compelled to ask you to submit it for publication – at least at the conference level. So no amount of vigorous hand waving will earn an A grade.

The secret to getting a good grade in this class is to start very early: identify which data sets you will use for learning (training) and select a method you want to use. You should do all of this during the first week, spend about 8 weeks actually working on the project, and spend the last week or two writing it up. The centerpiece of the paper is your analysis of the problem, the justification of the approach you selected, and the quality of your results.

This is NOT an MS thesis or a Ph.D. thesis. So you should choose a small subset of a bigger problem, do that little piece, and do it well. You should show your talent in carving out a piece that can be done reasonably in 8 weeks. I'd estimate that you should spend about 30 hours on this project.

How to choose a project:

Step 1. Pick a problem that you have been thinking about for a while. The problem domain should be familiar to you; so familiar, in fact, that you are considering it for a potential thesis or dissertation.

Step 2. Pick a method or two (from the book). It does not matter whether or not I cover that method in class. For example, it is NOT likely that I will cover Genetic Programming (GP), and I will try to cover Genetic Algorithms fairly late in the quarter. So if you choose GP or GA, you are mostly on your own: you do the reading, I am available for consultation, and I will take this into consideration while grading. If you choose Version Spaces (which will be covered early in the quarter), then I expect you to do some additional reading and bring in additional material I could not cover in class.

Step 3. Search the literature to see if anyone has done anything in this area. Your paper should contain a summary of what others have done in this area and what you expect to do.

Step 4. Implement your method, gather the results and arrange the results nicely using graphs, tables etc.

Step 5. Write a good discussion of what has been accomplished and what remains to be done.

Step 6. Write this up as if you were submitting it to a journal. Follow all the protocols. You must give sufficient detail that someone else can verify your work by reproducing your results.

Your score on the term paper depends on how carefully you defined the problem, how this problem relates to what others are doing, why you think the method you chose is a better choice than other methods, the quality of the results, and your write-up. Even the quality of your English carries some points.

What to submit:

Bring 3 hard copies of your paper to class on the last day of classes, in the standard two-column format used by journals. I can give you a precise definition of the template later. The paper should be no less than 6 pages and no more than 10 pages. This limit includes everything (analysis, figures, tables, references, appendixes, if any, etc.). Also send one PDF version electronically.

The program and its documentation are NOT counted in the page count. Please submit the code electronically, with instructions on how to run your program.

If it is important for you to show portions of a program, you can do so in the main body of the paper, as long as you stay within the 10-page limit.

You should not ask the reader of the paper to refer to the code. The paper should stand alone.

Those papers that meet a minimum standard of quality will be published as a report and you will get a copy of the report.

List of Potential Topics

TITLE 1: Text Classification with Bayesian Methods

DESCRIPTION: Given the growing volume of online text, automatic document classification is of great practical value and an increasingly important area for research. Naive Bayes has been applied to this problem with considerable success; however, it makes many assumptions about data distribution that are clearly not true of real-world text. This project will aim to improve upon naive Bayes by selectively removing some of these assumptions. I imagine beginning the project by removing the assumption that document length is independent of class, thus designing a new version of naive Bayes that uses document length to help classify more accurately. If we finish this, we'll move on to other assumptions, such as the word independence assumption, and experiment with methods that capture some dependencies between words. The paper at
http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps
is a good place to start some reading. You should be proficient in C programming, since you may modify rainbow:
http://www.cs.cmu.edu/~mccallum/bow/rainbow.
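To make the document-length idea concrete, here is a hedged, stand-alone Python sketch (not the rainbow/C codebase the project would actually modify): a multinomial naive Bayes classifier whose score optionally includes a per-class Poisson model of document length. The toy data and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Stores per-class word counts and
    the mean document length per class (for a Poisson length model)."""
    model = {"labels": Counter(lbl for _, lbl in docs), "n": len(docs),
             "wc": defaultdict(Counter), "vocab": set()}
    lengths = defaultdict(list)
    for words, lbl in docs:
        model["wc"][lbl].update(words)
        model["vocab"].update(words)
        lengths[lbl].append(len(words))
    model["mean_len"] = {l: sum(v) / len(v) for l, v in lengths.items()}
    return model

def predict(model, words, use_length=True):
    """Multinomial naive Bayes with Laplace smoothing, optionally adding a
    per-class Poisson log-likelihood for the observed document length."""
    best, best_lp = None, -math.inf
    V = len(model["vocab"])
    for lbl, n_lbl in model["labels"].items():
        lp = math.log(n_lbl / model["n"])          # class prior
        total = sum(model["wc"][lbl].values())
        for w in words:                            # word likelihoods
            lp += math.log((model["wc"][lbl][w] + 1) / (total + V))
        if use_length:                             # Poisson length term
            lam, k = model["mean_len"][lbl], len(words)
            lp += k * math.log(lam) - lam - math.lgamma(k + 1)
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

m = train_nb([("buy cheap pills now today deal".split(), "spam"),
              ("buy now discount offer deal sale".split(), "spam"),
              ("hi mom".split(), "ham"),
              ("see you".split(), "ham")])
```

With `use_length=False` the Poisson term disappears and the classifier reduces to standard naive Bayes, which is exactly the baseline the project would compare against.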

===============================================================

TITLE 2: Support vector machines for face recognition

DESCRIPTION: Face recognition is a learning problem that has recently received a lot of attention. One standard approach involves reducing the dimensionality of the problem using Principal Component Analysis (PCA) and then selecting the nearest class (eigenfaces). Support Vector Machines (SVM) are becoming very popular in the machine learning community as a technique for tackling high-dimensional problems. No one has yet (to my knowledge) applied SVMs to face recognition. Can SVMs outperform standard face recognition algorithms?

Issues that the student should address:
- How best to apply SVM to the n-class problem of face recognition;
- Figure out training and/or image preprocessing strategies (wavelets?);
- Determine how SVMs compare to other techniques (see notes)
Notes:
- A good implementation of SVMs is available (Thorsten's SVMlight);
- You can try to get access to two datasets used widely in the community, ORL and FERET, for training & testing;
- Results for eigenfaces, fisherfaces, and JPRC's face recognition system on these datasets are available, along with implementations, so comparing SVM to other algorithms will be straightforward.
- We can recommend tutorials and papers on SVMs to supplement what was covered in class, if needed.
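For orientation, the eigenfaces baseline mentioned above (PCA projection followed by nearest-neighbor classification) fits in a few lines of numpy. The tiny synthetic vectors below stand in for real face images, all names are illustrative, and in the project proper an SVM would replace the nearest-neighbor step.

```python
import numpy as np

def fit_pca(X, k):
    """Center the data and return the mean and the top-k principal directions."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def project(mu, comps, X):
    """Map images into the k-dimensional 'face space'."""
    return (X - mu) @ comps.T

def eigenface_classify(train_X, train_y, query, k=2):
    """Nearest neighbor in PCA space, i.e. the eigenfaces baseline."""
    mu, comps = fit_pca(train_X, k)
    P = project(mu, comps, train_X)
    q = project(mu, comps, query[None, :])[0]
    return train_y[int(np.argmin(np.linalg.norm(P - q, axis=1)))]

# Toy "faces": two people, two images each (real images would be
# flattened pixel vectors with thousands of dimensions).
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.9, 0.1, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9, 0.0]])
y = ["alice", "alice", "bob", "bob"]
```

Swapping the final nearest-neighbor step for an n-class SVM trained on the projected coordinates is one natural way to structure the comparison the project asks for.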

===============================================================
TITLE 3: Predictive Exponential Models

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see www.cs.cmu.edu/~roni/wsme.ps). Although the model was originally developed for language modeling, it can be used for prediction or classification in any domain. In this project you will be expected to read and understand this paper, and to apply the model to a Machine Learning problem of your choice. For example, you could choose one of the ML problem cases used in the course, and try to improve on the existing, "baseline", solution.

COMMENT: This project is open to more than one student. Each student could work on their own ML problem, or we can choose a larger problem for joint work.
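As a warm-up for the paper, the conditional form of an exponential model, p(y|x) proportional to exp(sum_i w_i f_i(x, y)), can be trained by plain gradient ascent when the label set is small; the whole-sentence sampling machinery of Chen & Rosenfeld is deliberately omitted here. A minimal sketch on an invented toy task:

```python
import math

def maxent_train(data, feats, n_labels, epochs=200, lr=0.5):
    """Fit p(y|x) ~ exp(sum_i w_i * f_i(x, y)) by gradient ascent on the
    conditional log-likelihood. feats are arbitrary functions f(x, y)."""
    w = [0.0] * len(feats)
    for _ in range(epochs):
        grad = [0.0] * len(feats)
        for x, y in data:
            scores = [sum(w[i] * feats[i](x, lbl) for i in range(len(feats)))
                      for lbl in range(n_labels)]
            z = sum(math.exp(s) for s in scores)
            probs = [math.exp(s) / z for s in scores]
            # gradient = observed feature value minus its model expectation
            for i, f in enumerate(feats):
                grad[i] += f(x, y) - sum(probs[l] * f(x, l)
                                         for l in range(n_labels))
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def maxent_predict(w, feats, x, n_labels):
    """Most probable label under the fitted exponential model."""
    return max(range(n_labels),
               key=lambda l: sum(w[i] * feats[i](x, l)
                                 for i in range(len(feats))))

# Hypothetical toy task: label 1 iff x is positive.
feats = [lambda x, y: x if y == 1 else 0.0,    # slope feature
         lambda x, y: 1.0 if y == 1 else 0.0]  # bias feature
w = maxent_train([(2.0, 1), (3.0, 1), (-2.0, 0), (-3.0, 0)], feats, 2)
```

The point of the feature functions is that they can encode anything about (x, y), which is the "arbitrary features" property the project would exploit.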

===============================================================

TITLE 4: Natural Language Feature Selection for Exponential Models

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see
www.cs.cmu.edu/~roni/wsme.ps). The model was originally developed for modeling of natural language, and has highlighted feature selection as the main challenge in that
domain. In this project you will be expected to read and understand this paper. Then, you will be given two corpora. The first one consists of transcribed over-the-phone
conversations. The second corpus is artificial, and was generated from the best existing language model (which was trained on the first corpus). Your job is to use
machine learning and statistical methods of your choice (and other methods if you wish) to find systematic differences between the two corpora. These differences
translate directly into new features, which will be added to the model in an attempt to improve on it. (An improvement in language modeling can increase the quality of language technologies such as speech recognition, machine translation, text classification, spellchecking, etc.)

COMMENT: This project is open to several students, who would be working separately.
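One hedged starting point for "systematic differences" is a smoothed log-odds ratio over word frequencies: words scoring far from zero are over- or under-represented in the real corpus relative to the artificial one, and are natural candidate features. The two tiny word lists below merely stand in for the actual corpora:

```python
import math
from collections import Counter

def log_odds(corpus_a, corpus_b, alpha=1.0):
    """Smoothed log-odds of each word's frequency in corpus A vs. corpus B.
    Large positive scores mean over-represented in A; alpha is the
    add-alpha smoothing constant."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    return {w: math.log(((ca[w] + alpha) / (na + alpha * len(vocab))) /
                        ((cb[w] + alpha) / (nb + alpha * len(vocab))))
            for w in vocab}

# Stand-ins for the transcribed (real) and model-generated corpora: a
# disfluency like "uh" is typical of real speech but absent from the model.
scores = log_odds("uh i uh mean uh you know".split(),
                  "i mean you know the the the".split())
```

On the real corpora the same idea extends from single words to n-grams or other patterns; each high-scoring pattern is a candidate feature for the exponential model.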

===============================================================

TITLE 5: Learning of strategies for energy trading

“In my design course, students write software agents for energy trading. Each agent is given an energy quota and money to spend to fill this quota, for each of about 50 consecutive periods. There are penalties for not filling the quota, including death (elimination) if the quotas are not filled in 5 consecutive periods. Agents obtain energy through a double auction. They submit bids and the highest bids win. After each round, all the bids are made public, so each agent knows what its competitors did in the past. The purpose is to spend as little money as possible and still meet one's quotas, that is, to anticipate what the other agents will bid in the next period, and then bid just enough to get the energy one needs.

“The students know little or nothing about automatic learning. They rely strictly on their intuition to devise their bidding algorithms. It would be interesting to see if some of your students could use automatic learning techniques to build winning agents.”

===============================================================

TITLE 6: Learning from labeled and unlabeled data

DESCRIPTION: The recent paper by Blum & Mitchell on co-training proposes an algorithm for learning from unlabeled as well as labeled data in certain problem settings (see www.cs.cmu.edu/afs/cs.cmu.edu/project/theo11/www/wwkb/colt98_final.ps). In this project you will be expected to read and understand this paper, and to extend the experimental results in this paper. In particular, I have some ideas for creating synthetic data sets that test the robustness of the algorithm to changes in the problem setting discussed in the paper.
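For intuition, here is a deliberately tiny co-training loop, far simpler than the Blum & Mitchell algorithm (no confidence scores or growth limits; each view classifier is just a majority-label lookup). Examples are pairs of categorical tokens, one per view, and all data are synthetic:

```python
from collections import Counter, defaultdict

def view_classifier(labeled, view):
    """Majority label for each token seen in the given view (0 or 1)."""
    votes = defaultdict(Counter)
    for x, y in labeled:
        votes[x[view]][y] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in votes.items()}

def co_train(labeled, unlabeled, rounds=5):
    """Each round, label any unlabeled example that either view's
    classifier recognizes, then retrain both classifiers."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        h1 = view_classifier(labeled, 0)
        h2 = view_classifier(labeled, 1)
        newly, rest = [], []
        for x in pool:
            if x[0] in h1:
                newly.append((x, h1[x[0]]))
            elif x[1] in h2:
                newly.append((x, h2[x[1]]))
            else:
                rest.append(x)
        if not newly:
            break
        labeled += newly
        pool = rest
    return labeled

# Synthetic two-view data: ("c", "z") can only be labeled after the other
# unlabeled example teaches view 2 the bridge token "z".
result = co_train([(("a", "x"), 1), (("b", "y"), 0)],
                  [("a", "z"), ("c", "z")])
```

The bridge example is the essence of co-training: each view's classifier labels data the other view could not, and the two bootstrap each other from a small labeled seed.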

===============================================================

TITLE 7: Similarity matching in high-dimensional space on discrete data

DESCRIPTION: Given a database with hundreds of attributes (or fields) and thousands of tuples (or records), finding similar tuples is very difficult, and we do not have any efficient algorithms for this task. I am looking for ideas for new algorithms that may prove to be effective. In this project, you will implement these algorithms and explore variants to determine their effectiveness.
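Purely as one candidate idea: for discrete attribute sets, MinHash signatures estimate Jaccard similarity from a short fixed-length sketch, so records can be compared cheaply regardless of how many attributes the database has. A stdlib-only sketch (the salted use of Python's `hash` is illustrative, not a production hash family):

```python
import random

def make_hashers(k, seed=0):
    """k salted copies of Python's built-in hash (illustrative only)."""
    rng = random.Random(seed)
    return [lambda x, s=rng.getrandbits(32): hash((s, x)) for _ in range(k)]

def signature(items, hashers):
    """MinHash signature: the minimum hash of the set under each function."""
    return [min(h(x) for x in items) for h in hashers]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing slots approximates the Jaccard similarity
    of the underlying attribute sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Records represented as sets of attribute=value strings (invented data).
hashers = make_hashers(64)
rec_a = signature({"city=davis", "state=ca", "status=active"}, hashers)
rec_b = signature({"city=davis", "state=ca", "status=active"}, hashers)
rec_c = signature({"city=reno", "state=nv", "status=closed"}, hashers)
```

With banding of the signature slots this extends to locality-sensitive hashing, which finds candidate similar pairs without comparing all record pairs; evaluating such variants on real high-dimensional data would fit the project's scope.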

===============================================================

TITLE 8: Using a repository of old text to answer new questions

DESCRIPTION: Consider a repository of email messages in which discussions center on living with a disease, such as heart disease or diabetes. Frequently, newly diagnosed people join the list, resulting in a good number of questions being asked repeatedly. Unfortunately, messages do not adhere to a restricted vocabulary, and so traditional web-based keyword searching is often ineffective. In this project, you will use and evaluate algorithms to generate responses to new email messages based on the repository of old email messages. You can begin with a Bayesian text classifier [as discussed in class: Lewis, 1991; Lang, 1995; Joachims, 1996] and a semantic generalization algorithm I have constructed and, based on your analysis, explore interesting variants to determine the effectiveness of this new approach.
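Independent of the classifier, retrieving the closest old thread can be sketched with plain TF-IDF and cosine similarity; this is only a baseline, and the semantic generalization algorithm mentioned above is not modeled here. The repository messages are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vector (a dict) for each tokenized document."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    vecs = [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]
    return vecs, idf

def cosine(u, v):
    num = sum(x * v[w] for w, x in u.items() if w in v)
    du = math.sqrt(sum(x * x for x in u.values()))
    dv = math.sqrt(sum(x * x for x in v.values()))
    return num / (du * dv) if du and dv else 0.0

def best_match(query_words, docs):
    """Index of the old message most similar to the new query."""
    vecs, idf = tfidf_vectors(docs)
    qv = {w: tf * idf.get(w, 0.0) for w, tf in Counter(query_words).items()}
    return max(range(len(docs)), key=lambda i: cosine(qv, vecs[i]))

# Invented repository of old messages (already tokenized):
old = ["what insulin dose should i start with".split(),
       "tips for checking sugar levels after meals".split(),
       "best exercise routines for beginners".split()]
```

Because exact-word matching is precisely the weakness the description points out, the interesting project work is in what replaces or augments this baseline (classification, semantic generalization of terms, etc.).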

===============================================================

TITLE 10: Newsletter that learns user interests

Create a personalized newsletter that looks at selected news sources and brings news stories that are of interest to a user. In the easier case, the user gives a list of keywords and your program brings the most relevant news stories and presents them to the user. In the more interesting case, you observe the behavior of the user, learn his/her interests, and bring the relevant news stories.
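The easier, keyword-driven case reduces to a scoring function; the learning version would adjust the scores from observed user behavior. A minimal sketch with made-up stories:

```python
def relevance(story_words, keywords):
    """Fraction of the user's keywords that appear in the story."""
    return sum(1 for k in keywords if k in story_words) / len(keywords)

def pick_stories(stories, keywords, threshold=0.5):
    """(title, score) pairs above the threshold, most relevant first.
    stories: list of (title, token_list)."""
    scored = [(title, relevance(set(words), keywords))
              for title, words in stories]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: -s[1])

# Invented stories standing in for fetched news items.
stories = [("Fed raises rates",
            "fed raises interest rates to cool economy".split()),
           ("Local team wins",
            "team wins big game in overtime".split())]
```

In the learning case, the keyword list and weights would not be given by the user but estimated from which presented stories he/she actually reads.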

TITLE 11: Data Mining to extract security-relevant knowledge

The objective of this project is to develop a demonstrable information-mining concept that relies on flexible and smart software tools to mine extant security-relevant network information from databases and to use that, in conjunction with data visualization techniques, to extract actionable knowledge.

To reflect reality, not all databases under study are considered identical in terms of the data attributes they record; the methods proposed should therefore be flexible enough to handle the data points available in some databases and to include others where they exist. It is further assumed that the data entries in the databases include Netflow records for all data entering and leaving networks and IDS tool alerts from Snort for the same traffic, but may contain only a scattering of syslog data and even less host-based data such as Tripwire. It is also assumed that packet-level data for selected sites is in libpcap format. No assumptions are made about sensor placement other than that the sensors are located somewhere on the base between the outside network and the internal one. Some sites may have richer content, but this project does not explicitly count on the availability of such data, although all available data can be used to benefit the final analyses. Similar comments apply to the availability of metadata, i.e., data about data.

The questions we propose to answer are these: What do we do with this data? What actionable information can be extracted from these records? What methodologies would be suitable to interface the tools with existing databases?

TITLE 12: Threat Assessment and Adversary Characterization using Computational Intelligence
 
Threat Assessment and Adversary Characterization revolve around (a) theoretical issues, to gain an understanding of and an ability to anticipate an adversary in order to build improved threat models, and (b) pragmatic issues, to develop improved profiling of attackers at the post-attack and forensic levels. In the absence of a reliable marker to characterize an adversary, and in view of the evolution in technologies, vulnerabilities, and attack modes, there is a need for a corresponding evolutionary component in adversary models, analysis methods, and tools; that is, a need to develop methods for creating and maintaining a "dynamically refined" picture of threats and adversaries. A first step in developing a historical perspective on this problem is to start with attack information (e.g., techniques and tools used, network-generated alerts, systems’ operational logs, etc.). Though this type of data continues to increase, the resources to analyze it (i.e., human analysts) remain relatively fixed. Commercial “correlation” packages that normalize data streams and insert them into a relational database can do little more than organize the data, and shed little light on the security situation of monitored networks. Very little effort has been expended on intelligent automated processes that can use these collected stores of security-relevant information to yield actionable intelligence.

Using raw and compressed payload data (i.e., input to and output from Bloom filters), header data, and other collateral data such as inter-arrival times, packet types (e.g., audio, video, text), histograms, etc., we can develop and validate computational intelligence techniques, such as those based on neural networks, hierarchical clustering methods, self-organizing maps, and Bayesian nets, that can detect slowly evolving “attack onset signatures” which can be used to characterize attacking entities. Success at this stage means the possibility of developing an early warning system to thwart an attack or catch an attacker.

TITLE 13: Intrusion detection

Use of decision trees, RBF neural networks, principal component analysis, clustering, etc., in intrusion detection. Treatment of incomplete and incorrect data sets.

How much data is sufficient for training? How do you handle cases where you do not have sufficient training data?
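On the incomplete-data question, one simple tactic is to score each candidate attribute only on the records where it is present. A hedged one-attribute decision-stump sketch over invented connection records:

```python
from collections import Counter, defaultdict

def learn_stump(train, attrs):
    """Pick the attribute whose values best predict the label, scoring each
    attribute only on the records where that attribute is present.
    train: list of (attribute_dict, label); fields may be missing."""
    best = None
    for a in attrs:
        rows = [(x[a], y) for x, y in train if a in x]
        if not rows:
            continue
        votes = defaultdict(Counter)
        for v, y in rows:
            votes[v][y] += 1
        acc = sum(c.most_common(1)[0][1] for c in votes.values()) / len(rows)
        if best is None or acc > best[0]:
            best = (acc, a, {v: c.most_common(1)[0][0]
                             for v, c in votes.items()})
    return best[1], best[2]

def stump_predict(attr, table, record, default="ok"):
    """Fall back to a default label when the attribute or value is unseen."""
    return table.get(record.get(attr), default)

# Invented connection records: 'port' is predictive, 'proto' is noisy,
# and some fields are simply missing.
conns = [({"port": "80", "proto": "tcp"}, "ok"),
         ({"port": "6667", "proto": "tcp"}, "bad"),
         ({"proto": "tcp"}, "ok"),
         ({"port": "80"}, "ok"),
         ({"port": "6667", "proto": "udp"}, "bad")]
attr, table = learn_stump(conns, ["port", "proto"])
```

Full decision-tree learners handle missing values with the same per-attribute restriction (or with fractional instance weights); the stump is just the smallest place to see the idea.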

TITLE: New Approaches to Sharing Data Across Multiple Security Domains

OBJECTIVE: Define secure, innovative new methods for transferring as much—but no more—of the operational data needed to enable effective cooperation between groups that are trying to accomplish a common mission.

Despite the use of many techniques for limiting the sharing of information, actual sharing too often occurs in one of two modes: nothing or everything. This is especially true in crisis situations, in which there is simply not enough time for human users of a system to evaluate what information should be shared and what should not. New concepts of trusted platforms operating across multiple security domains are needed to help automate and simplify the use of policies that control the data-sharing process, while at the same time ensuring that the levels of rapid data sharing needed to maintain functionality and meet joint goals remain intact. Such techniques would have immediate beneficial effects on a broad range of common communication methods and situations, including file transfers, email, and collaboration, and would be integral to coalition-oriented future concepts of military operations.

Define and demonstrate a set of well-defined, innovative concepts for addressing the problem of how to share just enough data, but no more, between diverse groups in a cooperative endeavor. The definitions should address the issue of how to characterize both the data to be shared and the receiving entities in a consistent, broadly applicable fashion. The approach should be scalable, so that requirements for processing a sharing request remain relatively flat during normal operational use. The demonstration of the principles involved should explore them in sufficient detail and with a wide enough range of examples to permit realistic evaluation of the approach.

TITLE 14: An agent that learns how to play the game of Checkers

There are many interesting problems dealing with inserting intelligence into games. Here you have to be careful to choose a game that you know and enjoy, and ask yourself three separate questions: (a) Given the rules of the game, how can you teach an “agent” to play the game? (b) If the game rules are not explicitly given (but are deterministic), can you train an agent against an opponent (who knows the rules) and learn to play the game? (c) If the rules are not explicitly spelled out, but the opponent plays the game according to some policy, then can you learn the rules and play the game?
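Checkers itself is far too large for a short sketch, but the shape of case (a) can be illustrated with tabular Q-learning (one possible technique, not prescribed above) on a toy race-to-10 game: move 1 or 2 squares per turn, and each move costs one point.

```python
import random

GOAL, ACTIONS = 10, (1, 2)

def q_learn(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a toy race-to-GOAL game. Each move costs 1,
    so the learned policy should prefer the 2-square move."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s < GOAL:
            if rng.random() < eps:                      # explore
                a = rng.choice(ACTIONS)
            else:                                       # exploit
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            s2 = min(s + a, GOAL)
            # one-step Q-learning target: reward plus discounted best value
            target = -1 + (0.0 if s2 == GOAL
                           else gamma * max(Q[(s2, b)] for b in ACTIONS))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learn()
```

Since reaching the goal in fewer moves is cheaper, the learned Q-values come to prefer the 2-square move; scaling the same loop to checkers requires a value-function approximation rather than a table, which is where the real project work lies.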

TITLE 15: Use of Belief nets to model a practical problem of interest to you

TITLE 16: Use of a genetic algorithm other than the standard GA

and so on and on.