CS 224 Fall 2011
String Algorithms and Algorithms in Computational Biology
Tues, Thurs 1:40 - 3:00, 244 Olson
Dan Gusfield

There is also a discussion section that may be used for student presentations at the end of the quarter. We will schedule that later if needed. Otherwise there will be no discussion section.

The formal prerequisite for this class is CS 222A. But whether or not you have had that course, what is critical is that you understand the basic game plan of worst-case algorithm analysis, that you can think about algorithms without needing to see low-level programming details, and that you can follow and construct proofs of algorithmic properties.

The focus of the class is on algorithms that illustrate techniques and problems relevant to computational biology, but it is not a course on practical computational biology or bioinformatics. That course is CS 124, which is not a prerequisite for CS 224.

The class will be a mixture of three major topic areas:

1) string algorithms, particularly those based on suffix arrays and suffix trees
2) algorithms and combinatorial structure for phylogenetic networks (for this topic we will refer to the recent book "Phylogenetic Networks" by Huson et al.)
3) algorithms to deduce phylogenetic networks involving recombination as well as mutation

In all three areas of concentration, algorithmic efficiency and combinatorial structure will be emphasized. The phylogenetic topics also cover combinatorial optimization methods and integer programming. Open problems will be identified. In the five times that this class has been offered, at least two published papers have resulted from open problems examined in the class.

There will be regular homeworks and a final exam (most likely take-home). Depending on enrollment and other factors, students may also have the opportunity to read a current paper (I will suggest some later) and present it in the scheduled discussion section.
The following lecture topic list is subject to revision as we go. There is no assigned textbook, but the Huson book mentioned above will be used for the second part of the course (get it through Amazon). Course notes or copied materials will be handed out on the other topics.

The main topics are:

1. Non-trivial applications of dynamic programming: hybrid dynamic programming with suffix trees, linear-space applications, Four Russians algorithms, improvements in RNA folding algorithms, circular string edit distance.

2. Linear-time construction of suffix trees, suffix arrays, and LCP information; linear-time preprocessing methods for constant-time least common ancestor and longest common extension queries. Many possible applications, such as rapid identification of pathogens, finding all tandem repeats in linear time, use of LCA for accumulating common substring statistics, approximate tandem repeat finding, fast unique decipherability using suffix trees, and LZ sequence compression.

3. The Burrows-Wheeler transform and its use in sequence analysis in computational biology.

4. Phylogenetic networks as treated in Huson's book.

5. Phylogenetic networks involving recombination: galled trees, decomposition, construction of good networks, lower bounds on the number of recombinations needed, many uses of integer programming, and applications to topics such as genome-wide association studies. Phylogenetic networks involving hybridization and other reticulation events; relation to SPR and maximum agreement forests. Introduction to the open field of extending the binary case to the multi-state case.
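
For readers who have not yet met the objects named in topics 2 and 3, the following is a minimal Python sketch of a suffix array, its LCP array, and the Burrows-Wheeler transform read off from the suffix array. It is an illustration only, not course material: the function names are made up for this example, the input is assumed to end in a unique sentinel character `$` that sorts before every other character, and the construction is the naive sorting approach rather than the linear-time methods listed above.

```python
def suffix_array(text):
    """Start positions of the suffixes of text, in lexicographic order.

    Naive construction by directly sorting the suffixes (roughly
    O(n^2 log n) in the worst case); for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the length of the longest common prefix of the suffixes
    starting at sa[k-1] and sa[k]; lcp[0] is defined as 0.

    Computed by direct character comparison, not a linear-time method.
    """
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def bwt(text, sa):
    """Burrows-Wheeler transform: for each suffix in sorted order, take the
    character that precedes it in text (wrapping around for position 0).

    Assumes text ends with a unique sentinel such as '$'.
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    s = "banana$"
    sa = suffix_array(s)
    print(sa)                # [6, 5, 3, 1, 0, 4, 2]
    print(lcp_array(s, sa))  # [0, 0, 1, 3, 0, 0, 2]
    print(bwt(s, sa))        # annb$aa
```

The sketch trades efficiency for brevity; the point of topic 2 is that the same arrays can be built in linear time.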