In addition to the several papers already posted on the website, here are some other suggestions for student presentations. I may add to this list later. Students need to pick soon.

1. Factorizations of strings: the critical factorization and the Lyndon factorization. See the book Text Algorithms by Crochemore et al., and Combinatorics on Words by Lothaire.

2. DAWGs (Directed Acyclic Word Graphs) and their relationship to suffix trees. Proof of the linear size of DAWGs and of the linear-time construction of a DAWG. See Text Algorithms.

3. A polynomial-time algorithm for the perfect phylogeny problem when the number of character states is fixed. Agarwala and Fernandez-Baca. SIAM J. Computing, Vol. 23, No. 6, pp. 1216-1224.

4. Mitochondrial Portraits of Human Populations using Median Nets. Bandelt et al. Genetics 141: 743-755 (October 1995). The key thing here is to understand the relationship between median nets and Buneman graphs, and the theorem proven in the paper that these nets or graphs contain all the minimum parsimony solutions.

5. Four characters suffice to convexly define a phylogenetic tree. Huber, Moulton and Steel. SIAM J. Discrete Math, vol. 18, no. 4, pp. 835-843, 2005. This shows that with multistate characters (i.e., more than binary) one can uniquely identify the tree with only four characters. There is an earlier paper that shows the result for five characters, which might be easier.

6. A note on unique decipherability. Christoph Hoffmann. LNCS vol. 176, pp. 50-63, Sept. 1984. This gives a faster algorithm for the problem in some special cases. It is the outcome of a more general attempt to get a linear-time algorithm for the problem. This is a long mathematical paper, and one would need to find the core material to present. This would be a hard paper, I think.

7. Reconstructing Trees from Subtree Weights. Lior Pachter and David Speyer, July 7, 2006.
http://arxiv.org/pdf/math/0311156
Also in Applied Math Letters, vol. 17, no. 6, pp. 615-621 (2004).
Abstract: The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree metric, and has served as the foundation for numerous distance-based reconstruction methods in phylogenetics. Our main result is an extension of the tree-metric theorem to more general dissimilarity maps. In particular, we show that a tree with n leaves is reconstructible from the weights of the m-leaf subtrees provided that n >= 2m-1.

8. Solving the String Statistics Problem in Time O(n log n). Brodal et al. In the book Automata, Languages and Programming, volume 2380, p. 772, Springer, 2002. ISBN 978-3-540-43864-9. DOI 10.1007/3-540-45465-9_62

9. Bryant, D. and Lagergren, J. Compatibility of unrooted phylogenetic trees is FPT. Theoretical Computer Science 351: 296-302 (2006)

10. Pachter et al. Why Neighbor-Joining Works. Algorithmica 54(1): 1-24 (2009)

11. Vingron et al. An Improved Algorithm for the Macro-evolutionary Phylogeny Problem. CPM 2006: 177-187

12. Zhi-Zhong Chen, Lusheng Wang, Zhanyong Wang. Approximation Algorithms for Reconstructing the Duplication History of Tandem Repeats. Algorithmica 54(4): 501-529 (2009)

13. Wang et al. Efficient Algorithms for the Closest String and Distinguishing String Selection Problems. LNCS 5598, Springer, 2009, p. 261-

14. For someone very mathematical, one project is to prove a surprising necessary and sufficient condition (NASC) for a code to be maximally uniquely decypherable (MUD). A code is defined to be MUD iff it is uniquely decypherable (UD) and no additional codeword can be added to the code without making it not UD. There is an amazing NASC for this, but the only proof I know of it is contained in two `homework' exercises in a very mathematical book on coding theory.
I have been able to do one of those homeworks, but I am stuck on the other, and I am not even sure how those two homeworks add up to a proof of the NASC. But someone more mathematical than I am might want to have a go at this. The NASC for a UD code C to be maximally UD is that S(C,p) = 1 when p is the uniform distribution. I suspect this is also true for any distribution p. Recall that S(C,p) is defined in the posted notes on unique decypherability, and that a necessary condition for a code C to be UD is that S(C,p) <= 1 for any p.

15. Another project for anyone interested in the topic of UD codes is to read up on and present some of the problems and results on related definitions of decypherability. See me if interested.
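
For anyone considering topic 14, it may help to experiment with small codes before attempting the proof. The following is only a rough sketch in Python, not taken from the posted notes. It ASSUMES that S(C,p) is the sum, over all codewords w in C, of the product of p(a) over the letters a of w; the posted notes give the authoritative definition. The UD check uses the standard Sardinas-Patterson test, and the function names s_value, uniform and is_ud are made up for this illustration.

```python
from fractions import Fraction


def s_value(code, p):
    """ASSUMED definition: S(C,p) = sum over codewords w of prod_{a in w} p[a]."""
    total = Fraction(0)
    for w in set(code):
        prob = Fraction(1)
        for a in w:
            prob *= p[a]
        total += prob
    return total


def uniform(alphabet):
    """Uniform distribution on the given alphabet, as exact fractions."""
    return {a: Fraction(1, len(alphabet)) for a in alphabet}


def is_ud(code):
    """Sardinas-Patterson test; codewords are assumed to be non-empty strings."""
    code = set(code)

    def remainders(prefixes, words):
        # Non-empty v such that u + v = w for some u in prefixes, w in words.
        return {w[len(u):] for u in prefixes for w in words
                if len(w) > len(u) and w.startswith(u)}

    seen = set()
    todo = list(remainders(code, code))  # initial dangling suffixes
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        if s in code:
            return False  # a dangling suffix is itself a codeword: not UD
        seen.add(s)
        todo.extend(remainders(code, {s}) | remainders({s}, code))
    return True


if __name__ == "__main__":
    p = uniform("01")
    for c in [{"0", "10", "11"}, {"0", "10"}, {"0", "01", "10"}]:
        print(sorted(c), "UD:", is_ud(c), "S(C,p) =", s_value(c, p))
```

Under this assumed definition, the code {0, 10} is UD with S(C,p) = 3/4 for uniform p, and it is not maximal, since 11 can be added to give {0, 10, 11}, which is UD with S(C,p) = 1. The code {0, 01, 10} also has S(C,p) = 1 but is not UD (010 parses two ways), which is why the NASC is stated only for codes already known to be UD.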