* 1. historical of assembly Here are some historically important papers on the theory of de novo assembly (not in the practical aspect). It was edited from the note I took for writing the fermi paper. I was quite confused with the history and theory on de novo assembly at that time. The note is a little lousy but might be useful to someone in case. Staden, 1979; Gingeras et al, 1979: first attempt to use a computer program to assist sequence assembly. Peltola et al, 1984: first mathematical formulation sequence assembly and the emerge of the overlap-layout-consensus (OLC) paradidgm. OLC consists of three steps: 1) overlap: finding overlaps between all pairs of reads; 2) layout: arrange the reads based on overlaps; 3) consensus: derive the final sequence. In that paper, the authors tried to find a layout that minimizes the assembly. Myers, 1995 (a legendary figure in computional biology); Kececioglu and Myers, 1995: probably the first graph representation of sequence assembly - the overlap graph. In an overlap graph, a vertex represents a read and an edge represents an overlap. We can find the layout by finding the optimal path through each vertex - an NP-hard Hamilton problem. However, Myers et al. did not stop there. They proposed transitive reduction, a procedure to dramatically reduce the redundancy of overlaps. And after transitive reduction, solving sequence assembly is not a Hamilton problem any more. Idury and Waterman, 1995 (the Waterman in Smith-Waterman): the birth of the de Bruijn paradigm. I remember that a paper mentioned that both Myers and Waterman participated the 4th DIMACS Implementation Chanllenges in 1994. They proposed two distinct graph representations of sequence assembly, which have such a fundamental incluence even today. Myers et al, 2000: publication of Celera assembler, arguably the best assembler for short-gun data at that time. This assembler is largely a practical implementation of the theory in their 1995 papers. Pevzner et al, 2000: publication of the Euler assembler, the first assembler based on the de Bruijn graph. It solved two major problems with the original theory published in 1995 with spectrum error correction and read threading, a procedure to re-impose the full-length read information to the de Bruijn graph. The paper unfortunately spread a wrong notion which can still be seen in a few reviews nowadays: OLC is an NP-hard Hamilton problem, but de Bruijn graph can be solved in linear time and thus is more advanced. However, in fact, Gene Myers with his Celera assembler never tried to reduce OLC to a Hamilton problem. I will come back to the time complexity of de Bruijn later. Myers, 2005: string graph. String graph is largely a different representation of overlap graph. Medvedev et al, 2007: de novo assembly is NP-hard, which is true for both the OLC and the de Bruijn paradigm. A catch is that although constructing the de Bruijn graph and finding the optimal path is linear, resolving the optimal read threading is NP-hard and without threading, we cannot fully use all the information in reads. Zerbino et al, 2009: the publication of velvet. This is a landmark paper on short-read assembly. Graph cleaning (e.g. tip trimming and bubble popping) was first introduced in this paper and is now widely used in essentially every short-read assembler. Simpson and Durbin, 2010: the first application of FM-index in sequence assembly. This paper resolved a long-standing difficulty in applying OLC to short reads: finding the overlaps. After this paper, two other fast overlap finder were also published (Dinh and Rajasekaran, 2011; Gonnella and Kurtz, 2012) which are claimed to be much faster and more lightweight. Earl et al, 2011: Assemblathon 1. In my view, this is undoubtedly the best paper on the evaluation of de novo assemblers. ADD COMMENT • link modified 2.9 years ago by Gjain ♦ 4.6k • written 2.9 years ago by lh3 ♦ 26k Also, the correct ref. is Pevzner et al, 2001 PNAS, not Pevzner et al, 2000 for the EULER assembler. I think these two older classic papers are deserve citations too (they are not in the link above): Hutchinson, G. 1969. Evaluation of polymer sequence fragment data using graph theory. Bull. Math. Biophys. 31, 541–562. http://link.springer.com/article/10.1007%2FBF02476636 Gallant, J.K. 1983. The complexity of the overlap method for sequencing biopolymers. J. Theor. Biol. 101, 1–17. http://www.sciencedirect.com/science/article/pii/0022519383902709 * 2. current comparison of assembly http://omictools.com/genome-assembly-category - Sequence assembly demystified - A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies - Assembly algorithms for next-generation sequencing data - A.M. Phillippy, M.C. Schatz, M. Pop. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol., 9 (2008), p. R55 - The Theory and Practice of Genome Sequence Assembly Annual Review of Genomics and Human Genetics Vol. 16: 153-172 (Volume publication date August 2015) First published online as a Review in Advance on April 22, 2015 DOI: 10.1146/annurev-genom-090314-050032 - Current challenges in de novo plant genome sequencing and assembly - Assembly of large genomes using second-generation sequencing. - A field guide to whole-genome sequencing, assembly and annotation assesement of genome assembly: -REAPR: a universal tool for genome assembly evaluation -QUAST: Quality Assessment Tool for Genome Assemblies http://bioinf.spbau.ru/en/spades - https://www.biostars.org/p/52178/ ** Assemblathon, Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species * 3. long reads MIRA and Celera Assembler (a upcoming renaissance for Phrap ** 3.1 https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads PacBio long reads can be used in a number of ways to generate and improve de novo assemblies for large genomes. You can take several different approaches: PacBio-only de novo assembly. Using just PacBio reads from a long insert library, the reads are often preprocessed before being assembled using an Overlap-Layout-Consensus algorithm. The best known implementation of this is HGAP. Hybrid de novo assembly. Using a combination of PacBio and short read data, the reads are used together during assembly to generate a hybrid assembly. ** 3.2 bioRxiv A practical guide to de novo genome assembly using long reads