Difference between revisions of "How-to/de novo assembly"

From SEQwiki
= De-novo short read assemblers =
<big>
'''Note''': This content has been ported to the new WikiBook project:
http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/De_novo_assembly

Previous contributors should ensure they take proper credit for their work here:
http://en.wikibooks.org/w/index.php?title=Next_Generation_Sequencing_(NGS)/Authors
</big>

The generation of short reads by next-generation sequencers has led to an increased need to assemble the vast numbers of short reads that are produced. This is no trivial problem, as the sheer number of reads makes it nearly impossible to use, for example, the overlap-layout-consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical Illumina data use a k-mer-based de Bruijn graph approach.

A clear distinction has to be made according to the size of the genome to be assembled:

* small (e.g. bacterial genomes: a few megabases)

* medium (e.g. lower plant genomes: several hundred megabases)
 
* large (e.g. mammalian and plant genomes: Gigabases)
 
 
 
All de-novo assemblers will be able to cope with small genomes, and - given decent sequencing libraries - will produce relatively good results.
 
Even for medium-sized genomes, most of the de-novo assemblers mentioned here, and many others, will likely fare well and produce a decent assembly. That said, OLC-based assemblers might take weeks to assemble a typical genome of this size.
 
Large genomes are still difficult to assemble when only short reads (such as those produced by Illumina instruments) are available.

Assembling such a genome with Illumina reads will probably require a machine with about 256 GB, and potentially even 512 GB, of RAM, unless one is willing to use a small cluster ([[ABySS]], [[RAY]], [[Contrail]]) or to invest in commercial software (CLC).
 
 
 
 
 
Some steps that are likely common to most assemblies:
 
 
 
# If it is feasible and would not compromise the biology, try to obtain DNA from haploid or at least mostly homozygous individuals.
 
# Make sure that the quality of all libraries is acceptable and that there are no major concerns (e.g. by using FastQC).
 
# For paired-end data, you might also want to estimate the insert size from draft assemblies or from assemblies you have already made.
 
# Before submitting data to a de-novo assembler, it is often a good idea to clean the data, e.g. to trim away bad bases towards the ends of reads and/or to drop reads altogether. As low-quality bases are more likely to contain errors, they might complicate the assembly process and lead to higher memory consumption (more is not always better). That said, several general-purpose short read assemblers, such as SOAP de novo and ALLPATHS-LG, can perform read correction prior to assembly.
 
# Before running any large assembly double and triple check the parameters you feed the assembler.
 
# After assembly, it is often advisable to check how well your read data really agrees with the assembly and whether there are any problematic regions.
 
# If you run de Bruijn graph-based assemblies, you will want to try different k-mer sizes. Whilst there is no rule of thumb for any individual assembly, smaller k-mers would lead to a more tangled graph even if the reads were error-free, whereas larger k-mers would yield a less tangled graph, given error-free reads. However, a smaller k-mer size is likely to be more robust to sequencing errors, and too large a k might not yield enough edges in the graph and would therefore result in small contigs.
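The effect of the k-mer size on graph tangling can be sketched on toy data. The following is a minimal illustration only, not the data structure of any particular assembler: the sequence, the read length and the helper names (<code>build_de_bruijn</code>, <code>count_branching_nodes</code>, <code>count_edges</code>) are made up for this example, and the reads are error-free. It shows that a small k gives branching nodes around a short repeat, a larger k resolves the repeat, and a k longer than the reads yields no edges at all.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        # Every k-mer in a read contributes one edge: its prefix -> its suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branching_nodes(graph):
    """Count nodes with more than one outgoing edge (a rough measure of tangling)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


def count_edges(graph):
    """Count the distinct k-mers (edges) in the graph."""
    return sum(len(successors) for successors in graph.values())


# Hypothetical 21-base sequence containing the 7-base repeat GCGTACG twice.
genome = "ATGCGTACGTTAGCGTACGGA"
read_length = 12
# Error-free reads starting at every possible position.
reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

for k in (4, 9, 13):
    g = build_de_bruijn(reads, k)
    print("k =", k, "edges =", count_edges(g), "branching nodes =", count_branching_nodes(g))
```

With real data, sequencing errors add spurious k-mers, which is why the largest possible k is not automatically the best choice.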
 
 
 
= Decision Helper =
 
 
 
This is based both on personal experience as well as on published studies. Please note however that genomes are different and software packages are constantly evolving.
 
 
 
The Assemblathon challenge, which used a synthetic diploid genome, was reported on by [http://www.nature.com/news/2011/110323/full/471425a.html Nature], which called '''SOAP ''de novo'', ABySS and ALLPATHS-LG the winners'''.
 
 
 
However, a talk on the results website http://assemblathon.org/assemblathon-1-results names '''SOAP ''de novo'', sanger-sga and ALLPATHS-LG''' as being consistently amongst the '''best performers''' for this synthetic genome.
 
 
 
I want to assemble:
 
* Mostly 454 data
 
** small genome => MIRA, Newbler

** all others => Newbler
 
* Mixed data (454 and Illumina)
 
** small genome => MIRA, but try other ones as well
 
** medium genome => no clear recommendation
 
** large genome => assemble Illumina data with ALLPATHS-LG and SOAP de novo, then add in other reads or use them for scaffolding
 
* Mostly Illumina (or Colorspace)
 
** small genome => MIRA, velvet
 
** medium genome => no clear recommendation
 
** large genome => assemble Illumina data with ALLPATHS-LG and SOAP de novo, then add in other reads or use them for scaffolding
 
(For large genomes, this is based on the fact that not many assemblers can deal with them, and on the Assemblathon outcome. For 454 data, this is based on Newbler's good general performance, on MIRA's various outputs and its versatility, and on the theoretical consideration that de Bruijn graph-based approaches might fare worse.)
 
 
 
I want to start a large genome project for the least cost
 
* Use Illumina reads that follow the ALLPATHS-LG specification (i.e. overlapping); these reads will also work in, e.g., SOAP de novo
 
(This recommendation is based on the Assemblathon outcome, the original ALLPATHS-LG publication ([http://www.ncbi.nlm.nih.gov/pubmed/21187386 Gnerre et al., 2011]), as well as a publication that used ALLPATHS-LG for the assembly of ''Arabidopsis'' genomes ([http://www.pnas.org/content/108/25/10249.long Schneeberger et al., 2011]).)
 
 
 
Each software package has its particular strengths. If you have specific requirements, the results from the [http://assemblathon.org/ Assemblathon] can guide you. Another comparison site, [http://gage.cbcb.umd.edu/ GAGE], had yet to release its comparison at the time of writing.
 
 
 
= Software Packages =
 
 
 
== Free Software ==
 
 
 
 
 
=== ABySS ===
 
 
 
[[ABySS]] is a de-novo assembler that can run on multiple nodes, using the Message Passing Interface (MPI) for communication. As ABySS distributes tasks, the amount of RAM needed per machine is smaller, and thus ABySS is able to cope with large genomes.
 
 
 
* Pros
 
** distributed design, so a cluster can be used
 
** a large genome can be assembled with relatively little RAM per compute node. A human genome was assembled on 21 nodes having '''16GB RAM each'''
 
 
 
* Cons
 
** relatively slow
 
 
 
=== Allpaths-LG ===
 
ALLPATHS-LG is a novel assembler requiring specialized libraries.

The authors of the software benchmarked ALLPATHS-LG against SOAP de novo and reported superior performance for ALLPATHS-LG. However, it must be noted that they might not have used the SOAP de novo gap-filling module for one of the data sets due to time constraints. Using it would probably have improved the contiguous sequence length of the SOAP assembly.

In our own hands (usadellab) we have seen similarly good N50 results, and [http://www.pnas.org/content/108/25/10249.long Schneeberger et al., 2011] also reported good N50 values for ALLPATHS-LG ''Arabidopsis'' assemblies. Similarly, ALLPATHS-LG was named as performing well in the Assemblathon.
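Since N50 values are used throughout this page to compare assemblies, the following minimal sketch shows how the statistic is commonly computed: it is the contig (or scaffold) length at which the contigs of that length or longer together contain at least half of all assembled bases. The function name <code>n50</code> and the contig lengths are made up for illustration and do not come from any of the assemblies discussed here.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half of all bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bases) for two toy assemblies of equal total size.
fragmented = [100, 100, 100, 100, 100, 100, 100, 100]
contiguous = [500, 200, 50, 50]

print("fragmented N50:", n50(fragmented))
print("contiguous N50:", n50(contiguous))
```

Note that a high N50 only reflects contiguity; it says nothing about whether the contigs are correctly assembled.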
 
 
 
* Pros
 
** relatively fast runtime (slower than SOAP)
 
** good scaffold length (likely better than SOAP)
 
** can use long reads (e.g. PacBio), but only for small genomes
 
 
 
* Cons
 
** specially tailored libraries are necessary

** large genomes (mammalian size) need a lot of RAM. The publication estimates that about '''512GB''' would be sufficient, though
 
** slower than SOAP
 
 
 
=== Euler-SR USR ===
 
[[EULER]]
 
 
 
=== MIRA ===
 
[[MIRA]] is a general purpose assembler that can integrate various platform data and perform true hybrid assemblies.
 
 
 
* Pros
 
** very well documented and many switches
 
** can combine different sequencing technologies
 
** likely produces results of relatively good quality
 
 
 
* Cons
 
** Only partly multithreaded; because of this, and because of the underlying technology, it is slow
 
** Probably not recommended to assemble larger genomes
 
 
 
=== SOAP de novo ===
 
[[SOAPdenovo]] is an all purpose genome assembler. It was used to assemble the giant panda genome.
 
 
 
* Pros
 
** SOAP de novo uses a medium amount of RAM
 
** SOAP de novo is relatively fast (probably the fastest free assembler)
 
** SOAP de novo contains a scaffolder and a read-corrector
 
** SOAP de novo is relatively modular (read-corrector, assembly, scaffold, gap-filler)
 
 
 
* Cons
 
** potentially somewhat confusing way in which contigs are built.
 
** Relatively large amount of RAM needed, [http://soap.genomics.org.cn/soapdenovo.html BGI] states ca. '''150GB''' (less than ALLPATHS though)
 
 
 
=== Velvet ===
 
[[Velvet]]
 
 
 
* Pros
 
* Cons
 
** Velvet might need large amounts of RAM for large genomes, potentially ''' > 512 GB''' for a human genome, if such an assembly is possible at all. This is based on an approximation formula derived by [http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-July/000474.html Simon Gladman] for smaller genomes: -109635 + 18977*ReadSize + 86326*GenomeSize (in MB) + 233353*NumReads (in millions) - 51092*Kmersize
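The approximation formula quoted above can be evaluated with a few lines of code. This is a sketch only: the coefficients are copied from the formula on this page, but the result is assumed here to be in kilobytes, which this page does not state, and the example project (36-base reads, a 5-megabase genome, 4 million reads, k = 31) is hypothetical. As the formula was derived for smaller genomes, extrapolating it to large genomes should be treated with caution.

```python
def velvet_ram_estimate_kb(read_size, genome_size_mb, num_reads_millions, kmer_size):
    """Evaluate the empirical Velvet RAM formula quoted on this page.

    read_size: read length in bases
    genome_size_mb: genome size in megabases
    num_reads_millions: number of reads in millions
    kmer_size: k-mer size used for the assembly
    The returned value is ASSUMED to be in kilobytes.
    """
    return (-109635
            + 18977 * read_size
            + 86326 * genome_size_mb
            + 233353 * num_reads_millions
            - 51092 * kmer_size)


def kb_to_gb(kilobytes):
    """Convert kilobytes to gigabytes (1 GB = 1024 * 1024 kB)."""
    return kilobytes / (1024 ** 2)


# Hypothetical bacterial project: 36-base reads, 5 Mb genome, 4 million reads, k = 31.
small_kb = velvet_ram_estimate_kb(36, 5, 4, 31)
print("estimated RAM: %.2f GB" % kb_to_gb(small_kb))
```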
 
 
 
== Commercial ==
 
=== CLC cell ===
 
The CLC assembly cell is a commercial assembler released by CLC. It is most likely based on a k-mer approach.
 
 
 
* Pros
 
** CLC uses very little RAM
 
** CLC is very fast
 
 
 
* Cons
 
** CLC is not free
 
** CLC might be a bit more liberal in folding repeats based on our own plant data.
 
** CLC does not perform any real scaffolding, although paired-end data is used.
 
 
 
=== Newbler ===
 
[[Newbler]] is an assembler released by the Roche company.
 
 
 
* Pros
 
** Newbler has been used in many assembly projects
 
** Newbler seems to be able to produce good N50 values
 
** Newbler is often relatively precise
 
 
 
* Cons
 
** Newbler is tailored to (mostly) 454 data. Since Ion Torrent PGM data has a similar error profile (predominance of miscalled homopolymer repeats), it may be a good choice there also. Whilst it can accommodate a limited amount of Illumina data, as has been described [http://contig.wordpress.com/2011/01/21/newbler-input-ii-sequencing-reads-from-other-platforms/ here], this is not possible for larger data sets.

** As Newbler at least partly uses the OLC approach, large assemblies can take time
 
 
 
= Further Reading Material and References =
 
 
 
* Background
 
** [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Sequence Assembly Primer]
 
 
 
* Original publications
 
** [http://genome.cshlp.org/content/19/6/1117.full Simpson et al., 2009] ABySS
 
** [http://genome.cshlp.org/content/18/5/821.long Zerbino and Birney, 2008] Velvet
 
 
 
 
 
* Comparisons
 
** [http://genomebiology.com/2011/12/3/R31 Ye et al., 2011] Comparison of Sanger/PCAP; 454/Roche and Illumina/SOAP assemblies. '''Illumina/SOAP''' had '''lower substitution, deletion and insertion rates''' but '''lower contig and scaffold N50 sizes than 454/Newbler'''.
 
** [http://bib.oxfordjournals.org/content/11/5/457.abstract Paszkiewicz et al., 2010] General review about short read assemblers
 
** [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0017915 Zhang et al., 2011] In-depth comparison of different genome assemblers on simulated Illumina read data. Unfortunately, only genomes up to medium size were tested. For '''eukaryotic genomes''', '''SOAP de novo''' is suggested for '''short reads''' and '''ALLPATHS-LG''' for '''longer reads'''.

** [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0023501 Chapman JA et al., 2011] Introduces the new assembler [[Meraculous]] and gathers literature data on the assembly of ''E. coli'' K12 MG1655 for Allpaths 2, SOAPdenovo, Velvet, Euler-SR, Euler, Edena, ABySS and SSAKE. '''Allpaths 2''' had by far the '''largest contig and scaffold N50''' and was, apart from Meraculous, the only assembler that was '''misassembly-free'''. '''Meraculous''' was even shown to contain '''no errors'''.
 
** The [http://assemblathon.org Assemblathon] comparing ''de novo'' genome assemblies of many different teams based on a synthetic genome
 

Latest revision as of 13:17, 13 November 2012
