Difference between revisions of "How-to/de novo assembly"

From SEQwiki
Revision as of 15:57, 9 September 2011

De-novo short read assemblers

The generation of short reads by next-generation sequencers has led to an increased need to assemble the vast number of short reads that are generated. This is no trivial problem, as the sheer amount of data makes it nearly impossible to use, for example, the overlap-layout-consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical data generated by Illumina use a k-mer based approach.
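To make the k-mer idea concrete, here is a minimal sketch of how reads can be broken into k-mers and linked into a de Bruijn-style graph. It is an illustration only and not the implementation of any assembler named on this page; the function names, the toy reads and the choice of k = 4 are assumptions made for the example.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads (hypothetical, not real data)
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_edges(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because identical k-mers from different reads collapse into the same edge, the size of such a graph depends on the number of distinct k-mers rather than on the number of reads, which is one reason the approach scales better than pairwise read overlaps.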


A clear distinction has to be made by the size of the genome to be assembled.

  • small (e.g. bacterial genomes: few Megabases)
  • medium (e.g. lower plant genomes: several hundred Megabases)
  • large (e.g. mammalian and plant genomes: Gigabases)

All de-novo assemblers will be able to cope with small genomes and, given decent sequencing libraries, will produce relatively good results. Even for medium-sized genomes, most de-novo assemblers mentioned here and many others will likely fare well and produce a decent assembly. That said, OLC-based assemblers might take weeks to assemble a typical genome. Large genomes are still difficult to assemble with only short reads (such as those provided by Illumina). Assembling such a genome with Illumina reads will probably require a machine with about 256 GB and potentially even 512 GB of RAM, unless one is willing to use a small cluster (ABySS, Ray, Contrail) or invest in commercial software (CLC).


Some steps which are likely common to most assemblies

  1. If it is within reason and would not tamper with the biology, try to get DNA from haploid or at least mostly homozygous individuals.
  2. Make sure that all libraries are really OK quality-wise and that there is no major concern (e.g. use FastQC).
  3. For paired-end data you might also want to estimate the insert size based on draft assemblies or assemblies which you have already made.
  4. Before submitting data to a de-novo assembler it is often a good idea to clean the data, e.g. to trim away bad bases towards the end of the reads and/or to drop reads altogether. As low-quality bases are more likely to contain errors, they might complicate the assembly process and lead to higher memory consumption. That said, several general-purpose short read assemblers such as SOAP de novo and ALLPATHS-LG perform read correction prior to assembly.
  5. Before running any large assembly, double and triple check the parameters you feed the assembler.
  6. Post assembly, it is often advisable to check how well your read data really agrees with the assembly and whether there are any problematic regions.
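The cleaning step (trimming bad bases towards the end and dropping reads altogether) could be sketched as follows. This is a simplified illustration, not a replacement for a dedicated trimming tool. The function names, the quality threshold of 20 and the minimum length of 30 are assumptions chosen for the example, and the code assumes Phred+33 quality encoding; some older Illumina pipelines use Phred+64, in which case the offset has to be changed.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q.

    seq  -- read sequence
    qual -- FASTQ quality string of the same length
    offset -- ASCII offset of the quality encoding (assumed Phred+33)
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def clean_read(seq, qual, min_q=20, min_len=30, offset=33):
    """Return the trimmed read, or None if it is too short to keep."""
    seq, qual = trim_3prime(seq, qual, min_q, offset)
    if len(seq) < min_len:
        return None
    return seq, qual

# Toy read: five good bases ('I' = Q40) followed by three bad ones ('#' = Q2)
print(trim_3prime("ACGTACGT", "IIIII###"))
```

Note that this sketch only trims from the 3' end and stops at the first good base it meets; a low-quality base in the middle of a read is left untouched.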

Decision Helper

This is based on personal experience and prevalence and only meant to give you a quick primer. Please note that genomes are different and software packages are constantly evolving.

Nature reported on an Assemblathon challenge, which used a synthetic diploid genome, and named SOAP de novo, ABySS and ALLPATHS-LG as the winners.

However, a talk on the results website http://assemblathon.org/assemblathon-1-results names SOAP de novo, sanger-sga and ALLPATHS-LG as consistently amongst the best performers for this synthetic genome.

  • Mostly 454 data
    • small genome => MIRA, Newbler
    • all others => Newbler
  • Mixed data (454 and Illumina)
    • small genome => MIRA, but try other ones as well
    • medium genome =>
    • large genome => assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding
  • Mostly Illumina (or Colorspace)
    • small genome => MIRA, Velvet
    • medium genome =>
    • large genome => assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding

I want to start a large genome project for the least cost

  • Use Illumina reads that follow the ALLPATHS-LG library specification; the reads will work in e.g. SOAP de novo as well.
  • Each software package has its particular strengths. If you have specific requirements, the results from the Assemblathon will guide you. Another comparison site, GAGE, has yet to release its comparison.

Software Packages

Free Software

ABySS

ABySS is a de-novo assembler which can run on multiple nodes, using the Message Passing Interface (MPI) for communication. As ABySS distributes tasks, the amount of RAM needed per machine is smaller and thus ABySS is able to cope with large genomes.

  • Pros
    • distributed design, so a cluster can be used
    • a large genome can be assembled with relatively little RAM per compute node. A human genome was assembled on 21 nodes having 16 GB RAM each
  • Cons
    • relatively slow

Allpaths-LG

ALLPATHS-LG is a novel assembler requiring specialized libraries. The authors of the software benchmarked ALLPATHS-LG against SOAPdenovo and reported superior performance. However, it must be noted that they might not have used the SOAPdenovo gap-filling module for the mouse data set due to time constraints; using it would probably have improved the contiguous sequence length for mouse. In our own hands (usadellab) we have seen similarly good N50 results, and Schneeberger et al. also reported good N50 values for ALLPATHS-LG assemblies of Arabidopsis.

  • Pros
    • relatively fast runtime (slower than SOAP)
    • good scaffold length (likely better than SOAP)
    • can use long reads (e.g. PacBio) but only for small genomes
  • Cons
    • specially tailored libraries are necessary
    • large genomes (mammalian size) need a lot of RAM. The publication estimates that about 512 GB would be sufficient, though
    • slower than SOAP
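
Since N50 values are used on this page to compare assemblies, here is a minimal sketch of how N50 is commonly computed from a list of contig or scaffold lengths: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running sum first reaches half the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

# Hypothetical contig lengths in bp
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

N50 only describes contiguity; it says nothing about correctness, so a misassembly that joins two contigs raises the value.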


Euler-SR USR

EULER

MIRA

MIRA is a general purpose assembler that can integrate various platform data and perform true hybrid assemblies.

  • Pros
    • very well documented and many switches
    • can combine different sequencing technologies
    • likely produces relatively good quality results
  • Cons
    • Only partly multithreaded; because of this and of the underlying technology, it is slow
    • Probably not recommended for assembling larger genomes

SOAP de novo

SOAPdenovo is an all purpose genome assembler. It was used to assemble the giant panda genome.

  • Pros
    • SOAP de novo uses a medium amount of RAM
    • SOAP de novo is relatively fast (probably the fastest free assembler)
    • SOAP de novo contains a scaffolder and a read-corrector
    • SOAP de novo is relatively modular (read-corrector, assembly, scaffold, gap-filler)
  • Cons
    • potentially somewhat confusing way in which contigs are built.
    • Relatively large amount of RAM needed; BGI states ca. 150 GB (less than ALLPATHS-LG, though)

Velvet

Velvet

  • Pros
  • Cons
    • Velvet might need large amounts of RAM for large genomes, potentially > 512 GB for a human genome, if such an assembly is possible at all. This estimate is based on an approximation formula derived by Simon Gladman for smaller genomes: -109635 + 18977*ReadSize + 86326*GenomeSize + 233353*NumReads - 51092*KmerSize, with GenomeSize in megabases and NumReads in millions.
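
The approximation formula above can be evaluated with a few lines of code before committing to a run. The sketch below simply plugs numbers into the formula as quoted; the function name and the example project are hypothetical. The unit of the result is not stated on this page, so the conversion to gigabytes assumes kilobytes, which should be checked against the original source of the formula. As noted above, the formula was derived for smaller genomes, so extrapolating it to large ones is only a rough guide.

```python
def velvet_ram_estimate(read_size, genome_size_mb, num_reads_million, kmer_size):
    """Evaluate the approximation formula attributed to Simon Gladman.

    read_size         -- read length in bp
    genome_size_mb    -- genome size in megabases
    num_reads_million -- number of reads in millions
    kmer_size         -- k-mer length used for the assembly
    Assumption: the result is in kilobytes (the unit is not given in the text).
    """
    return (-109635
            + 18977 * read_size
            + 86326 * genome_size_mb
            + 233353 * num_reads_million
            - 51092 * kmer_size)

# Hypothetical bacterial project: 36 bp reads, 5 Mb genome, 10 million reads, k = 31
estimate = velvet_ram_estimate(36, 5, 10, 31)
print(f"{estimate} (about {estimate / 1024 ** 2:.1f} GB if the unit is KB)")
```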

Commercial

CLC cell

The CLC assembly cell is a commercial assembler released by CLC. It is most likely based on a k-mer approach.

  • Pros
    • CLC uses very little RAM
    • CLC is very fast
  • Cons
    • CLC is not free
    • CLC might be a bit more liberal in folding repeats, based on our own plant data.
    • CLC doesn't perform any real scaffolding, although paired-end data is used.

Newbler

Newbler is an assembler released by the Roche company.

  • Pros
    • Newbler has been used in many assembly projects
    • Newbler seems to be able to produce good N50 values
    • Newbler is often relatively precise
  • Cons
    • Newbler is tailored to (mostly) 454 data. Since Ion Torrent PGM data has a similar error profile (predominance of miscalled homopolymer repeats), it may be a good choice there also. Whilst it can accommodate a limited amount of Illumina data, as has been described here (http://contig.wordpress.com/2011/01/21/newbler-input-ii-sequencing-reads-from-other-platforms/), this is not possible for larger data sets.
    • As Newbler at least partly uses the OLC approach, large assemblies can take time

Further Reading Material and References

  • Comparisons
    • Ye et al., 2011 Comparison of Sanger/PCAP; 454/Roche and Illumina/SOAP assemblies
    • Paszkiewicz et al., 2010 General review about short read assemblers
    • Zhang et al., 2011 In-depth comparison of different genome assemblers on simulated Illumina read data. Unfortunately, only genomes up to medium size were tested.
    • The Assemblathon comparing de novo genome assemblies of many different teams based on a synthetic genome