Difference between revisions of "How-to/de novo assembly"

From SEQwiki

Revision as of 19:35, 25 September 2011

De-novo short read assemblers

The generation of short reads by next-generation sequencers has led to an increased need to assemble the vast number of short reads that are generated. This is no trivial problem, as the sheer number of reads makes it nearly impossible to use, for example, the overlap-layout-consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical data generated by Illumina use a k-mer approach based on de Bruijn graphs.
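
To illustrate the k-mer idea, the following is a minimal sketch of how a de Bruijn graph can be built from reads: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. The function name `build_de_bruijn` and the toy reads are made up for this illustration; real assemblers use far more compact data structures and also handle reverse complements and sequencing errors.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in the reads."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Edges that occur several times (here because the two reads overlap) indicate k-mers seen in more than one read, which is the information an assembler uses to estimate coverage.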


A clear distinction has to be made according to the size of the genome to be assembled.

  • small (e.g. bacterial genomes: few Megabases)
  • medium (e.g. lower plant genomes: several hundred Megabases)
  • large (e.g. mammalian and plant genomes: Gigabases)

All de-novo assemblers will be able to cope with small genomes and, given decent sequencing libraries, will produce relatively good results. Even for medium-sized genomes, most de-novo assemblers mentioned here and many others will likely fare well and produce a decent assembly. That said, OLC-based assemblers might take weeks to assemble a typical genome. Large genomes are still difficult to assemble from short reads alone (such as those provided by Illumina). Assembling such a genome with Illumina reads will probably require a machine with about 256 GB and potentially even 512 GB of RAM, unless one is willing to use a small cluster (ABySS, Ray, Contrail) or invest in commercial software (CLC).


Some steps which are likely common to most assemblies

  1. If it is within reason and would not tamper with the biology: Try to get DNA from haploid or at least mostly homozygous individuals.
  2. Make sure that all libraries are of acceptable quality and that there is no major concern (e.g. use FastQC).
  3. For paired-end data you might also want to estimate the insert size based on draft assemblies or assemblies which you have already made.
  4. Before submitting data to a de-novo assembler it is often a good idea to clean the data, e.g. to trim away bad bases towards the end of reads and/or to drop reads altogether. As low-quality bases are more likely to contain errors, they might complicate the assembly process and lead to higher memory consumption (more is not always better). That said, several general-purpose short read assemblers such as SOAPdenovo and ALLPATHS-LG can perform read correction prior to assembly.
  5. Before running any large assembly, double and triple check the parameters you feed the assembler.
  6. After assembly it is often advisable to check how well your read data really agrees with the assembly and whether there are any problematic regions.
  7. If you run de Bruijn graph based assemblies you will want to try different k-mer sizes. Whilst there is no rule of thumb for any individual assembly, with error-free reads smaller k-mers lead to a more tangled graph and larger k-mers to a less tangled one. However, a smaller k-mer size is likely to be more robust to sequencing errors, and a k that is too large might not yield enough edges in the graph and would therefore result in small contigs.
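
The cleaning step (trimming bad bases towards the end of a read) can be sketched as follows. This is only an illustration, not a replacement for a dedicated trimming tool: the function name `trim_3prime`, the quality threshold of 20 and the Phred+33 quality encoding are assumptions and need to be adapted to your data.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record.
    offset=33 assumes Sanger/Phred+33 encoding (an assumption; older
    Illumina files may use an offset of 64).
    """
    end = len(seq)
    # Walk back from the 3' end while the quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5##")
print(trimmed_seq, trimmed_qual)
```

Reads that become very short after trimming are usually dropped altogether, and for paired-end data the mate has to be handled consistently so that the two files stay in sync.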

Decision Helper

This is based both on personal experience as well as on published studies. Please note however that genomes are different and software packages are constantly evolving.

Nature reported that the Assemblathon challenge, which used a synthetic diploid genome, named SOAPdenovo, ABySS and ALLPATHS-LG as the winners.

However, a talk on the results website http://assemblathon.org/assemblathon-1-results names SOAPdenovo, sanger-sga and ALLPATHS-LG as consistently amongst the best performers for this synthetic genome.

I want to assemble:

  • Mostly 454 data
    • small genome => MIRA, Newbler
    • all others => Newbler
  • Mixed data (454 and Illumina)
    • small genome => MIRA, but try other ones as well
    • medium genome => no clear recommendation
    • large genome => assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding
  • Mostly Illumina (or Colorspace)
    • small genome => MIRA, velvet
    • medium genome => no clear recommendation
    • large genome => assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding

(For large genomes this is based on the fact that not many assemblers can deal with large genomes, and based on the assemblathon outcome. For 454 data this is based on Newbler's good general performance, and MIRA's different outputs, its versatility and the theoretical consideration that de Bruijn based approaches might fare worse)

I want to start a large genome project for the least cost

  • Use Illumina reads with ALLPATHS-LG specification (i.e. overlapping), the reads will work in e.g. SOAP de novo as well

(This recommendation is based on the Assemblathon outcome, the original ALLPATHS publication (Gnerre et al., 2011) as well as a publication that used ALLPATHS for the assembly of Arabidopsis genomes (Schneeberger et al., 2011).)

Each software package has its particular strengths. If you have specific requirements, the results from the Assemblathon will guide you. Another comparison site, GAGE, has yet to release its comparison.

Software Packages

Free Software

ABySS

ABySS is a de-novo assembler which can run on multiple nodes, using the Message Passing Interface (MPI) for communication. As ABySS distributes tasks, the amount of RAM needed per machine is smaller, and thus ABySS is able to cope with large genomes.

  • Pros
    • distributed design, so a cluster can be used
    • a large genome can be assembled with relatively little RAM per compute node. A human genome was assembled on 21 nodes having 16GB RAM each
  • Cons
    • relatively slow

Allpaths-LG

ALLPATHS-LG is a novel assembler requiring specialized libraries. The authors of the software benchmarked ALLPATHS-LG against SOAPdenovo and reported superior performance. However, it must be noted that they might not have used the SOAPdenovo gap-filling module for one of the data sets due to time constraints. This would probably have improved the contiguous sequence length of the SOAP assembly. In our own hands (usadellab) we have seen similarly good N50 results, and Schneeberger et al. (2011) also reported good N50 values for ALLPATHS-LG Arabidopsis assemblies. Similarly, ALLPATHS-LG was named as performing well in the Assemblathon.

  • Pros
    • relatively fast runtime (slower than SOAP)
    • good scaffold length (likely better than SOAP)
    • can use long reads (e.g. PacBio) but only for small genomes
  • Cons
    • specially tailored libraries are necessary
    • large genomes (mammalian size) need a lot of RAM. The publication estimates that about 512 GB would be sufficient, though
    • slower than SOAP

Euler SR USR

EULER is an assembler that includes an error correction module.

  • Pros
    • Has an error correction module
  • Cons

MIRA

MIRA is a general purpose assembler that can integrate various platform data and perform true hybrid assemblies.

  • Pros
    • very well documented and many switches
    • can combine different sequencing technologies
    • likely yields relatively good quality results
  • Cons
    • only partly multithreaded; because of this and the underlying approach, it is slow
    • Probably not recommended to assemble larger genomes

SOAP de novo

SOAPdenovo is an all purpose genome assembler. It was used to assemble the giant panda genome.

  • Pros
    • SOAP de novo uses a medium amount of RAM
    • SOAP de novo is relatively fast (probably the fastest free assembler)
    • SOAP de novo contains a scaffolder and a read-corrector
    • SOAP de novo is relatively modular (read-corrector, assembly, scaffold, gap-filler)
  • Cons
    • potentially somewhat confusing way in which contigs are built.
    • Relatively large amount of RAM needed, BGI states ca. 150GB (less than ALLPATHS though)

Velvet

Velvet

  • Pros
  • Cons
    • Velvet might need large amounts of RAM for large genomes, potentially > 512 GB for a human genome, if such an assembly is possible at all. This is based on an approximation formula derived by Simon Gladman for smaller genomes: -109635 + 18977*ReadSize + 86326*GenomeSize (in MB) + 233353*NumReads (in millions) - 51092*Kmersize
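
The approximation formula quoted above can be evaluated with a few lines of Python. This is a sketch only: the function name `velvet_ram_estimate` and the example numbers (36 bp reads, a 4.6 Mb genome, 20 million reads, k=31) are chosen for illustration. The text above does not state the unit of the result; it is commonly quoted as kilobytes, which is treated as an assumption here.

```python
def velvet_ram_estimate(read_size, genome_size_mb, num_reads_million, kmer_size):
    """Evaluate the approximation formula attributed to Simon Gladman.

    The unit of the returned value is assumed to be kilobytes; the
    formula was derived for smaller genomes and may not extrapolate well.
    """
    return (-109635
            + 18977 * read_size
            + 86326 * genome_size_mb
            + 233353 * num_reads_million
            - 51092 * kmer_size)


# Illustrative bacterial-sized example (values are assumptions).
estimate = velvet_ram_estimate(read_size=36, genome_size_mb=4.6,
                               num_reads_million=20, kmer_size=31)
print(f"estimated RAM: {estimate:.0f} (assumed kB), about {estimate / 1024 ** 2:.1f} GB")
```

Because the term for the number of reads dominates, reducing the read count (for example by dropping low-quality reads) lowers the estimate much more than changing the k-mer size.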

Commercial

CLC cell

The CLC assembly cell is a commercial assembler released by CLC. It is most likely based on a kmer approach.

  • Pros
    • CLC uses very little RAM
    • CLC is very fast
  • Cons
    • CLC is not free
    • CLC might be a bit more liberal in folding repeats based on our own plant data.
    • CLC does not perform any real scaffolding, although paired-end data is used.

Newbler

Newbler is an assembler released by the Roche company.

  • Pros
    • Newbler has been used in many assembly projects
    • Newbler seems to be able to produce good N50 values
    • Newbler is often relatively precise
    • Newbler can usually be obtained free of charge
  • Cons
    • Newbler is tailored to (mostly) 454 data. Since Ion Torrent PGM data has a similar error profile (predominance of miscalled homopolymer repeats), it may be a good choice there also. Whilst it can accommodate some limited amount of Illumina data as has been described here, this is not possible for larger data sets.
    • As Newbler at least partly uses the OLC approach, large assemblies can take time

Further Reading Material and References


  • Comparisons
    • Ye et al., 2011 Comparison of Sanger/PCAP; 454/Roche and Illumina/SOAP assemblies. Illumina/SOAP had lower substitution, deletion and insertion rates but lower contig and scaffold N50 sizes than 454/Newbler.
    • Paszkiewicz et al., 2010 General review about short read assemblers
    • Zhang et al., 2011 In-depth comparison of different genome assemblers on simulated Illumina read data. Unfortunately only genomes up to medium size were tested. For eukaryotic genomes, SOAPdenovo is suggested for short reads and ALLPATHS-LG for longer reads.
    • Chapman JA et al., 2011 introduce the new assembler Meraculous and gather literature data on the assembly of E. coli K12 MG1655 for Allpaths 2, SOAPdenovo, Velvet, Euler-SR, Euler, Edena, ABySS and SSAKE. Allpaths 2 had by far the largest contig and scaffold N50 and was, apart from Meraculous, the only assembler free of misassemblies. Meraculous was even shown to contain no errors.
    • Liu et al., 2011 benchmark their new assembler PASHA against SOAPdenovo (v 1.04), Velvet (1.0.17) and ABySS (1.2.1) using three bacterial data sets. Whilst PASHA usually produced the largest NG50 and NG80 (N50 and N80 calculated with the true genome sizes), SOAPdenovo produced the highest number of contigs and sometimes worse NG50 and NG80. However, for one data set SOAPdenovo showed the best genome coverage.
    • The Assemblathon compares de novo genome assemblies of many different teams based on a synthetic genome. The Assemblathon 1 competition is summarized in Earl et al., 2011.
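
Since several of the comparisons above report N50, N80, NG50 and NG80 values, a short sketch of how these statistics are computed may help. The function name `nx` and the contig lengths are made up for this illustration. N50 is the length of the contig at which the contigs, sorted from longest to shortest, first cover half of the total assembly length; NG50 uses the true genome size instead of the assembly length.

```python
def nx(lengths, fraction=0.5, total=None):
    """Return the Nx statistic for a list of contig lengths.

    fraction=0.5 gives N50, fraction=0.8 gives N80. If total is given
    (e.g. the true genome size), the result is the corresponding NG value.
    """
    if total is None:
        total = sum(lengths)
    target = fraction * total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the contigs never reach the target (possible for NG values)


contigs = [100, 80, 60, 40, 20]       # hypothetical contig lengths
n50 = nx(contigs)                     # half of the 300 bp assembly
n80 = nx(contigs, fraction=0.8)
ng50 = nx(contigs, total=400)         # assuming a true genome size of 400 bp
print(n50, n80, ng50)
```

An assembly that is much shorter than the genome can have a good N50 but a poor NG50, which is why the NG values are the fairer basis for comparing assemblers.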

Examples

SOAP denovo

To get some E. coli data from SRR001665 you could type

 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_1.fastq.gz
 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_2.fastq.gz

Unpack the two files

 gunzip SRR001665_1.fastq.gz
 gunzip SRR001665_2.fastq.gz

You will need to get SOAPdenovo and the data prepare module

 wget http://soap.genomics.org.cn/down/x86_64.linux/SOAPdenovo31mer.tgz
 tar xvzf SOAPdenovo31mer.tgz
 


Also we have to make a config file. We name this cont.config

 #maximal read length
 max_rd_len=36
 [LIB]
 #average insert size
 avg_ins=200
 #if sequence needs to be reversed 
 reverse_seq=0
 #use for contig building only
 asm_flags=1
 #in which order the reads are used while scaffolding
 rank=1
 #fastq files
 q1=./SRR001665_1.fastq
 q2=./SRR001665_2.fastq


We then assemble using a k-mer size of 31 (the read length is 36) and run the whole SOAP pipeline by specifying the "all" parameter. Because asm_flags is set to 1, the library is used for contig building only, so SOAP will terminate in the scaffolding step with a floating point exception, as there is nothing to scaffold with. The contigs will nevertheless be found in EC.contigs. Setting asm_flags to 3 would use the same library for scaffolding as well.

 ./SOAPdenovo31mer all -K 31 -s cont.config -o EC


ABySS

We need to get Google sparsehash first. Assuming you don't have root privileges and your user name is USER, we install it into a local include directory

 wget http://google-sparsehash.googlecode.com/files/sparsehash-1.11.tar.gz
 tar xvzf sparsehash-1.11.tar.gz
 cd sparsehash-1.11
 ./configure --prefix=/home/USER
 make
 make install


Now get ABySS (assuming you have the Boost libraries installed)

 wget http://www.bcgsc.ca/downloads/abyss/abyss-1.2.7.tar.gz
 tar xzvf abyss-1.2.7.tar.gz
 cd abyss-1.2.7
 ./configure CPPFLAGS=-I/home/USER/include
 make

For more details about the installation process see the ABySS entry, or install the .deb package if you have a Debian-compatible system.

We now run ABySS with a kmer size of 25 and require at least n=10 links to join contigs.

 cd bin
 ./abyss-pe k=25 n=10 in='SRR001665_1.fastq SRR001665_2.fastq' name=ecoli
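
Because it is advisable to try several k-mer sizes, a small helper can generate one abyss-pe command per k. This is only a sketch: the function name `abyss_commands`, the chosen k values and the naming scheme for the output are assumptions. The script merely builds and prints the command lines; it does not run ABySS.

```python
def abyss_commands(reads, kmers, n=10, prefix="ecoli"):
    """Build one abyss-pe command line per k-mer size.

    Each run gets its own name so that the outputs do not overwrite
    each other (the naming scheme is an assumption of this sketch).
    """
    in_arg = " ".join(reads)
    return [
        f"./abyss-pe k={k} n={n} in='{in_arg}' name={prefix}_k{k}"
        for k in kmers
    ]


reads = ["SRR001665_1.fastq", "SRR001665_2.fastq"]
# k must stay below the read length of 36; these values are illustrative.
commands = abyss_commands(reads, kmers=[21, 25, 29])
for command in commands:
    print(command)
```

The resulting assemblies can then be compared, for example by their N50 values and by how well the reads map back to them.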