Difference between revisions of "How-to/scaffolding"

From SEQwiki
< How-to
How-toHow-to/scaffolding
Jump to: navigation, search
(Evaluation)
(Example Summary)
Line 258: Line 258:
  
 
== Example Summary ==
 
== Example Summary ==
In total SOAP is much faster then SSPACE on this example which is much faster than SOPRA.
+
In total SOAP runs faster than SSPACE on this example but both are much faster than SOPRA.
In terms of ease of use SSPACE requires a bit less than SOAP but both are much easier than SOPRA.
+
In terms of ease of use SSPACE requires a bit less than SOAP but both are much easier to use than SOPRA.
 
N50 wise SOAP and SSPACE are given similar results here whcih are better than SOPRA. Please note though that we did not test the correctnedd of the data.
 
N50 wise SOAP and SSPACE are given similar results here whcih are better than SOPRA. Please note though that we did not test the correctnedd of the data.

Revision as of 18:26, 25 September 2011

Scaffolding

Scaffolding is the use of higher order information to group, orient and connect islands of sequence in a de novo genome assembly. This information could include paired-end sequences, mate paired sequences, restriction maps, clone (cosmid/fosmid/BAC/PAC) maps. Scaffolds may also be formed by inferring such information from a related genome. Many de novo assembler programs (Velvet, ABySS, SOAP, ALLPATHS-LG) have a scaffolding step included already supporting this read based scaffolding. One notable exception is CLC which does not provide large scaffolds.

Decision Helper

  • In any case you might consider using a de novo genome assembler program that has a built in scaffolder.
  • If you have only Illumina data use SSPACE (or SOAPdenovo)
  • If you have relatively few links and multiple other data types use Bambus
  • If you manage to get Bambus to run with your data use it
  • If you need a complete software package that does everything in one step use SSPACE

Software Packages

SSPACE

SSPACE is scaffolding simply based on paired end and mate pair data. SSPACE maps illumina reads to contigs and build contigs based on these. It can be configured how many links have to be found and what the link ratio should be (i.e. maximal number of links to a second best hit). SSPACE uses bowtie to map the reads to contig parts and is discussed in the forum here.

  • Pros
    • relatively easy to use for Illumina data
    • relatively easy to modify with perl knowledge
    • a one stop shop solution, SSPACE just needs contigs in FASTA format and reads in FASTQ format
  • Cons
    • a more flexible version is no longer free

Bambus

Bambus

  • Pros
    • Can use various kinds of scaffolding information
  • Cons
    • Is hard to configure
    • Might not be able to cope with millions of links as produced by Illumina libraries

SOAP de novo

SOAPdenovo as a modular tool can be used as a scaffolder for other assembly tools as well.

  • Pros
    • Tried and tested
  • Cons
    • Requires some convivincing and scripting to perform scaffolding

SOPRA

SOPRA is a perl based tool which can be integrated with SSAKE or velvet. The original publication did not benchmark it against any stand-alone scaffolders.

  • Pros
    • Can also integrate marker information
  • Cons
    • Requires a lot of manual steps in between and some reformatting of reads

Further Reading Material and References

Examples

We will just provide walk through examples based on the SSPACE scaffolder example dataset.

Example 1: SSPACE

This is just based on the SSPACE example. Download SSPACE v1.2. and unzip it (note that the linux64 version is contained in a .zip file)

 unzip 
 cd SSPACE-1.2_linux-x86_64

as unzip doesn't contain the proper permissions cd into the bowtie directory and make bowtie and bowtie build exectuable

 cd bowtie
 chmod +x bowtie
 chmod +x bowtie_build

cd into the example directory

 cd example

In the directory you already find two files the contigs contigs_abyss.fasta and a library description libraries.txt. It just contains a single line

lib1 SRR001665_1.fastq SRR001665_2.fastq 200 0.75 0

meaning there is a single library lib1 conssiting of the forward reads SRR001665_1.fastq and the reverse reads SRR001665_2.fastq which have an insert size of 200 base pairs with a fraction of 0.75 (*200) error tolerance for the insert size and the reads are in --><--- format (0) as ooped to typical mate pair data <---|---> (1).


Now obtain the SRR001665 data you could type

 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_1.fastq.gz
 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_2.fastq.gz

unpack the two files

 gunzip SRR001665_1.fastq.gz
 gunzip SRR001665_2.fastq.gz


run SSPACE without extension (-x 0) having at least 5 links (- k5) and a link ratio of 0.3 (-a 0.3)

 perl ../SSPACE_v1-2.pl -l libraries.txt -s contigs_abyss.fasta -k 5 -a 0.7 -x 0 -b ecoli_scaffolds

You should now have a file called ecoli_scaffolds.summaryfile.txt which tells you that the N50 went from 18242 to 95879 after scaffolding.

As a little trick in general you might want to chnage the file mapWithBowtie.pl in the bin directory to include the -p #Threadnumber in the lines system("$bowtiepath

Evaluation

We use fac.pl from ABySS you can get it from this email http://www.bcgsc.ca/pipermail/abyss-users/2009-September/000330.html.

 perl fac.pl ecoli_scaffolds.final.scaffolds.fasta
 n       n:100  n:N50  min    median mean   N50    max    sum
 123     123    16     100    16737  36906  95877  267461 4539527

Example 2: SOPRA

SOPRA Step 0 preparation

Get SOPRA from http://www.physics.rutgers.edu/~anirvans/SOPRA/ we work with release 1.4.6.

 wget http://www.physics.rutgers.edu/~anirvans/SOPRA/SOPRA_v1.4.6.zip
 unzip SOPRA_v1.4.6.zip

The archive (one again) doesn't contain permission information but a lot of MACOSX files

Get the micro assembly and the reads from the SSPCACE assembly (see Example 1).

We need to combine the two FASTQ libraries into one FASTA libary where the reads from the two llibraries alternate. In this particular case the following script can be used. However this expects the FASTQ files to be complete and having only one line per sequence information and quality information, so get something more elaborate for real data.

 #!/usr/bin/perl
 
 open FIL1,"<$ARGV[0]" or die $!;
 open FIL2,"<$ARGV[1]" or die $!;
 
 my $line1;
 my $line2;
 my $line;
 while ($line1=<FIL1>) {
 	$line2=<FIL2>;
 	$line1=~s/^@/\>/; #convert @ to >
 	$line2=~s/^@/\>/; #convert @ to >
 	print $line1;
 	$line=<FIL1>;
 	print $line; #print sequence information
 	
 	print $line2;
 	$line=<FIL2>;
 	print $line;  #print sequence information
 	
 	<FIL1>;
 	<FIL1>;
 	<FIL2>;
 	<FIL2>;
 	#discard quality information
 }
 close(FIL1);
 close(FIL2);


if you name the above script combiner.pl you can use it as followas

 perl combiner.pl SRR001665_1.fastq SRR001665_2.fastq >srr.fasta

SOPRA Step 1

change into the sopra with prebuilt contigs directory

 cd source_codes_v1.4.6/SOPRA_with_prebuilt_contigs/

and execute the first step as follows

 perl s_prep_contigAseq_v1.4.6.pl -contig contigs_abyss.fasta -mate srr.fasta -a outdir

This should generate the new files contigs_sopra.fasta and srr_sopra.fasta. We have to manually align the reads srr_sopra.fasta against the contigs. We do this using bowtie using the -v 0 -f --sam -m 1 options (no mismatches, fasta format sam output and only report one reportable alignment, suppress alignments where multiple alignments are valid. The latter is not necessary as SOPRA would do this as well.)

 ./bowtie-build contigs_sopra.fasta SOPRA
 ./bowtie -v 0 -m 1 -f --sam SOPRA srr_sopra.fasta  >  mysam_mate_illu1

now run the next SOPRA step

SOPRA Step 2

Assuming we are in the output directory from the previous step already we set the output (-a to the current directory .)

  perl  ../s_parse_sam_v1.4.6.pl -sam mysam_mate_illu1 -a .

This needs around 20GB of RAM

SOPRA Step 3

Now we just give the insert size (200) of the mate pairs

 perl ../s_read_parsed_sam_v1.4.6.pl -parsed mysam_mate_illu1_parsed -d 200 -a .

SOPRA Step 4

And finally we are ready to scaffold

   perl ../s_scaf_v1.4.6.pl -w 5 -o orientdistinfo_c5 -a .

We require at least 5 links (-w 5)

Evaluation

running fac.pl (You can get it from this email http://www.bcgsc.ca/pipermail/abyss-users/2009-September/000330.html)

 perl fac.pl scaffolds_h2.2_L150_w5.fasta
 n       n:100  n:N50  min    median mean   N50    max    sum
 252     252    21     100    655    18022  65959  173822 4541546 scaffolds


Example 3: SOAP

You will get SOAPdenovo and the data prepare module

 wget http://soap.genomics.org.cn/down/prepare.tgz
 wget http://soap.genomics.org.cn/down/x86_64.linux/SOAPdenovo31mer.tgz
 tar xvzf SOAPdenovo31mer.tgz
 tar xzvf prepare.tgz


Now we prepare the data using a kmer size of e.g. 24 and the name prefix EC for e coli

 ./prepare -c contigs_abyss.fasta -K 24 -g EC

Also we have to make a config file. We name this scaf.config

 #maximal read length
 max_rd_len=36
 [LIB]
 #average insert size
 avg_ins=200
 #if sequence needs to be reversed 
 reverse_seq=0
 #use for scaffolding only
 asm_flags=2
 #in which order the reads are used while scaffolding
 rank=1
 #fastq files
 q1=./SRR001665_1.fastq
 q2=./SRR001665_2.fastq

First reads are mappeed

 ./SOAPdenovo31mer map -s scaf.config  -g EC
 

And then we scaffold

 ./SOAPdenovo31mer scaff -g EC

One can add -F for gap filling

Evaluation

Once again we use ABySS fac.pl

perl fac.pl EC.scafSeq
 n       n:100  n:N50  min    median mean   N50    max    sum
 174     174    16     100    3099   26040  95277  266933 4531120 EC.scafSeq


if we used gap filling in the last SOAP step

 n       n:100  n:N50  min    median mean   N50    max    sum
 174     174    16     100    3121   26105  95307  267763 4542343 EC.scafSeq


Example Summary

In total SOAP runs faster than SSPACE on this example but both are much faster than SOPRA. In terms of ease of use SSPACE requires a bit less than SOAP but both are much easier to use than SOPRA. N50 wise SOAP and SSPACE are given similar results here whcih are better than SOPRA. Please note though that we did not test the correctnedd of the data.