How-to/scaffolding

From SEQwiki
< How-to
Revision as of 16:36, 25 September 2011 by Usad (talk | contribs) (SOPRA Step 4)
How-toHow-to/scaffolding
Jump to: navigation, search

Scaffolding

Scaffolding is the use of higher order information to group, orient and connect islands of sequence in a de novo genome assembly. This information could include paired-end sequences, mate paired sequences, restriction maps, clone (cosmid/fosmid/BAC/PAC) maps. Scaffolds may also be formed by inferring such information from a related genome. Many de novo assembler programs (Velvet, ABySS, SOAP, ALLPATHS-LG) have a scaffolding step included already supporting this read based scaffolding. One notable exception is CLC which does not provide large scaffolds.

Decision Helper

  • In any case you might consider using a de novo genome assembler program that has a built in scaffolder.
  • If you have only Illumina data use SSPACE (or SOAPdenovo)
  • If you have relatively few links and multiple other data types use Bambus
  • If you manage to get Bambus to run with your data use it
  • If you need a complete software package that does everything in one step use SSPACE

Software Packages

SSPACE

SSPACE is scaffolding simply based on paired end and mate pair data. SSPACE maps illumina reads to contigs and build contigs based on these. It can be configured how many links have to be found and what the link ratio should be (i.e. maximal number of links to a second best hit). SSPACE uses bowtie to map the reads to contig parts and is discussed in the forum here.

  • Pros
    • relatively easy to use for Illumina data
    • relatively easy to modify with perl knowledge
    • a one stop shop solution, SSPACE just needs contigs in FASTA format and reads in FASTQ format
  • Cons
    • a more flexible version is no longer free

Bambus

Bambus

  • Pros
    • Can use various kinds of scaffolding information
  • Cons
    • Is hard to configure
    • Might not be able to cope with millions of links as produced by Illumina libraries

SOAP de novo

SOAPdenovo as a modular tool can be used as a scaffolder for other assembly tools as well.

  • Pros
    • Tried and tested
  • Cons
    • Requires some convivincing and scripting to perform scaffolding

SOPRA

SOPRA is a perl based tool which can be integrated with SSAKE or velvet. The original publication did not benchmark it against any stand-alone scaffolders.

  • Pros
    • Can also integrate marker information
  • Cons
    • Requires a lot of manual steps in between and some reformatting of reads

Further Reading Material and References

Examples

We will just provide walk through examples based on the SSPACE scaffolder example dataset.

Example 1: SSPACE

This is just based on the SSPACE example. Download SSPACE v1.2. and unzip it (note that the linux64 version is contained in a .zip file)

 unzip 
 cd SSPACE-1.2_linux-x86_64

as unzip doesn't contain the proper permissions cd into the bowtie directory and make bowtie and bowtie build exectuable

 cd bowtie
 chmod +x bowtie
 chmod +x bowtie_build

cd into the example directory

 cd example

In the directory you already find two files the contigs contigs_abyss.fasta and a library description libraries.txt. It just contains a single line

lib1 SRR001665_1.fastq SRR001665_2.fastq 200 0.75 0

meaning there is a single library lib1 conssiting of the forward reads SRR001665_1.fastq and the reverse reads SRR001665_2.fastq which have an insert size of 200 base pairs with a fraction of 0.75 (*200) error tolerance for the insert size and the reads are in --><--- format (0) as ooped to typical mate pair data <---|---> (1).


Now obtain the SRR001665 data you could type

 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_1.fastq.gz
 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001665/SRR001665_2.fastq.gz

unpack the two files

 gunzip SRR001665_1.fastq.gz
 gunzip SRR001665_2.fastq.gz


run SSPACE without extension (-x 0) having at least 5 links (- k5) and a link ratio of 0.3 (-a 0.3)

 perl ../SSPACE_v1-2.pl -l libraries.txt -s contigs_abyss.fasta -k 5 -a 0.7 -x 0 -b ecoli_scaffolds

You should now have a file called ecoli_scaffolds.summaryfile.txt which tells you that the N50 went from 18242 to 95879 after scaffolding.

As a little trick in general you might want to chnage the file mapWithBowtie.pl in the bin directory to include the -p #Threadnumber in the lines system("$bowtiepath

Example 2: SOPRA

Step 0 preparation

Get SOPRA from http://www.physics.rutgers.edu/~anirvans/SOPRA/ we work with release 1.4.6.

 wget http://www.physics.rutgers.edu/~anirvans/SOPRA/SOPRA_v1.4.6.zip
 unzip SOPRA_v1.4.6.zip

The archive (one again) doesn't contain permission information but a lot of MACOSX files

Get the micro assembly and the reads from the SSPCACE assembly (see Example 1).

We need to combine the two FASTQ libraries into one FASTA libary where the reads from the two llibraries alternate. In this particular case the following script can be used. However this expects the FASTQ files to be complete and having only one line per sequence information and quality information, so get something more elaborate for real data.

 #!/usr/bin/perl
 
 open FIL1,"<$ARGV[0]" or die $!;
 open FIL2,"<$ARGV[1]" or die $!;
 
 my $line1;
 my $line2;
 my $line;
 while ($line1=<FIL1>) {
 	$line2=<FIL2>;
 	$line1=~s/^@/\>/; #convert @ to >
 	$line2=~s/^@/\>/; #convert @ to >
 	print $line1;
 	$line=<FIL1>;
 	print $line; #print sequence information
 	
 	print $line2;
 	$line=<FIL2>;
 	print $line;  #print sequence information
 	
 	<FIL1>;
 	<FIL1>;
 	<FIL2>;
 	<FIL2>;
 	#discard quality information
 }
 close(FIL1);
 close(FIL2);


if you name the above script combiner.pl you can use it as followas

 perl simplescript.pl SRR001665_1.fastq SRR001665_2.fastq >srr.fasta

SOPRA Step 1

change into the sopra with prebuilt contigs directory

 cd source_codes_v1.4.6/SOPRA_with_prebuilt_contigs/

and execute the first step as follows

 perl s_prep_contigAseq_v1.4.6.pl -contig contigs_abyss.fasta -mate srr.fasta -a outdir

This should generate the new files contigs_sopra.fasta and srr_sopra.fasta. We have to manually align the reads srr_sopra.fasta against the contigs. We do this using bowtie using the -v 0 -f --sam -m 1 options (no mismatches, fasta format sam output and only report one reportable alignment, suppress alignments where multiple alignments are valid. The latter is not necessary as SOPRA would do this as well.)

 ./bowtie-build.exe contigs_sopra.fasta SOPRA
 ./bowtie -v 0 -m 1 -f --sam SOPRA srr_sopra.fasta  >  mysam_mate_illu1

now run the next SOPRA step

SOPRA Step 2

Assuming we are in the output directory from the previous step already we set the output (-a to the current directory .)

  perl  ../s_parse_sam_v1.4.6.pl -sam mysam_mate_illu1 -a .

SOPRA Step 3

Now we just give the insert size (200) of the mate pairs

 perl ../s_read_parsed_sam_v1.4.6.pl -parsed mysam_mate_illu1_parsed -d 200 -a .

SOPRA Step 4

And finally we are ready to scaffold

   perl ../s_scaf_v1.4.6.pl -o orientdistinfo_cN -a .