How-to/SNP detection

From SEQwiki
Revision as of 15:37, 13 September 2011 by Mmartin (talk | contribs) (samtools)

SNP detection

SNPs, or single nucleotide polymorphisms, are heritable single-base differences between a genome and a reference sequence. They are a subset of the more general class of single nucleotide variations (SNVs), which also includes somatic single-base changes that arise from environmental damage and are not passed to offspring. Tools for SNP identification can also be used for SNV identification, though tools specific to SNV identification exist as well. In some contexts, such as cancer genomes, SNV identification is complicated by heterogeneous DNA samples.

SNP identification programs must distinguish system noise (instrument errors, PCR errors, etc.) from actual variation. They generally do so by modeling the various error types and the expected distribution of base calls under the homozygous reference (AA), homozygous variant (BB) and heterozygous variant (AB) states. Confidence in a call generally depends on the reported sequence quality values and on read depth. Some SNP/SNV callers compare individual samples to a reference, whereas others call multiple samples simultaneously, using information from each sample to assist calling in the others. SNP callers for mixed population samples also exist.
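The three-state model described above can be illustrated with a minimal sketch. It assumes a single fixed per-base error rate, a simple binomial model and a flat prior over genotypes; real callers use per-base quality values and more elaborate error models. The function names and the default error rate are hypothetical and are not taken from any of the tools on this page.

```python
from math import comb


def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Binomial likelihood of the observed read counts under AA, AB and BB."""
    depth = ref_count + alt_count
    # Probability that one read shows the variant base under each genotype.
    p_variant = {
        "AA": error_rate,        # variant bases appear only through errors
        "AB": 0.5,               # both alleles are sampled equally (simplification)
        "BB": 1.0 - error_rate,  # reference bases appear only through errors
    }
    return {
        genotype: comb(depth, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for genotype, p in p_variant.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the most likely genotype and its posterior under a flat prior."""
    likelihoods = genotype_likelihoods(ref_count, alt_count, error_rate)
    total = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods[best] / total


if __name__ == "__main__":
    # Same 50/50 allele balance, but the deeper site gives a more confident call.
    print(call_genotype(1, 1))
    print(call_genotype(10, 10))
```

Under this toy model a single variant read among twenty is still called AA, because one error is far more likely than such a skewed sampling of a heterozygous site.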

A common source of error in SNP/SNV calling is misalignment due to pseudogenes, repeated genomic segments or close orthologs; in these cases the co-alignment of reads arising from different genomic regions can result in a false positive call. Another source of error is local misalignment (or ambiguous alignment) due to indels in reads (either true indel variations or sequencing errors); realignment tools such as Dindel and those found in GATK can treat indels more consistently and so reduce this source of error. Many SNP/SNV callers are designed for diploid DNA and may not work well on samples with higher ploidy. As noted above, heterogeneity in samples such as tumor samples can frustrate SNV calling, and some callers are specifically designed to cope with this. Tumor samples may also have altered copy number due to gene or chromosomal amplification, meaning they are effectively triploid or of higher ploidy in some regions.
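To see why sample heterogeneity and altered copy number trouble callers that expect a diploid allele balance, the following sketch computes the fraction of reads expected to carry a somatic variant. It assumes a simple mixture of tumor and normal cells in which the normal cells carry no copies of the variant; the function name and its parameters are hypothetical and only serve the illustration.

```python
def expected_variant_fraction(purity, tumor_copy_number, mutated_copies,
                              normal_copy_number=2):
    """Expected fraction of reads carrying a somatic variant at one locus.

    purity            -- fraction of cells in the sample that are tumor cells
    tumor_copy_number -- total copies of the locus in each tumor cell
    mutated_copies    -- copies in each tumor cell that carry the variant
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0 <= mutated_copies <= tumor_copy_number:
        raise ValueError("mutated_copies must not exceed tumor_copy_number")
    # Variant-carrying copies come only from tumor cells.
    variant_copies = purity * mutated_copies
    # All copies of the locus, contributed by tumor and normal cells.
    total_copies = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return variant_copies / total_copies


if __name__ == "__main__":
    # Pure diploid sample, heterozygous variant: the familiar 50%.
    print(expected_variant_fraction(1.0, 2, 1))
    # Half the cells are tumor, locus amplified to three copies, one mutated.
    print(expected_variant_fraction(0.5, 3, 1))
```

In the second case only about a fifth of the reads are expected to show the variant, which a caller assuming a heterozygous fraction near one half may score poorly or dismiss as noise.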

SNP/SNV callers often call only these polymorphisms, and not (for example) small indels. Users of these tools should also take care when calling adjacent pairs of SNPs/SNVs, as the phasing of these (or more distant SNPs) is not reported in many callers' reports.

Decision Helper

I want to quickly call SNPs versus a reference => Freebayes, samtools


Software Packages

Free Software

Freebayes

Freebayes is the successor of PolyBayes, GigaBayes and BAMBayes and should be much faster than these. Like its predecessors, it relies on BAM files. Its developer has also described it in more detail on Biostar.

  • Pros
    • Very easy to run for simple SNP calling
    • Does not assume any ploidy
    • Can read BAM files via STDIN

GATK

The Genome Analysis Toolkit (GATK) supports multi-step variant calling pipelines. The authors ran their pipeline on the NA12878 exome data set and compared the results to those of Crossbow (which uses SOAPsnp). Based on these results, they concluded that Crossbow had lower specificity.

One easy way to run GATK and other tools might be to use this variant pipeline mentioned on Biostar.

  • Important reminder
    • If you run the GATK framework in your own pipeline, bear in mind that GATK has stringent file formatting requirements.
    • For example, the chromosomes in the genome reference file have to be in canonical order.
    • A BAM header has to be present in every BAM file.
    • The BAM file has to be sorted, preferably with Picard, because it writes the proper header after sorting.
    • A read-group tag has to be present in each BAM file. Either supply the correct tag during mapping, or you may waste time fixing the BAM file afterwards.
  • Pro
    • Likely relatively specific (the authors show higher specificity than Crossbow)
  • Con
    • Relatively complex pipelines
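Some of the formatting reminders listed for GATK can be checked before a run. The sketch below inspects the text of a SAM/BAM header (for instance as printed by `samtools view -H`) and reports a missing header, a sort order other than coordinate, or missing read groups. The function name is hypothetical, and the check assumes that "sorted" means coordinate-sorted; it does not verify the chromosome ordering of the reference.

```python
def check_sam_header(header_text):
    """Return a list of problems found in SAM header text (empty if none)."""
    sort_order = None
    contigs = []
    read_groups = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        record_type = fields[0]
        # Header fields have the form TAG:value; split on the first colon only.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if record_type == "@HD":
            sort_order = tags.get("SO")
        elif record_type == "@SQ":
            contigs.append(tags.get("SN"))
        elif record_type == "@RG":
            read_groups.append(tags.get("ID"))

    problems = []
    if not contigs:
        problems.append("no @SQ lines: header is missing or incomplete")
    if sort_order != "coordinate":
        problems.append("sort order is %r, expected 'coordinate'" % sort_order)
    if not read_groups:
        problems.append("no @RG lines: read-group tag is missing")
    return problems


if __name__ == "__main__":
    example = (
        "@HD\tVN:1.0\tSO:unsorted\n"
        "@SQ\tSN:chr1\tLN:1000\n"
    )
    for problem in check_sam_header(example):
        print(problem)
```

Running such a check right after mapping is cheaper than discovering a missing read group once the pipeline has already started.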

MAQ

  • Pros
    • Performed slightly better than SOAPsnp and better than SNVMix according to an independent comparison

samtools

samtools calls SNPs using the mpileup command: http://samtools.sourceforge.net/mpileup.shtml

samtools pileup (without the m) is deprecated and has been removed in recent SAMtools versions.

A SEQanswers forum thread describes some potential problems that occur when using the BAQ parameter. In effect, it recommends turning BAQ off when using an aligner that finds indels, such as BWA.
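As a rough illustration of the BAQ point, the sketch below assembles an mpileup command line with BAQ optionally disabled. It only builds the argument list and runs nothing. The options used (`-u`, `-f` and `-B`) are those of the samtools 0.1.x mpileup command; treating them as valid for your installed version is an assumption, since options have changed between releases. The function name and file names are hypothetical.

```python
def build_mpileup_command(reference_fasta, bam_files, disable_baq=False):
    """Return the argument list for a samtools mpileup call (nothing is executed)."""
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    # -u: uncompressed BCF output, -f: indexed reference FASTA (samtools 0.1.x).
    command = ["samtools", "mpileup", "-u", "-f", reference_fasta]
    if disable_baq:
        # -B: disable the BAQ (base alignment quality) computation.
        command.append("-B")
    command.extend(bam_files)
    return command


if __name__ == "__main__":
    print(" ".join(build_mpileup_command("ref.fa", ["aln.bam"], disable_baq=True)))
```

The resulting list could be passed to `subprocess.run` once the options have been checked against the documentation of the installed samtools version.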

SOAPsnp

SOAPsnp is used, for example, in the Crossbow pipeline.

SNVMix

The authors of SNVMix compared their tool to MAQ v0.6.8 and found better performance, as judged by area under the curve, when using Affymetrix SNP 6.0 data. However, in an independent comparison using MAQ 0.71, MAQ performed better.

  • Cons
    • Might be unstable in high-coverage regions according to an independent comparison
    • Might be less precise than MAQ and SOAPsnp

Commercial Software

CLCBio

Further Reading Material and References

  • Further Reading
  • Original Publications
  • Comparisons
    • Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. (2011) 12:443-51. The article gives general recommendations for a workflow and suggests using a calibration step as implemented by GATK or SOAPsnp.
    • Wang et al., 2011. A comparison of short read aligners and a performance assessment of MAQ (0.71), SOAPsnp (1.03) and SNVMix (2-0.11.8-r4).