Difference between revisions of "How-to/SNP detection"

From SEQwiki
= SNP detection =
<big>
SNPs, or single nucleotide polymorphisms, are heritable single-base differences between a genome and a reference sequence.  They belong to the more general set of single nucleotide variations (SNVs), which also includes somatic single-base changes that are due to environmental damage and are not passed to offspring.  Tools for SNP identification can also be used for SNV identification, though tools specific to SNV identification exist as well.  In some contexts, such as cancer genomes, SNV identification is complicated by heterogeneous DNA samples.
<big>
  
SNP identification programs must distinguish system noise (instrument errors, PCR errors, etc.) from actual variation. They generally do so by modeling the various error types and the expected distribution of base calls under the homozygous reference (AA), heterozygous variant (AB) and homozygous variant (BB) states.  Confidence in a call generally depends on the reported sequence quality values and on read depth.  Some SNP/SNV callers compare individual samples to a reference, whereas others call multiple samples simultaneously, using information from each sample to assist calling in the others.  SNP callers for mixed-population samples also exist.
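
The idea of scoring the AA, AB and BB states against observed base calls and quality values can be illustrated with a small sketch. This is not the model of any particular caller: it assumes independent reads, only two possible alleles, no prior on genotypes, and that a sequencing error always turns one allele into the other. The function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10.0)

def genotype_log_likelihoods(bases, quals, ref, alt):
    """Return log10 likelihoods of the AA, AB and BB states.

    bases: observed base calls at one site
    quals: Phred quality value for each base call
    ref, alt: the two alleles considered (a simplifying assumption)
    """
    loglik = {"AA": 0.0, "AB": 0.0, "BB": 0.0}
    for base, q in zip(bases, quals):
        e = phred_to_error(q)
        # Probability of this observation if the read came from ref or from alt.
        p_if_ref = 1 - e if base == ref else e
        p_if_alt = 1 - e if base == alt else e
        loglik["AA"] += math.log10(p_if_ref)
        loglik["BB"] += math.log10(p_if_alt)
        # A heterozygous site yields either allele with probability 0.5.
        loglik["AB"] += math.log10(0.5 * p_if_ref + 0.5 * p_if_alt)
    return loglik

def call_genotype(bases, quals, ref, alt):
    """Pick the state with the highest likelihood (no genotype prior)."""
    loglik = genotype_log_likelihoods(bases, quals, ref, alt)
    return max(loglik, key=loglik.get)

print(call_genotype("AAAAAGGGGG", [30] * 10, ref="A", alt="G"))
```

With more reads or higher quality values the gap between the three likelihoods grows, which is one way to see why read depth and base quality affect call confidence.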
'''Note''': This content has been ported to the new WikiBook project: http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/DNA_Variants
  
A common source of error in SNP/SNV calling is misalignment due to pseudogenes, repeated genomic segments or close paralogs; in these cases the co-alignment of reads arising from different genomic regions can result in a false positive call. Another source of error is local misalignment (or ambiguous alignment) due to indels in reads, whether true indel variations or sequencing errors; realignment tools such as [[Dindel]] and those found in [[GATK]] can treat indels more consistently and so reduce this source of error. Many SNP/SNV callers are designed for diploid DNA and may not work well on samples with higher ploidy.  As noted above, heterogeneity in samples such as tumor samples can frustrate SNV calling, and some callers are specifically designed to cope with this.  Tumor samples may also have altered copy number due to gene or chromosomal amplification, meaning they are effectively triploid or of higher ploidy in some regions.
Previous contributors should ensure they take proper credit for their work here: http://en.wikibooks.org/w/index.php?title=Next_Generation_Sequencing_(NGS)/Authors
  
SNP/SNV callers often call only single-base polymorphisms and not, for example, small indels.  Users of these tools should also take care when calling adjacent pairs of SNPs/SNVs, because many callers do not report the phasing of these (or of more distant SNPs).
</big>
 
</big>
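
Because phasing of neighbouring SNPs/SNVs is often not reported, it can be useful to flag adjacent calls for manual inspection. The following is only a sketch: it assumes the calls have already been reduced to (chromosome, position) pairs, and the function name and the `max_gap` parameter are hypothetical.

```python
def adjacent_snp_pairs(variants, max_gap=1):
    """Return pairs of calls on the same chromosome at most max_gap bases apart.

    variants: iterable of (chrom, pos) tuples, in any order
    """
    ordered = sorted(variants)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left[0] == right[0]
        gap = right[1] - left[1]
        # gap == 0 would be the same site listed twice, so it is skipped.
        if same_chrom and 0 < gap <= max_gap:
            pairs.append((left, right))
    return pairs

calls = [("chr1", 100), ("chr1", 101), ("chr1", 500), ("chr2", 101)]
for left, right in adjacent_snp_pairs(calls):
    print(left, right)
```

Pairs flagged this way still need read-level evidence (for example, reads spanning both sites) to decide whether the two variants lie on the same haplotype.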
= Decision Helper =
 
I want to quickly call SNPs against a reference
 
=> Freebayes, samtools
 
 
 
 
 
= Software Packages =
 
 
 
== Free Software ==
 
 
 
=== Freebayes ===
 
[[Freebayes]] is the successor of PolyBayes, GigaBayes and BamBayes and should be much faster than these. Like them, it relies on BAM files. It has also been described in more detail by its developer on [http://biostar.stackexchange.com/questions/613/what-methods-do-you-use-for-in-del-snp-calling Biostar].
 
 
 
* Pros
 
** Very easy to run for simple SNP calling
 
** Does not assume a particular ploidy
 
** Can read BAM files via STDIN
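
As an illustration of how simple a basic Freebayes run is, here is a sketch that builds the command line from Python. It assumes the `freebayes` executable is on the PATH and uses the `-f` (reference FASTA), `-p` (ploidy) and `--stdin` options as described in the Freebayes documentation; check them against your installed version. The file names and helper functions are hypothetical.

```python
import shlex
import subprocess

def freebayes_command(reference_fasta, bam_path=None, ploidy=None):
    """Build a freebayes argument list.

    bam_path=None means the BAM data will be supplied on STDIN.
    """
    cmd = ["freebayes", "-f", reference_fasta]
    if ploidy is not None:
        cmd += ["-p", str(ploidy)]
    cmd.append("--stdin" if bam_path is None else bam_path)
    return cmd

def run_freebayes(cmd, vcf_path, bam_stream=None):
    """Run freebayes and write its VCF output to vcf_path."""
    with open(vcf_path, "w") as out:
        subprocess.run(cmd, stdin=bam_stream, stdout=out, check=True)

# Only print the command here; nothing is executed.
print(shlex.join(freebayes_command("ref.fa", "sample.bam")))
```

To actually call variants, pass the command to `run_freebayes`, for example `run_freebayes(freebayes_command("ref.fa", "sample.bam"), "sample.vcf")`.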
 
 
 
=== GATK ===
 
The Genome Analysis Toolkit ([[GATK]]) supports multi-step variant calling pipelines. The authors used their pipeline for variant calling on the NA12878 exome data set and compared their results to those of [[Crossbow]] (which uses [[SOAPsnp]]). Based on these results they concluded that Crossbow had lower specificity.
 
 
 
One easy way to run GATK and other tools might be to use this [https://github.com/vlandham/variant_pipeline variant pipeline] mentioned on [http://biostar.stackexchange.com/questions/8260/workflow-or-tutorial-for-snp-calling Biostar].
 
 
 
* Important reminder
 
** If you run the GATK framework in your own pipeline, bear in mind that GATK has stringent file formatting requirements.

** For example, the chromosomes in the genome reference file have to be in [http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK canonical] order.

** A BAM header has to be present in every BAM file.

** The BAM file has to be sorted, preferably with Picard, because it writes the proper header after sorting.

** A read-group tag has to be present in each BAM file. Set the correct tag during mapping, or you may waste time fixing the BAM file afterwards.
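
The header-related requirements above can be checked before starting a GATK run. The sketch below works on the header text of a BAM file (as printed, for example, by `samtools view -H`) and relies only on the standard SAM header fields `@HD SO:`, `@SQ SN:` and `@RG`. The function name is hypothetical, and the check only reports what the header claims; it does not verify that the records are really sorted.

```python
def check_sam_header(header_text, expected_contigs=None):
    """Return a list of problems found in a SAM/BAM header.

    header_text: header lines as one string
    expected_contigs: contig names in reference order, if known
    """
    lines = [ln for ln in header_text.splitlines() if ln.startswith("@")]
    if not lines:
        return ["no header lines found"]
    problems = []
    # Sort order is declared in the SO: field of the @HD line.
    sort_order = None
    for ln in lines:
        if ln.startswith("@HD"):
            for field in ln.split("\t")[1:]:
                if field.startswith("SO:"):
                    sort_order = field[3:]
    if sort_order != "coordinate":
        problems.append("header does not declare coordinate sorting")
    # At least one read group must be defined.
    if not any(ln.startswith("@RG") for ln in lines):
        problems.append("no @RG read-group line")
    # Contig order in the BAM should match the reference.
    contigs = [
        field[3:]
        for ln in lines if ln.startswith("@SQ")
        for field in ln.split("\t")[1:] if field.startswith("SN:")
    ]
    if expected_contigs is not None and contigs != list(expected_contigs):
        problems.append("contig order differs from the reference")
    return problems

header = "@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr2\tLN:900\n@SQ\tSN:chr1\tLN:1000\n"
for problem in check_sam_header(header, ["chr1", "chr2"]):
    print(problem)
```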
 
 
 
* Pros
 
** Likely relatively specific
 
* Cons
 
** Relatively complex pipelines
 
 
 
=== MAQ ===
 
[[MAQ]]
 
 
 
=== samtools ===
 
[[samtools]] calls variants using the pileup or mpileup pipeline:
 
http://samtools.sourceforge.net/mpileup.shtml
 
 
 
The following thread describes some potential problems that occur when using the BAQ parameter. In effect, it recommends turning BAQ off if one uses an aligner, such as BWA, that already finds indels.
 
http://seqanswers.com/forums/showthread.php?t=11965
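
For reference, BAQ is disabled in `samtools mpileup` with the `-B` option. The sketch below builds such a command from Python. It assumes the `-u` and `-f` options as shown on the mpileup page linked above; newer samtools releases moved variant calling to `bcftools mpileup`, so check the flags against your installed version. The helper name and file names are hypothetical.

```python
import shlex

def mpileup_command(reference_fasta, bam_path, disable_baq=False):
    """Build a samtools mpileup argument list, optionally with BAQ turned off."""
    cmd = ["samtools", "mpileup", "-u", "-f", reference_fasta]
    if disable_baq:
        # -B disables the base alignment quality (BAQ) computation.
        cmd.append("-B")
    cmd.append(bam_path)
    return cmd

# Only print the command here; nothing is executed.
print(shlex.join(mpileup_command("ref.fa", "aln.bam", disable_baq=True)))
```

In the workflow described on the mpileup page, the output of this command is piped into bcftools to produce the actual variant calls.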
 
 
 
 
 
 
 
=== SOAPsnp ===
 
[[SOAPsnp]] is used, for example, in the [[Crossbow]] pipeline.
 
 
 
=== SNVMix ===
 
[[SNVMix]]
 
The authors of SNVMix compared their tool to MAQ v0.6.8 and reported better performance, as judged by area under the curve, when using Affymetrix SNP 6.0 data.
 
 
 
== Commercial Software ==
 
[[CLCBio]]
 
 
 
== Further Reading Material and References ==
 

Latest revision as of 13:53, 13 November 2012
