21245279


This reference describes Mzip.

PMID 21245279
Title Efficient storage of high throughput sequencing data using reference-based compression.
Year 2011
Journal Genome Research
Author Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E.
Volume
Start page






Full text description

Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for re-sequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It thus provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
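
The core idea in the description above, storing only how an aligned read differs from the reference, can be illustrated with a minimal sketch. This is not the authors' implementation or file format. The names `EncodedRead`, `encode_read`, and `decode_read` are hypothetical, the alignment position is assumed to be already known, and only substitutions are handled (no insertions, deletions, quality scores, or unaligned reads).

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    """Hypothetical record: a read stored as differences from the reference."""
    pos: int                      # 0-based alignment start on the reference
    length: int                   # read length
    subs: List[Tuple[int, str]]   # (offset within read, observed base)


def encode_read(reference: str, read: str, pos: int) -> EncodedRead:
    """Keep only the positions where the read differs from the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    subs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return EncodedRead(pos, len(read), subs)


def decode_read(reference: str, enc: EncodedRead) -> str:
    """Rebuild the read by copying the reference window and applying the substitutions."""
    bases = list(reference[enc.pos:enc.pos + enc.length])
    for offset, base in enc.subs:
        bases[offset] = base
    return "".join(bases)


# Toy data, invented for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGCAGATA"  # differs from reference[5:17] at a single position

encoded = encode_read(reference, read, pos=5)
restored = decode_read(reference, encoded)
```

In this toy case the 12-base read reduces to a start position, a length, and one substitution. A read that matches the reference exactly needs no substitution entries at all, so in this sketch the stored size depends on the number of differences rather than on the read length.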