HTSeq

From SEQwiki

Application data

Created by: Anders S
Principal bioinformatics method(s):
Created at: EMBL
Maintained?: Yes
Programming language(s): Python
Licence: GPLv3

Summary: Python framework to process and analyse high-throughput sequencing (HTS) data


With the many short-read aligners now available, HTS analysis may seem simple. In practice, however, data often need to be converted, tweaked, filtered, or otherwise pre-processed before they can be passed to an aligner, and the alignment results need similar processing before the required statistical analysis can be done.

HTSeq is meant to make such tasks easy and convenient, and so to act as "glue" between aligners and other existing tools.

Some examples of typical use cases for HTSeq:

Quality assessment of reads: Check how the proportions of base calls and the quality scores depend on the position in the reads, stratified by alignment status.
Counting: How many reads fall onto each exon, or each gene? For such tasks, you may want to design and implement rules on how to deal with overlapping features or ambiguous assignments.
Calculating coverage: HTSeq helps you not only to produce a Wiggle file for visualization in a genome browser, but also to compute customized statistics on the coverage.
Multiple alignments: Many aligners can output multiple alignments for each read, but what should be done with them? HTSeq makes it easy to implement post-processing that chooses the right alignment according to your criteria.
Adapter trimming: In miRNA-Seq, you often sequence into the adapter at the other end and need to cut this off before aligning. In multiplexed sequencing, you may need to cut off the multiplex tag and sort the reads by it.
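
The counting use case depends on the rule chosen for reads that overlap several features. The following is a minimal sketch of one such rule in plain Python. It does not use HTSeq and is not htseq-count's implementation. The function names, the "no_feature" and "ambiguous" labels, the half-open coordinate convention, and the example data are all assumptions made for illustration.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_reads(reads, features):
    """Assign each read to a feature only if exactly one feature overlaps it.

    reads:    iterable of (chrom, start, end) tuples
    features: dict mapping feature name -> (chrom, start, end)
    """
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and overlaps(start, end, f_start, f_end)
        }
        if not hits:
            counts["no_feature"] += 1   # read overlaps nothing
        elif len(hits) == 1:
            counts[hits.pop()] += 1     # unique assignment
        else:
            counts["ambiguous"] += 1    # overlaps several features
    return counts


# Hypothetical example data: two overlapping genes on chr1.
features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 150, 300)}
reads = [
    ("chr1", 110, 140),  # only geneA
    ("chr1", 160, 190),  # geneA and geneB
    ("chr1", 250, 280),  # only geneB
    ("chr2", 110, 140),  # different chromosome
]
counts = count_reads(reads, features)
print(dict(counts))
```

A different rule, for instance counting an ambiguous read for every feature it overlaps, would only change the last branch. The linear scan over all features is kept for clarity and would be too slow for a real annotation.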

Have a look and give it a try: http://www-huber.embl.de/users/anders/HTSeq/

To use HTSeq you only need a basic understanding of Python, as can be obtained by reading the first few chapters of a Python book. For users without programming knowledge, stand-alone scripts for common tasks are provided: htseq-count to count the overlap of reads with features (such as exons), htseq-qa to get a quick overview of the quality of your sequencing run, and htseq-bedgraph (coming soon) to convert an alignment file into a Bedgraph Wiggle file for visualization with a genome browser.
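
To give an idea of what a quality overview of a sequencing run involves, here is a minimal sketch in plain Python that reads FASTQ records one at a time and averages the quality scores per read position. It does not use HTSeq and is not how htseq-qa is implemented. It assumes four-line FASTQ records with Phred+33 quality encoding, and the names FastqRecord, read_fastq and mean_quality_by_position are hypothetical.

```python
import io
from collections import namedtuple

# Hypothetical record type: read name, base sequence, list of integer qualities.
FastqRecord = namedtuple("FastqRecord", "name seq qual")


def read_fastq(handle, offset=33):
    """Yield one FastqRecord at a time from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:            # end of file (or a blank line) stops the loop
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()         # skip the '+' separator line
        qual_line = handle.readline().rstrip("\n")
        qual = [ord(c) - offset for c in qual_line]
        yield FastqRecord(header[1:], seq, qual)


def mean_quality_by_position(records):
    """Return the mean quality score at each read position."""
    sums, counts = [], []
    for rec in records:
        for i, q in enumerate(rec.qual):
            if i == len(sums):    # first read that reaches this position
                sums.append(0)
                counts.append(0)
            sums[i] += q
            counts[i] += 1
    return [s / n for s, n in zip(sums, counts)]


# Made-up example: two reads of different length ('I' = Q40, '5' = Q20).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACG\n+\n5II\n")
means = mean_quality_by_position(read_fastq(example))
print(means)  # [30.0, 40.0, 40.0, 40.0]
```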

For programmers, HTSeq has been designed to keep things simple: All classes have extensive reference documentation, and a tutorial demonstrates their use. All supported file formats (Fasta, Fastq, SAM, SolexaPipeline files, GFF, GTF, etc.) can be read in a loop, providing an object describing one record at a time to the loop body. This object describes the data in a convenient and consistent way.

The 'GenomicArray' class is the Swiss army knife of HTSeq. It is a container that can efficiently store anything that has a position on the genome: integers to represent coverage, objects with feature data to represent exons, sets of objects to handle overlapping features, etc.
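
One way a position-indexed container can stay compact is to store runs of constant value rather than one value per base. The following toy sketch in plain Python illustrates that idea for coverage. It is not HTSeq's GenomicArray and makes no claim about how that class is implemented. The class ToyCoverage, its methods, and the half-open coordinates are assumptions made for illustration.

```python
from collections import defaultdict


class ToyCoverage:
    """Toy coverage store that records only where the value changes."""

    def __init__(self):
        # chrom -> {position: change in coverage at that position}
        self._deltas = defaultdict(lambda: defaultdict(int))

    def add(self, chrom, start, end, value=1):
        """Add `value` to every position in the half-open interval [start, end)."""
        self._deltas[chrom][start] += value
        self._deltas[chrom][end] -= value

    def steps(self, chrom):
        """Yield (start, end, value) runs of constant, non-zero coverage."""
        level = 0
        prev = None
        for pos, delta in sorted(self._deltas.get(chrom, {}).items()):
            if delta == 0:        # changes cancelled out, no step boundary here
                continue
            if level != 0:
                yield (prev, pos, level)
            level += delta
            prev = pos


# Made-up example: two overlapping reads on chr1.
cov = ToyCoverage()
cov.add("chr1", 10, 20)
cov.add("chr1", 15, 30)
runs = list(cov.steps("chr1"))
print(runs)  # [(10, 15, 1), (15, 20, 2), (20, 30, 1)]
```

Each run maps naturally onto one line of a Bedgraph-style track, and memory use grows with the number of changes in coverage rather than with chromosome length.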