Ibis

Application data

Created by: Kircher M, Stenzel U, Kelso J
Biological application domain(s): Sequencing
Principal bioinformatics method(s): Base-calling
Created at: Max Planck Institute for Evolutionary Anthropology
Maintained?: Yes
Input format(s): Intensity files (int.txt.gz, int.txt.p.gz, cif) and cluster coordinates (idx, pos)
Output format(s): FASTQ
Software features: Statistical learning of base calling parameters and calibrated quality scoring
Programming language(s): Python, C, C++
Licence: Non-commercial
Operating system(s): Linux, Windows (Cygwin)

Summary: Ibis (Improved base identification system) is an accurate, fast and easy-to-use base caller for the Illumina sequencing system, which significantly reduces the error rate and increases the output of usable reads. Ibis is faster than alternative base callers such as AltaCyclic or Rolexa and makes fewer assumptions about chemistry and technology.

The intensity values extracted during image analysis are the actual input for base calling on Illumina systems. Due to technical and chemical limitations, these intensities show two main effects that have to be considered in the base calling process:

(1) A strong correlation of the A and C intensities as well as of the G and T intensities due to similar emission spectra of the fluorophores and limited separation by the filters used.

(2) Dependence of the signal of a specific cycle on the signal of the cycles before and after it, called phasing and pre-phasing respectively. Phasing and pre-phasing are caused by incomplete removal of the 3' terminators and fluorophores, sequences in the cluster missing an incorporation cycle, as well as by the incorporation of nucleotides without effective 3' terminators.

The Illumina base caller (Bustard) uses a so-called crosstalk matrix, estimated from the first and second imaging cycle, to orthogonalize the correlated channels. This matrix is also used to scale the different intensities measured for each of the fluorophores. The estimation of the crosstalk matrix is based on the assumption that the four nucleotides are almost equally frequent in the library being sequenced. If the sample does not fulfill this assumption, the estimate can be inaccurate and lead to incorrect base calling. Bustard estimates phasing and pre-phasing as two channel-independent parameters from the increasing correlation of intensities in the first few cycles of the sequencing run. Using the crosstalk matrix and the two phasing parameters, it creates corrected intensity values and calls the base with the highest corrected intensity for each cluster and cycle. If the intensity values are equal or differ only slightly, an N is called.
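The crosstalk correction and the "highest corrected intensity" rule described above can be illustrated with a minimal sketch. It is not Bustard's implementation: the matrix values, the `min_difference` threshold and the function names are hypothetical, and the phasing correction is left out entirely.

```python
BASES = "ACGT"

# Hypothetical crosstalk matrix (made-up values). Row i is the channel that is
# measured, column j is the base that emits. A/C and G/T leak into each other.
CROSSTALK = [
    [1.0, 0.5, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.4],
    [0.0, 0.0, 0.2, 1.0],
]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def call_base(observed, min_difference=0.1):
    """Undo the crosstalk, then call the strongest channel, or N if it is not clear.

    min_difference is an assumed threshold, not a documented Bustard parameter.
    """
    corrected = solve(CROSSTALK, observed)
    ranked = sorted(zip(corrected, BASES), reverse=True)
    (best, base), (second, _) = ranked[0], ranked[1]
    if best - second <= min_difference:
        return "N"
    return base


print(call_base([100.0, 30.0, 0.0, 0.0]))  # an A signal leaking into the C channel
print(call_base([0.0, 0.0, 0.0, 0.0]))     # no usable signal
```

If the assumed matrix does not match the real dye behaviour, the corrected values are wrong as well, which is the weakness of the equal-frequency assumption discussed above.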

The Bustard base calling uses a specific model of the sequencing process for estimating and correcting the raw intensity values extracted from the images. There are at least two effects observed in the intensities that Bustard does not explicitly model/correct for:

(1) Decreasing intensity values over the course of the run, due to the degradation of the fluorophores, or the effect of a decreasing number of sequences being elongated in each cluster when nucleotides for which the termination cannot be removed are incorporated.

(2) Unequal effects of phasing for the four nucleotides (e.g. T accumulation - in chemistries FC-104-100x or FC-204-20xx due to a lower removal rate of T fluorophores).

Ibis performs considerably better in base calling because of its model-independent training process. It relies only on the assumption that the vast majority of the signal needed for base calling is captured by the intensity values of the previous, the current and the next cycle, and it learns everything else from a training data set using a statistical learning approach. It thus uses cycle-specific model parameters. All other published base callers (i.e. Bustard, AltaCyclic and Rolexa) use a specific model of the whole sequencing process to estimate and correct the raw intensity values extracted from the images before base calling. If their models are incomplete or their model assumptions are not fulfilled, their results are imperfect.
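The three-cycle window can be sketched as follows. This is an illustration only: the function name is hypothetical, and padding the missing neighbour of the first and last cycle with zeros is an assumption, since the text does not say how Ibis treats the ends of a read.

```python
def window_features(cluster_intensities, cycle):
    """Return the 12 intensity values of the previous, current and next cycle.

    cluster_intensities: list with one [A, C, G, T] intensity list per cycle.
    cycle: 0-based cycle index.
    Assumption: neighbours beyond either end of the read are padded with zeros.
    """
    padding = [0.0, 0.0, 0.0, 0.0]
    features = []
    for index in (cycle - 1, cycle, cycle + 1):
        if 0 <= index < len(cluster_intensities):
            features.extend(cluster_intensities[index])
        else:
            features.extend(padding)
    return features


# Made-up intensities for one cluster over three cycles.
cluster = [
    [900.0, 120.0, 10.0, 5.0],
    [80.0, 40.0, 700.0, 310.0],
    [30.0, 650.0, 20.0, 15.0],
]
first = window_features(cluster, 0)
middle = window_features(cluster, 1)
print(len(middle), middle[4:8])
```

Each cycle gets its own model, so a vector like `middle` would only ever be given to the classifier trained for that cycle.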

The gain in speed compared to packages like AltaCyclic or Rolexa comes from the fast extraction of a training data set with the mapping tool SOAP v1.11 and from the use of a fast Support Vector Machine (SVM) package. For the SVM classifiers of each cycle, we use SVMmulticlass, the computationally fast implementation of multiclass SVMs by Thorsten Joachims, in this case with linear kernels.
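Part of the reason linear kernels are cheap is that prediction reduces to one dot product per class. The sketch below shows that idea in general terms; it does not use SVMmulticlass or its model file format, the weights are invented rather than trained, and `predict_base` is a hypothetical name.

```python
BASES = "ACGT"


def predict_base(weights, features):
    """Score every base with a dot product and return the best base and all scores."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features))
        for class_weights in weights
    ]
    best = max(range(len(BASES)), key=lambda i: scores[i])
    return BASES[best], scores


# Toy model for a single cycle: one weight vector per base over 12 features
# (previous, current, next cycle). Invented values that only look at the
# current cycle and subtract half of the correlated channel.
zeros = [0.0, 0.0, 0.0, 0.0]
toy_weights = [
    zeros + [1.0, -0.5, 0.0, 0.0] + zeros,   # A
    zeros + [-0.5, 1.0, 0.0, 0.0] + zeros,   # C
    zeros + [0.0, 0.0, 1.0, -0.5] + zeros,   # G
    zeros + [0.0, 0.0, -0.5, 1.0] + zeros,   # T
]

features = [900.0, 120.0, 10.0, 5.0,    # previous cycle
            80.0, 40.0, 700.0, 310.0,   # current cycle
            30.0, 650.0, 20.0, 15.0]    # next cycle
base, scores = predict_base(toy_weights, features)
print(base, scores)
```

In a trained model the weights on the previous and next cycle would generally be non-zero, which is how effects such as phasing could be absorbed without modelling them explicitly.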

Which input files does Ibis need? How does it work?

Intensity values and cluster identifiers are stored in one of three file formats, depending on the analysis pipeline version used. Each of them can serve as input to the base calling process; however, Ibis does not support a mixture of different file formats as input:

(1) The format created by the IPAR software stores the intensity values in one file per lane and tile, with the four intensity values of a cluster per line and #-comment lines separating the different cycles. The coordinates of each cluster are saved one per line, in the very same order, in a corresponding idx-file or pos-file, depending on the exact SCS version and GA Pipeline version.

(2) The format created by the Firecrest program stores the same information in one file per lane and tile, with one line per cluster: the line starts with the identifying quadruple, followed by the cycles separated by tabs, with the four intensity values of each cycle separated by spaces.

(3) The binary CIF format and the corresponding pos-files created by RTA (GA analysis pipeline 1.4 and higher).
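As an illustration of format (2), the following sketch parses a single line of a Firecrest-style intensity file. It is not taken from the Ibis source. It assumes that the identifying quadruple consists of lane, tile, x and y and that these four fields are tab-separated like the cycles; the function name is hypothetical, and decompressing the .gz file is not shown.

```python
def parse_firecrest_line(line):
    """Split one cluster line into its identifier and per-cycle intensities.

    Assumed layout: lane, tile, x, y, then one field per cycle, all separated
    by tabs; each cycle field holds four space-separated intensity values.
    """
    fields = line.rstrip("\n").split("\t")
    identifier = tuple(fields[:4])
    cycles = [[float(value) for value in field.split()] for field in fields[4:]]
    for values in cycles:
        if len(values) != 4:
            raise ValueError("expected four intensity values per cycle")
    return identifier, cycles


# Made-up example line with two cycles.
example = "1\t42\t103\t867\t900.0 120.0 10.0 5.0\t80.0 40.0 700.0 310.0\n"
identifier, cycles = parse_firecrest_line(example)
print(identifier, len(cycles))
```

The identifier fields are kept as strings on purpose, so that they can be reused unchanged when naming the read later.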

In addition to the intensity files, Ibis also needs a training data set, which can be created from the Bustard base calling results. For this purpose, Ibis extracts raw sequences from the Bustard folder of the sequencing run, which, depending on the Genome Analyzer Pipeline version, are available either as qseq files or as seq files. For a fraction of the tiles, the Bustard raw reads are aligned to the corresponding reference (in general PhiX 174 RF1) using the fast mappers SOAP v1.11 or bowtie. For each mapped sequence, the sequence of the reference is considered to be correct, and the raw intensity values of the previous, the current and the next cycle are extracted from the intensity files. For each cycle/position of the read, one model consisting of four support vector machines is trained. These models are later applied to the data of the complete run using a new C++ interface to the SVMmulticlass package. For each cluster in the intensity files, this program then creates an entry in a FastQ file containing the sequence and PHRED-like quality scores in the Sanger encoding (offset 33).
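The Sanger encoding mentioned above stores each quality score as the ASCII character with code score + 33. A minimal sketch of how one FastQ entry could be assembled is shown here; the function name and the read name layout are hypothetical and are not the naming scheme Ibis uses.

```python
def fastq_entry(name, sequence, qualities, offset=33):
    """Format one FastQ record with PHRED-like scores in Sanger encoding."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality list must have the same length")
    # Each score becomes the character with ASCII code (score + offset).
    quality_string = "".join(chr(score + offset) for score in qualities)
    return "@%s\n%s\n+\n%s\n" % (name, sequence, quality_string)


# Made-up read: scores 40, 30, 20 and 0 map to the characters I, ?, 5 and !.
entry = fastq_entry("lane1_tile42_103_867", "ACGN", [40, 30, 20, 0])
print(entry, end="")
```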

What are the technical requirements for Ibis? With which versions of the Genome Analyzer pipeline does it work?

Ibis is based on Python scripts (tested with 2.5.2) and C++ programs (including the SVMmulticlass C package) and runs on normal x86 architectures. Multi-core systems are recommended but not required. The training process creates large temporary files (up to several gigabytes, depending on the read length and the size of the sequencing run) but typically needs little RAM (< 500 MB). The final models used for prediction are typically smaller than 5 MB in total. In the final base calling step, several prediction instances can be started in parallel on multi-core systems; depending on the number of cycles, each instance needs about 500 to 2000 MB of RAM. Ibis has been tested on Illumina pipeline versions 0.3.0, 1.0, 1.3.2 and 1.4.0. Recently, Illumina showed that it is possible to port their pipeline to Microsoft Windows systems using Cygwin. We have not tested this with Ibis; however, with an appropriate Cygwin installation, the installation procedure should apply directly to Windows systems.

Which license is applied?

Even though it is distributed as one package, the source in the Ibis package is covered by different licences. The authors obtained permission from Thorsten Joachims to distribute Ibis packaged with SVM struct; however, SVM struct is granted free of charge only for non-commercial research and education purposes. You must therefore obtain a license from the SVM struct author to use it for commercial purposes; in this case, please contact Thorsten Joachims directly via his website http://svmlight.joachims.org/. The Ibis authors are currently testing alternative packages to avoid this limited license.

All other parts, i.e. the interface to the SVM struct package (svm_struct_classify_firecrest.cpp), the Python scripts and SOAP v1.11, are licensed under GPLv3. Gzstream is licensed under GPL v2.3.

References

  1. . 2009. Genome Biology

