FlowSim

Application data

Principal bioinformatics method(s): Sequence error correction, Modelling and simulation
Technology: 454
Maintained?: Maybe
Programming language(s): Haskell

Summary: Tool for simulating errors in 454 sequencing data

Description

FlowSim is a suite of tools for simulating the 454 pyrosequencing process. It is based on the characteristics of real 454 data, and attempts to model the known aspects of the process.

The following sections describe the utilities that comprise FlowSim.

clonesim

Clonesim simulates the shearing step, which typically breaks an input genome into random fragments. It supports user-selected length distributions (e.g. uniform, normal, or log-normal), but at this point fragment positions can only be uniformly distributed.
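The shearing step can be illustrated with a short Python sketch. This is not clonesim's implementation: the function name `simulate_clones`, its parameters, and the choice of a normal length distribution with rejection of impossible lengths are all assumptions made for illustration.

```python
import random

def simulate_clones(genome, n_clones, mean_len=500.0, sd_len=50.0, seed=None):
    """Draw fragments with normally distributed lengths and uniform start positions.

    Hypothetical helper, not part of FlowSim. Lengths that cannot fit in the
    genome are rejected and redrawn, so mean_len should be well below len(genome).
    """
    rng = random.Random(seed)
    clones = []
    while len(clones) < n_clones:
        length = round(rng.gauss(mean_len, sd_len))
        if length < 1 or length > len(genome):
            continue  # reject lengths that cannot fit in the genome
        # Uniformly distributed start position among all valid offsets.
        start = rng.randint(0, len(genome) - length)
        clones.append((start, genome[start:start + length]))
    return clones

if __name__ == "__main__":
    toy_genome = "ACGT" * 500
    for start, fragment in simulate_clones(toy_genome, 3, 300, 30, seed=1):
        print(start, len(fragment))
```

Swapping `rng.gauss` for `rng.uniform` or `rng.lognormvariate` would give the other length distributions mentioned above.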

gelfilter

Often, a gel is used to select a certain range of clone lengths. This (embarrassingly simple) utility takes a set of input sequences and removes those that are shorter than the minimum size or longer than the maximum size. It currently only reads from stdin and writes to stdout.
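The length filter is simple enough to sketch in a few lines of Python. The names `parse_fasta` and `gel_filter` are hypothetical, FASTA input is assumed, and treating both bounds as inclusive is an assumption rather than documented gelfilter behaviour.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def gel_filter(records, min_len, max_len):
    """Keep records whose sequence length lies in [min_len, max_len] (assumed inclusive)."""
    for name, seq in records:
        if min_len <= len(seq) <= max_len:
            yield name, seq
```

To mimic the stdin/stdout behaviour, one could pass `sys.stdin` to `parse_fasta` and print each surviving record as a `>name` line followed by its sequence.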

kitsim

As a preliminary step to sequencing, synthetic sequences are attached to the ends of each clone. For 454, the A-adapter is attached to the 5’ end, and the B-adapter is attached to the 3’ end. These adapters contain the primers for the emulsion PCR amplification, which copies each clone in sufficient quantity for the light signal from luciferase to be detectable during sequencing.

The A-adapter is found at the beginning of each sequence as the TCAG “key”, while the B-adapter is sometimes found at the end of a sequence when the clone is short enough to be sequenced in full.

By default, kitsim uses Titanium adapters, but this is user selectable.
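As a rough illustration of adapter attachment, the Python sketch below concatenates an A-adapter, the clone, and a B-adapter. The function `attach_adapters` is hypothetical, and the two adapter strings are made-up placeholders, not the real Titanium sequences; only the trailing TCAG key on the A side comes from the description above.

```python
# Placeholder sequences for illustration only; these are NOT the Titanium adapters.
PLACEHOLDER_A_ADAPTER = "GGCCTTAA" + "TCAG"  # ends in the TCAG key
PLACEHOLDER_B_ADAPTER = "AACCGGTT"

def attach_adapters(clone, a_adapter=PLACEHOLDER_A_ADAPTER,
                    b_adapter=PLACEHOLDER_B_ADAPTER):
    """Return the clone with the A-adapter at its 5' end and the B-adapter at its 3' end."""
    return a_adapter + clone + b_adapter

if __name__ == "__main__":
    print(attach_adapters("ACGTACGT"))
```

Passing different `a_adapter` and `b_adapter` arguments corresponds to selecting a different kit.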

mutator

In theory, these sequences are then sequenced, and all errors are introduced in the final stage, owing to overlapping distributions of the light intensity generated by different homopolymer lengths (see below). In practice, we find evidence for mutations in the sequences (Balzer et al, submitted), so we provide a general utility for introducing random substitutions and indels into the sequences.
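A minimal Python sketch of random substitutions and indels follows. The function `mutate`, its per-base rate parameters, and the independent per-position error model are assumptions for illustration; the mutator utility may use a different model.

```python
import random

BASES = "ACGT"

def mutate(seq, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, seed=None):
    """Apply independent per-base substitutions, insertions, and deletions.

    Hypothetical error model: each base is deleted with probability del_rate,
    otherwise substituted with probability sub_rate; after each surviving base
    a random base is inserted with probability ins_rate.
    """
    rng = random.Random(seed)
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: pick one of the other three bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # insertion after this position
    return "".join(out)

if __name__ == "__main__":
    print(mutate("ACGTACGTACGTACGT", sub_rate=0.1, seed=42))
```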

duplicator

Many authors have reported that second-generation sequencing (not only 454) generates artificial duplicates of many clones. This may be caused by an insufficient amount of input DNA. The duplicator is an attempt to simulate this effect. Currently, it only supports stdin/stdout.
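One simple way to picture artificial duplicates is to re-emit each clone a random number of extra times. The Python sketch below does this with a geometric number of copies; the function `duplicate` and the parameter `dup_prob` are hypothetical, and the geometric model is an assumption, not the duplicator's documented behaviour.

```python
import random

def duplicate(clones, dup_prob=0.1, seed=None):
    """Emit each clone once, then extra copies while a coin with P=dup_prob comes up heads.

    dup_prob must be below 1, otherwise the inner loop would never end.
    """
    if not 0.0 <= dup_prob < 1.0:
        raise ValueError("dup_prob must be in [0, 1)")
    rng = random.Random(seed)
    out = []
    for clone in clones:
        out.append(clone)
        while rng.random() < dup_prob:
            out.append(clone)  # artificial duplicate of the same clone
    return out

if __name__ == "__main__":
    print(duplicate(["AAAC", "GGTT", "CCGA"], dup_prob=0.5, seed=7))
```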

flowsim

The final stage is the actual pyrosequencing process, in which each input clone is converted to a series of light signals whose intensities correspond to homopolymer lengths. Each homopolymer length is converted to a flow value, which is adjusted according to its flow distribution. The resulting flows are then base called, quality filters are applied (including adapter masking), and the resulting reads are written to an SFF file.
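The conversion from sequence to flows and back can be sketched in Python as follows. This is a simplified illustration, not flowsim's model: the function names are hypothetical, the TACG flow order is assumed, the noise is a plain normal distribution whose spread grows linearly with homopolymer length (`sd_base` and `sd_slope` are made-up parameters), and quality filtering, adapter masking, and SFF output are omitted.

```python
import random

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order

def sequence_to_flows(seq, flow_order=FLOW_ORDER):
    """Ideal flowgram: the homopolymer length incorporated at each flow."""
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError("bases not in flow order: %s" % sorted(unknown))
    flows, pos, i = [], 0, 0
    while pos < len(seq):
        nuc = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(seq) and seq[pos] == nuc:
            run += 1
            pos += 1
        flows.append(run)  # 0 means no incorporation in this flow
        i += 1
    return flows

def add_noise(flows, sd_base=0.05, sd_slope=0.03, seed=None):
    """Perturb each ideal flow value with normal noise (hypothetical model)."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(n, sd_base + sd_slope * n)) for n in flows]

def base_call(flow_values, flow_order=FLOW_ORDER):
    """Round each flow value to the nearest homopolymer length and emit bases."""
    called = []
    for i, value in enumerate(flow_values):
        called.append(flow_order[i % len(flow_order)] * int(value + 0.5))
    return "".join(called)

if __name__ == "__main__":
    ideal = sequence_to_flows("TCAGGGTTA")
    noisy = add_noise(ideal, seed=1)
    print(ideal)
    print(base_call(noisy))
```

With larger noise, neighbouring homopolymer lengths overlap and rounding yields the over- and undercalls characteristic of 454 reads.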





Links


References

  1. . 2011. Bioinformatics

