GenBank
GenBank is the global repository of publicly available nucleotide sequences (wp:GenBank), e.g. gbNuc:AC233532, entrez:AC233532.
See also:
Note: The nucleotide entries in GenBank are mirrored in EMBL.
Rough notes on GenBank
Is the unifying concept a division?
An accession number stays the same for a given clone as its sequence is updated from phase 1 to phase 3. Only the underlying GenInfo identifier (gi) changes, allowing previous versions of the sequence to be referenced.
Types of GenBank record
For example:
Chromosome records
Example:
LOCUS       CM000224   96160479 bp   DNA   linear   CON 10-AUG-2007
DEFINITION  Mus musculus chromosome 16, whole genome shotgun sequence.
ACCESSION   CM000224 AAHY01000000
VERSION     CM000224.2  GI:74229896
PROJECT     GenomeProject:11785
KEYWORDS    WGS.
SOURCE      Mus musculus (house mouse)
Actually a chromosome record is usually a CONtig record...
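The VERSION line of the example record carries both the accession.version pair and the gi. As a rough sketch (the function name parse_version_line and the regular expression are illustrative assumptions, not an NCBI tool), the three parts can be pulled apart like this:

```python
import re

# Assumed layout: "VERSION  <accession>.<version>  GI:<gi>", as in the example record.
VERSION_RE = re.compile(r"VERSION\s+([A-Z_]+\d+)\.(\d+)(?:\s+GI:(\d+))?")

def parse_version_line(line):
    """Split a GenBank VERSION line into accession, version number and gi."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return {
        "accession": accession,          # stable across sequence updates
        "version": int(version),         # incremented when the sequence changes
        "gi": int(gi) if gi else None,   # identifies this particular version
    }

record = parse_version_line("VERSION     CM000224.2  GI:74229896")
print(record)
```

The accession is the part to store when a stable reference is wanted; the version or gi pins a specific revision of the sequence.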
Contig records
- Phase 1
- The definition line reads: ..., *** SEQUENCING IN PROGRESS ***, n unordered pieces, and the keywords "HTG; HTGS_PHASE1." are included in the KEYWORDS field. The sequences are found in the HTG division.
- Phase 2
- The definition line reads: ..., *** SEQUENCING IN PROGRESS ***, n ordered pieces, and the keywords "HTG; HTGS_PHASE2." are included in the KEYWORDS field. The sequences are found in the HTG division.
- Phase 3
- The definition line reads: ..., complete sequence. The KEYWORDS field is "HTG.", but the sequences are added to the relevant organismal division (such as Primate or Invertebrate).
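The phase rules above can be sketched as a small classifier over the KEYWORDS field. This is only an illustration of the three cases listed in these notes; the function name htgs_phase is made up here, and any other phase keyword is deliberately reported as unknown rather than guessed.

```python
def htgs_phase(keywords):
    """Return the HTG phase (1, 2 or 3) implied by a KEYWORDS field, else None."""
    # "HTG; HTGS_PHASE1." -> {"HTG", "HTGS_PHASE1"}
    terms = {term.strip() for term in keywords.strip().rstrip(".").split(";")}
    if "HTG" not in terms:
        return None  # not flagged as an HTG record at all
    phase_terms = {t for t in terms if t.startswith("HTGS_PHASE")}
    if not phase_terms:
        return 3  # a bare "HTG." is how these notes describe a finished record
    if phase_terms == {"HTGS_PHASE1"}:
        return 1
    if phase_terms == {"HTGS_PHASE2"}:
        return 2
    return None  # a phase keyword these notes do not cover

for field in ("HTG; HTGS_PHASE1.", "HTG; HTGS_PHASE2.", "HTG.", "WGS."):
    print(field, "->", htgs_phase(field))
```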
Types of GenBank database
Genome Project
http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj
For example,
- http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=genomeprj&term="Viridiplantae"[Organism]
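Search links of this form can be assembled programmatically. The sketch below simply mirrors the URL pattern used on this page (the helper name entrez_search_url is an assumption, and the endpoint itself may no longer resolve); it mainly shows that the quotes and square brackets in the search term need percent-encoding.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL copied from the links in these notes.
ENTREZ_BASE = "http://www.ncbi.nlm.nih.gov/sites/entrez"

def entrez_search_url(db, term):
    """Build a search URL for the given Entrez database and query term."""
    # urlencode percent-encodes quotes and brackets and turns spaces into "+".
    return ENTREZ_BASE + "?" + urlencode({"cmd": "Search", "db": db, "term": term})

url = entrez_search_url("genomeprj", '"Viridiplantae"[Organism]')
print(url)

# Round trip: decoding the query string gives back the original term.
query = parse_qs(urlsplit(url).query)
print(query["term"][0])
```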
High-Throughput Genomic Sequences (HTG)
The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community.
Information: http://www.ncbi.nlm.nih.gov/projects/HTGS/
Notes:
Phase 1 and 2 records are retrievable from the HTG division of GenBank. Phase 3 records are retrievable from the relevant organismal division (such as Primate or Invertebrate) and are included in the nr database [1].
Paper:
http://www.ncbi.nlm.nih.gov/projects/HTGS/paper.html
Whole Genome Shotgun sequencing projects
- Whole Genome Shotgun Sequence Submissions
- Each WGS project is assigned a stable 4-letter WGS project_ID, which does not change as the project is updated. In addition to the WGS project_ID, the contig identifiers have a version number corresponding to a particular project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS project_ID and version number. For instance, if a project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig).
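The accession layout described above (4-letter project_ID, 2-digit version, 6-digit contig number) can be sketched as a small parser. It assumes exactly that 12-character format and nothing else; the function name parse_wgs_accession and the "kind" labels are illustrative.

```python
import re

# Assumed format from the description above: 4 letters + 2 digits + 6 digits.
WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")

def parse_wgs_accession(accession):
    """Split a WGS accession into project ID, assembly version and contig number."""
    match = WGS_RE.match(accession)
    if match is None:
        raise ValueError(f"not a 4+2+6 WGS accession: {accession!r}")
    project_id = match.group(1)
    version = int(match.group(2))
    contig = int(match.group(3))
    if version == 0 and contig == 0:
        kind = "project"            # e.g. XXXX00000000
    elif contig == 0:
        kind = "assembly version"   # e.g. XXXX01000000
    else:
        kind = "contig"             # e.g. XXXX01000001
    return {"project_id": project_id, "version": version,
            "contig": contig, "kind": kind}

for acc in ("XXXX00000000", "XXXX01000000", "XXXX01000001", "AAHY01000000"):
    print(acc, parse_wgs_accession(acc))
```

AAHY01000000 is the secondary accession from the mouse chromosome 16 record shown earlier; under this scheme it reads as version 1 of project AAHY.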
What To Do
- Register your project with the Genome Project database (Done)
- Submit the contigs as the WGS project. A WGS project consists only of the contigs (overlapping reads) of a sequencing project, not of any supercontigs (assembled contigs separated by gaps). Supercontig or assembly information can be sent to NCBI in AGP format, which allows NCBI to make CON records that indicate how the pieces of the WGS submission are put together.
- Submit your reads to the Trace Archive database as this information is useful for the scientific community. Contact trace@ncbi.nlm.nih.gov for questions about submitting to the Trace Archive.
Trace Archive
Currently about 623 thousand 'traces' submitted for potato.
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=species_code%3D"SOLANUM TUBEROSUM"
Nucleotide / EST / GSS
Information: http://www.ncbi.nlm.nih.gov/dbGSS/index.html
Currently about 233 thousand records for potato.
For example, http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=solanum+tuberosum
Note that BAC ends should be submitted as GSS records.