Genbank

From SEQwiki

GenBank is the global repository for nucleotide sequences (wp:GenBank), e.g. gbNuc:AC233532, entrez:AC233532.

See also:


Note: The nucleotide entries in GenBank are mirrored in EMBL.


Rough notes on GenBank

Is the unifying concept a division?


An accession number stays the same for a given clone as its sequence is updated from phase 1 to phase 3. Only the underlying geninfo number (gi) changes, allowing previous versions of the sequence to be referenced.


Types of GenBank record

For example:

Annotated Contigs               | Annotated Scaffolds             | No Annotation
WGS contig with annotation      | WGS contig without annotation   | WGS contig without annotation
Chromosome CON with annotation  | Chromosome CON with annotation  | Chromosome CON without annotation
Scaffold CON with annotation    | Scaffold CON with annotation    | Scaffold CON without annotation


Chromosome records

Example:

LOCUS       CM000224            96160479 bp    DNA     linear   CON 10-AUG-2007
DEFINITION  Mus musculus chromosome 16, whole genome shotgun sequence.
ACCESSION   CM000224 AAHY01000000
VERSION     CM000224.2  GI:74229896
PROJECT     GenomeProject:11785
KEYWORDS    WGS.
SOURCE      Mus musculus (house mouse)


In practice, a chromosome record is usually a CONtig record: the example above carries the CON division code in its LOCUS line.
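
The header lines of the example record can be pulled apart with a few lines of Python. This is only a rough sketch: the helper names (parse_header, parse_locus, parse_version) are made up for illustration, it assumes the seven-token LOCUS layout seen in this one example, and it ignores continuation lines, which real records use for long values.

```python
import re

# Header lines copied from the example chromosome record above.
RECORD = """\
LOCUS       CM000224            96160479 bp    DNA     linear   CON 10-AUG-2007
DEFINITION  Mus musculus chromosome 16, whole genome shotgun sequence.
ACCESSION   CM000224 AAHY01000000
VERSION     CM000224.2  GI:74229896
PROJECT     GenomeProject:11785
KEYWORDS    WGS.
SOURCE      Mus musculus (house mouse)
"""


def parse_header(text):
    """Collect 'KEYWORD   value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines and indented continuation lines (not handled here).
        if not line.strip() or line[0].isspace():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


def parse_locus(value):
    """Split a LOCUS value; assumes exactly the seven tokens seen above."""
    name, length, _unit, molecule, topology, division, date = value.split()
    return {
        "name": name,
        "length": int(length),
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date,
    }


def parse_version(value):
    """Split 'ACCESSION.VERSION  GI:number' into (accession, version, gi)."""
    match = re.fullmatch(r"(\w+)\.(\d+)\s+GI:(\d+)", value)
    if match is None:
        raise ValueError(f"unexpected VERSION value: {value!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


header = parse_header(RECORD)
locus = parse_locus(header["LOCUS"])
accession, version, gi = parse_version(header["VERSION"])
print(locus["division"], accession, version, gi)
```

Reading the division from the LOCUS line is what shows this chromosome record to be a CON record.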

Contig records

Phase 1 
The definition line reads: ..., *** SEQUENCING IN PROGRESS ***, n unordered pieces, and the keywords "HTG; HTGS_PHASE1." are included in the KEYWORDS field. The sequences are found in the HTG division.
Phase 2 
The definition line reads: ..., *** SEQUENCING IN PROGRESS ***, n ordered pieces, and the keywords "HTG; HTGS_PHASE2." are included in the KEYWORDS field. The sequences are found in the HTG division.
Phase 3 
The definition line reads: ..., complete sequence. The KEYWORDS field is "HTG." but the sequences are added to the relevant organismal division (such as Primate or Invertebrate).
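
The three phases can be told apart from the KEYWORDS field alone. The function below is a minimal sketch of that rule as described here; htg_phase is a hypothetical name, and treating a bare "HTG." as phase 3 is an assumption taken from the description above, not a rule checked against every record.

```python
def htg_phase(keywords):
    """Return 1, 2 or 3 for an HTG record, or None if 'HTG' is absent.

    `keywords` is the text of the KEYWORDS field, e.g. "HTG; HTGS_PHASE1."
    """
    # Drop the trailing period, then split the ';'-separated keyword list.
    terms = {term.strip() for term in keywords.rstrip(".").split(";")}
    if "HTG" not in terms:
        return None
    if "HTGS_PHASE1" in terms:
        return 1
    if "HTGS_PHASE2" in terms:
        return 2
    # Assumption: "HTG" with no phase keyword means a finished (phase 3) record.
    return 3


for field in ("HTG; HTGS_PHASE1.", "HTG; HTGS_PHASE2.", "HTG.", "WGS."):
    print(field, "->", htg_phase(field))
```

Any additional keywords a record may carry are simply ignored by this sketch.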

Types of GenBank database

Genome Project

http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj


For example,

http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Link&LinkName=genomeprj_pubmed&from_uid=12984

High-Throughput Genomic Sequences (HTG)

The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make unfinished genomic sequence data rapidly available to the scientific community.

Information: http://www.ncbi.nlm.nih.gov/projects/HTGS/

Notes:

Phase 1 and 2 records are retrievable from the HTG division of GenBank. Phase 3 records are retrievable from the relevant organismal division (such as Primate or Invertebrate) and are included in the nr database [1].


Paper:

http://www.ncbi.nlm.nih.gov/projects/HTGS/paper.html

Whole Genome Shotgun sequencing projects

Whole Genome Shotgun Sequence Submissions 
Each WGS project is assigned a stable 4-letter WGS project_ID, which does not change as the project is updated. In addition to the WGS project_ID, the contig identifiers have a version number corresponding to a particular project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS project_ID and version number. For instance, if a project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig).
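
The numbering scheme above can be sketched in Python as follows. The function names are hypothetical, and the pattern covers only the 12-character form described here (4 letters, 2-digit version, 6-digit contig number); other accession layouts are not handled.

```python
import re

# Assumed layout, taken from the paragraph above:
# 4-letter project_ID + 2-digit assembly version + 6-digit contig number.
WGS_ACCESSION = re.compile(r"([A-Z]{4})(\d{2})(\d{6})")


def parse_wgs_accession(accession):
    """Split a WGS accession into (project_id, version, contig_number)."""
    match = WGS_ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a 12-character WGS accession: {accession!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def make_wgs_accession(project_id, version, contig_number):
    """Build the accession string; version 0 / contig 0 is the project itself."""
    return f"{project_id}{version:02d}{contig_number:06d}"


print(parse_wgs_accession("XXXX01000001"))
print(make_wgs_accession("XXXX", 1, 0))
```

With this reading, the AAHY01000000 entry on the ACCESSION line of the mouse chromosome example is version 1 of project AAHY, with the contig number left at zero.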


What To Do

  1. Register your project with the Genome Project database (Done)
  2. Submit the contigs as the WGS project. WGS projects consist of only contigs (overlapping reads), not any supercontigs (assembled contigs separated by gaps), of a sequencing project. Supercontig or assembly information can be sent to us in AGP format, which will allow us to make CON records that indicate how the pieces of the WGS submission are put together.
  3. Submit your reads to the Trace Archive database as this information is useful for the scientific community. Contact trace@ncbi.nlm.nih.gov for questions about submitting to the Trace Archive.

Trace Archive

Currently about 623 thousand 'traces' have been submitted for potato.

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&val=species_code%3D"SOLANUM TUBEROSUM"


Nucleotide / EST / GSS

Information: http://www.ncbi.nlm.nih.gov/dbGSS/index.html

Currently about 233 thousand records for potato.

For example, http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=solanum+tuberosum
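
The query string of such a search link can be assembled with Python's standard library. This sketch only builds the URL text and makes no network request; the parameter names (db, cmd, term) are copied from the link above, entrez_search_url is a hypothetical helper, and whether the address still resolves is not checked.

```python
from urllib.parse import urlencode

# Base address copied from the example link above.
BASE = "http://www.ncbi.nlm.nih.gov/sites/entrez"


def entrez_search_url(db, term):
    """Return a search URL; urlencode turns spaces in the term into '+'."""
    query = urlencode({"db": db, "cmd": "search", "term": term})
    return f"{BASE}?{query}"


print(entrez_search_url("nucest", "solanum tuberosum"))
```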

Note that BAC ends should be submitted as GSS records.