GenBank/Submission

From SEQwiki

Notes and information regarding submission of sequences to GenBank

Overview

Sequences are prepared for submission with tbl2asn. New or updated sequence files are placed on the FTP server (see below for details). Each day at 4 a.m. Eastern Standard Time (EST), files in the FTP site's SEQSUBMIT directory are moved to a processing directory, where they are examined and processed the same day. Successful submissions are released to the database immediately. Any errors found are reported to the submitting center by email. At 6 p.m. EST each day, a .ac4htgs file and a .GBFF file are placed in the FTP site's REPORT directory for each submitting center, listing the sequences that were fully processed and deposited in GenBank.

Links

Three important pages:

Tools:

Key pages:

Guidelines

The PGSC has established a few guidelines for submission to GenBank [1]. Specifically, the following conventions must be followed:

  • The keyword POTGEN must occur in the COMMENT field.
  • The BACNAME must be present in the DEFINITION field.
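
Both conventions can be checked before upload against the GenBank flat file that tbl2asn writes when run with -V b. The following is a minimal sketch, not part of the PGSC pipeline: the function name check_pgsc_conventions and the BAC name in the usage line are made up for illustration, and it assumes the usual flat-file layout in which field keywords start in column 1 and continuation lines are indented.

```shell
# Hypothetical helper: check the two PGSC conventions in a GenBank flat file.
# Usage: check_pgsc_conventions FILE BACNAME
# Returns 0 only if BACNAME occurs in DEFINITION and POTGEN occurs in COMMENT.
check_pgsc_conventions() {
  awk -v bac="$2" '
    # A top-level field keyword starts in column 1; remember the current one.
    /^[A-Z]/ { field = $1 }
    # Continuation lines are indented, so they inherit the current field.
    field == "DEFINITION" && index($0, bac)      { def_ok = 1 }
    field == "COMMENT"    && index($0, "POTGEN") { com_ok = 1 }
    END { exit !(def_ok && com_ok) }
  ' "$1"
}
```

For example, `check_pgsc_conventions RH001A01.gbf RH001A01 || echo "conventions not met"` (file and BAC names are placeholders).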

See:

See also: The BAC registry


Pipeline

See: BiO/Research/Potato/Assembly/PhredPipe/Submit/GenBank

Sequence files per BAC are prepared using the 'prepare2submit' Perl script. This script reads in a FASTA file for the contigs, picks up the associated quality file, and then trims the contigs by quality. The resulting (N-padded) FASTA sequence is written out along with the correspondingly trimmed quality file.
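
To illustrate what trimming by quality means, here is a minimal sketch of one simple approach: dropping bases from both ends until a base at or above a threshold is reached, and cutting the quality values to match. This is not the logic of prepare2submit, whose actual rules (including the N padding) are not documented here. The function name trim_by_quality and the default threshold of 20 are assumptions made for the example.

```shell
# Hypothetical helper: end-trim one sequence by quality.
# Usage: trim_by_quality SEQ "Q1 Q2 ... Qn" [MIN_Q]
# Prints the trimmed sequence, then the matching quality values.
trim_by_quality() {
  awk -v seq="$1" -v quals="$2" -v min="${3:-20}" 'BEGIN {
    n = split(quals, q, " ")
    # One quality value is expected per base.
    if (n != length(seq)) exit 1
    # Move inward from each end past bases below the threshold.
    start = 1; while (start <= n && q[start] < min) start++
    end = n;   while (end >= start && q[end] < min) end--
    # Nothing survives trimming: print nothing.
    if (end < start) exit 0
    print substr(seq, start, end - start + 1)
    line = q[start]
    for (i = start + 1; i <= end; i++) line = line " " q[i]
    print line
  }'
}
```

Only the ends are trimmed in this sketch; low-quality bases inside the retained region are left as they are.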

The ASN.1 format files are prepared for uploading into GenBank by using the tbl2asn tool. This tool combines a template file containing all the sequence meta-information with the sequence files created above. The template is generated by Sequin.

The first iteration of the processing pipeline looks like this:

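for i in "$DIR"/*.fas; do
  j=$(basename "$i" .fas); echo "$j"
  # -V vb: validate and write a GenBank flat file
  # -a r20u: runs of 20 or more Ns are treated as gaps
  ./tbl2asn -V vb -a r20u -C pgscukir -t template.sbt \
  -i "$DIR/$j.fas" \
  -j "[tech=htgs 1] [clone lib=RHPOTKEY] [strain=Diploid genotype RH89-039-16]"
done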


Updates should look roughly like the following (an update must include the accession number). I know this is horrendous, but I was in a hurry for some reason (Milborn?).

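for i in "$DIR"/*.fas; do
  j=$(basename "$i" .fas); echo "$j"
  # Pull the accession (without its version suffix) from the acceptance report
  k=$(grep accession \
       "$DIR/Accepted/pgscukir20081211.$j.sqn.ac4htgs" | \
         perl -ne 'die unless /(AC\d+)\.\d/; print "$1\n"')
  # Skip this BAC rather than call tbl2asn with an empty -A argument
  [ -n "$k" ] || { echo "no accession found for $j" >&2; continue; }
  ./tbl2asn -V vb -a r20u -C pgscukir -t template.sbt \
  -A "$k" \
  -i "$DIR/$j.fas" \
  -j "[tech=htgs 1] [clone lib=RHPOTKEY] [strain=Diploid genotype RH89-039-16]"
done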
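
The accession lookup in the loop above could be pulled out into a small function so that it can be tested on its own. The following is a sketch only: the name accession_from_report is invented here, and it assumes nothing about the .ac4htgs format beyond what the loop already relies on, namely a line containing the word "accession" and an AC number with a version suffix.

```shell
# Hypothetical helper: print the first AC accession, without its version,
# found on an "accession" line of a report file.
# Returns non-zero if no accession is found.
accession_from_report() {
  acc=$(grep accession "$1" |
        sed -n 's/.*\(AC[0-9][0-9]*\)\.[0-9].*/\1/p' |
        head -n 1)
  [ -n "$acc" ] && printf '%s\n' "$acc"
}
```

In the loop this would be used as `k=$(accession_from_report "$DIR/Accepted/pgscukir20081211.$j.sqn.ac4htgs") || continue`.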

Finally, all the SQN files are transferred to the GenBank FTP site.

Upload site: ftp-private.ncbi.nih.gov

U:pgscukir
P:zrBRR8tx
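
The transfer itself can be scripted with curl. The following is a rough sketch, not the command actually used by the PGSC: the function name upload_sqn and the NCBI_USER and NCBI_PASS environment variables are made up for the example, and the SEQSUBMIT path on the server is assumed from the overview above. Reading the login from the environment keeps it out of the script, and setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Hypothetical helper: upload SQN files to the submission directory.
# Usage: upload_sqn FILE...
# NCBI_USER and NCBI_PASS are assumed to be set in the environment.
upload_sqn() {
  # Directory name assumed; the trailing slash makes curl keep the local file name.
  url="ftp://ftp-private.ncbi.nih.gov/SEQSUBMIT/"
  for f in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      # Dry run: show what would be uploaded, without the credentials.
      echo "curl -T $f $url"
    else
      curl --user "$NCBI_USER:$NCBI_PASS" -T "$f" "$url" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1; upload_sqn "$DIR"/*.sqn` lists the uploads that would be made.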


See also