1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397
|
TBL2ASN AUTOMATED BULK SUBMISSION PROGRAM
tbl2asn is a program that automates the submission of sequence records to
GenBank. It uses many of the same functions as Sequin, but is driven
entirely by data files, and records need no additional manual editing before
submission. Entire genomes, consisting of many chromosomes with feature
annotation, can be processed in seconds using this method.
For a submission, tbl2asn expects a template file containing a text ASN.1
Submit-block object. These can be generated in Sequin and saved for use by
tbl2asn. The Submit-block contains contact information (to whom questions
on the submission can be addressed) and a submission citation (which lists
the authors who get scientific credit for the sequencing).
The template file can also contain one or more text ASN.1 Seq-descr objects
(such as Publication or BioSource) appended after the Submit-block. These
can also be generated in Sequin and saved to a file, and then appended to
the template with a text editor. They will become descriptors packaged at
the top of each submission file.
tbl2asn reads six other kinds of data files. Nucleotide sequence data is
expected in FASTA format, and these files are identified by having a .fsa
suffix. Feature table files, in the five-column format described later,
have a .tbl suffix. These can be easily generated by most genome centers
that maintain feature locations in a spreadsheet or database. For sets of
data records, a source qualifier table can be placed in a .src file. The
protein translations of CDS features can be supplied as FASTA sequences in
files with a .pep suffix. These will replace the tbl2asn-generated
conceptual translations, and can be used to verify correct CDS intervals.
The nucleotide sequence products of mRNA features can be provided as FASTA
files with a .rna suffix. Sequence quality scores can be supplied in files
with a .qvl suffix.
tbl2asn generates a .sqn file for submission to the database from these
input files.
COMMAND-LINE ARGUMENTS
To process a set of chromosomes, sets of .fsa and .tbl files (along with
optional .src, .pep, .rna, and .qvl files) are placed into a source
directory. The path to this directory is specified in the -p command-line
argument. The path for the resulting .sqn submission files is given in the
-r argument. If the -r argument is not given, the .sqn files are saved in
the source directory.
For example, if an organism has fifteen chromosomes, one would expect at
least the following files in the source directory:
chr01.fsa
chr01.tbl
chr02.fsa
...
chr14.tbl
chr15.fsa
chr15.tbl
The exact names of the files are not important, but when a file with a
suffix of .fsa is found, tbl2asn will look for a file with the same prefix
that has a .tbl suffix, and then generate a .sqn file.
The -t command-line argument specifies the template file.
Normally a single FASTA sequence per .fsa file is expected. If there are
multiple sequences, only the first is processed, unless one of two -a flag
variants are given. These are discussed below.
The -a s flag tells tbl2asn to package the multiple FASTA components as a set
of unrelated sequences. This accommodates users who create a single file
instead of one file per sequence. A single FASTA component can now have gap
indications (e.g., >?unk100) on a separate line as long as it is followed
immediately by more sequence lines, with no > and a mock identifier.
The -a d flag tells the program to make a delta sequence out of the multiple
components. This can be used for HTGS submissions where the sequence of
the BAC/PAC clone has not been completely determined. By convention, gaps
of 100 base pairs should be inserted in between the actual sequence
segments with lines containing an angle bracket '>', a question mark '?',
the letters "unk", and the length of the gap.
>?unk100
The -g flag causes tbl2asn to generate a genomic product set. Within the
set, the products of each related mRNA and CDS are packaged together in an
internal nuc-prot set. The feature table must provide reciprocal
protein_id and transcript_id qualifiers in order to correctly identify each
mRNA/CDS pair. From the resulting .sqn file, the genomic sequence, all
transcripts, and all proteins will be entered into the database and given
accessions. Note, however, that -g cannot be used for records submitted to
GenBank. It is only suitable for records going into RefSeq.
If a feature table is not given, the -k c flag tells tbl2asn to annotate the
longest Open Reading Frame (ORF) on each record. The -k m flag allows
alternative start codons to be used when finding the ORF. The protein will
be named 'unknown' unless the name is present in the .fsa file definition
line, e.g., [protein=helicase]. The two flags can be combined as -k cm.
Data records will be validated when the -V v flag is indicated. Output is
saved to files with a .val suffix. The validator checks for many things,
including internal stops in CDS features and mismatches between the CDS
translation and the supplied protein sequence. Errors need to be corrected
before submitting files to GenBank. Additional BarCode validation tests
are performed with -V c.
GenBank format output is generated when the -V b flag is used. Resulting
files have a .gbf suffix. The validation and flatfile generation flags can
be combined as -V vb.
NUCLEOTIDE SEQUENCE FORMAT
tbl2asn can read nucleotide sequences of any size in FASTA format. A FASTA
record consists of a single definition line, beginning with a '>' and
followed by optional text, and subsequent lines of sequence. At minimum,
all definition lines must contain an identifier for the sequence, called
the SeqID. The SeqID cannot begin with "assembly", as this is reserved for
entry of accession lists in Sequin. Other optional information about the
biological source of the organism can also be encoded in brackets on the
definition line. A sample definition line is
>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
Other elements include [topology=circular] and [location=mitochondrion].
Rna viruses would be indicated by [molecule=rna] and [moltype=genomic]. The
sequencing technique can be supplied as [tech=fli cDNA]. Many other source
qualifiers, such as map, clone, isolate, cell-line, and cultivar, can be used.
For organisms that are not commonly submitted with tbl2asn, the nuclear,
mitochondrial, and plastid genetic codes can be indicated by [gcode=1],
[mgcode=3], and [pgcode=11], respectively. This will ensure proper
translation of CDS features.
Primary accessions of TPA (third party annotation) records are given by
[primary=xxx,xxx,...]. Finally, a general comment can be added with
[note=xxx].
Note that the definition line must be a single line, with no return or
newline characters. Some word-processors will word-wrap text, either during
display or when saving to a file, and care must be taken to avoid unwanted
newlines introduced by the editor.
>slpy [organism=Zea mays] [chromosome=9] Sleepy transposon
TGTAAGATCACTGCTGGGTTGTTGATGAGTTGAGCACCGCTCCCGGCACCCGTCTCCTCTCACGAAGATC
TTTAGGGTATGAAAAGTATCTGGAGTTCTTACACGACGGCGAGCCGCCTCTTCTCCGGACGCAGCCGGCC
AGCCTTCTTCTCCAAGTCACCTTTTACCGACTCCAAACCCCACCTCAAATACTCCACTCAATCCAGATCA
...
Multiple SeqIDs can be indicated in FASTA-style parsable strings.
>gnl|ZGP|chr1|gb|U28041
>gi|54465|emb|X16935|MMTCRAC
FEATURE TABLE FORMAT
tbl2asn reads features from a simple five-column tab-delimited table. This
is described in more detail at
http://www.ncbi.nlm.nih.gov/Sequin/table.html
The feature table specifies the location and type of each feature, and
tbl2asn processes the feature intervals and translates any CDSs into
proteins. The first line of the table contains the following basic
information.
>Feature SeqId table_name
The SeqId must be the same as that used on the sequence. The table name is
optional. Subsequent lines of the table list the features. Columns are
separated by tabs.
The first and second columns are the start and stop locations of the
feature, respectively, the third column is the type of feature (the feature
key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g.,
"product", and the fifth is a qualifier value (e.g., the name of the protein
or gene).
A simple feature table is
>Feature sde3g
240 4084 gene
gene SDE3
240 1361 mRNA
1450 1641
1730 3184
3275 4084
product RNA helicase SDE3
579 1361 CDS
1450 1641
1730 3184
3275 3880
product RNA helicase SDE3
If a feature contains multiple intervals, each interval is listed on a
separate line by its start and stop position. Features that are on the
complementary strand are indicated by reversing the interval locations.
Locations of partial (incomplete) features are indicated with a '>' or '<'
next to the number.
Gene features are always a single interval, and their location should cover
the intervals of all the relevant features. If the gene feature spans the
intervals of the CDS or mRNA features for that gene, there is no need to
include gene qualifiers on those features in the table, since they will be
picked up by overlap. Use of the overlapping gene can be suppressed by
adding a gene qualifier with the value "-". This is important when, for
example, a tRNA is encoded within an intron of a housekeeping gene.
Translation exception qualifiers are parsed from the same style used in the
GenBank flatfile.
transl_except (pos:591..593,aa:Sec)
The codon recognized and anticodon position of tRNAs can also be given.
codon_recognized TGG
anticodon (pos:7591..7593,aa:Trp)
In addition to the standard qualifiers seen in GenBank format, several other
tokens are used to direct values to specific fields in the ASN.1 data.
These include gene_syn or gene_synonym, gene_desc, locus_tag, prot_desc,
prot_note, region_name, bond_type, and site_type.
Genomic product sets require protein_id and transcript_id qualifiers on each
mRNA and CDS feature. These are used to associate the correct pair of
features for packaging.
protein_id lcl|sde3p
transcript_id lcl|sde3m
Exceptional biological situations can be annotated by use of the exception
qualifier. For example
exception ribosomal slippage
The following are legal exception qualifier values
RNA editing
reasons given in citation
ribosomal slippage
trans-splicing
alternative processing
artificial frameshift
nonconsensus splice site
rearrangement required for product
modified codon recognition
alternative start codon
dicistronic gene
transcribed pseudogene
Since the International Nucleotide Sequence Database collaboration only
allows "RNA editing" and "reasons given in citation" to appear in release
mode, other exceptions are mapped to the /note qualifier in the flatfile.
However, each exception text string turns off specific validator tests that
would otherwise produce warning messages, so they should be entered as
exception qualifiers, not as notes.
Gene Ontology (GO) terms can be indicated with the following qualifiers
go_component endoplasmic reticulum|0005783
go_process glycolysis and gluconeogenesis|57|89197757|ACT,TEM
go_function excision repair|93||IPD
The value field is separated by vertical bars '|' into a descriptive
string, the GO identifier (leading zeroes are retained), and optionally a
PubMed ID (or GO Reference number starting with a leading 0) and one or more
GO evidence codes.
SOURCE TABLE FORMAT
For sets of sequences, a source qualifier table can optionally be placed in
a tab-delimited file with a .src extension. The first line gives the
source qualifier names, separated by tabs. The first column must be the
sequence identifier. For example
sequence_id organism strain isolate
The remaining lines each give the source qualifiers for one sequence. For
example
sde3g Zea mays A69Y JH90.6-2x12
The same information can be provided in the FASTA definition line or in the
source section of the five-column feature table.
PROTEIN SEQUENCE FORMAT
Protein sequences are FASTA files with a .pep extension that can substitute
for the translated product of a CDS feature. Supplying these files acts as
a reality check that the CDS intervals do in fact translate to the expected
protein sequence. The FASTA defline with a '>' and sequence identifier is
required.
>sde3p
MSVSUYKSDDEYSVIADKGEIGFIDYQNDGSSGCYNPFDEGPVVVSVPFPFKKEKPQSVTVGETSFDSFT
VKNTMDEPVDLWTKIYASNPEDSFTLSILKPPSKDSDLKERQCFYETFTLEDRMLEPGDTLTIWVSCKPK
...
The SeqID must match a protein_id in the .tbl file. In the table above,
the protein_id and transcript_id need to explicitly use a 'lcl|' prefix
before the SeqID string to indicate a local identifier. A local sequence
identifier is assumed when reading FASTA, but a database accession is
assumed in the feature table.
Sequin's Suggest Interval functionality, which can determine CDS intervals
from nucleotide and protein sequences plus the genetic code, is not used in
tbl2asn. Instead, the CDS intervals are required, and the supplied protein
sequence is just used to confirm proper translation.
MESSENGER RNA SEQUENCE FORMAT
mRNA sequences are FASTA files with a .rna extension that can substitute
for the transcribed product of an mRNA feature. Like the .pep files, they
act as a reality check that the supplied intervals do in fact encode the
expected mRNA sequence.
>sde3m
TTTTCATGTTTCTTCTCCTTTGAAGCCTGCCTGCGTTAGTCTGGCTTCATTGCTTCTCCATTTCTTGGTG
TGATCGAATCAAAGAGTGTAACCCATTTTGCTACTGATTCAGTACGTATGATCAATTCTCTCAATTTCAG
...
The SeqID must match the transcript_id from an mRNA feature.
QUALITY SCORES FORMAT
Phrap/Consed quality scores can be supplied in .qvl files. These generate
Seq-graph data that will be attached to the nucleotide sequence from the
.fsa file. Programs such as Sequin can display these in a graphical view.
>chr1
51 63 70 82 82 82 90 90 90 90 86 86 86 86 90 90 90 90 90 86
86 86 86 86 86 86 86 90 90 90 90 90 90 86 86 78 78 90 90 86
...
These values can be extracted from the output files of the Phrap and Consed
programs used to process raw data from automated sequencing machines.
SUBMISSION TEMPLATE FORMAT
The submission template is an ASN.1 Submit-block that can be generated by
Sequin. A simple example is shown below.
Submit-block ::= {
contact {
contact {
name
name {
last "Darwin" ,
first "Charles" ,
initials "C.R." ,
suffix "" } ,
affil
std {
affil "Oxbridge University" ,
div "Evolutionary Biology Department" ,
city "Camford" ,
country "United Kingdom" ,
street "1859 Tennis Court Lane" ,
email "darwin@beagle.edu.uk" ,
phone "01 44 171-007-1212" ,
postal-code "OX1 2BH" } } } ,
cit {
authors {
names
std {
{
name
name {
last "Darwin" ,
first "Charles" ,
initials "C.R." } } } ,
affil
std {
affil "Oxbridge University" ,
div "Evolutionary Biology Department" ,
city "Camford" ,
country "United Kingdom" ,
street "1859 Tennis Court Lane" ,
postal-code "OX1 2BH" } } ,
date
std {
year 2003 ,
month 2 ,
day 28 } } ,
subtype new }
This can be exported from the Desktop view of a template file in Sequin, or
from the initial submission dialogs. In addition, unpublished reference or
comments can also be generated in Sequin and saved from the Desktop. The
two files can be catenated to make a .sbt template with the publication or
comment descriptor after the submit block.
|