File: Formats

package info (click to toggle)
readseq 1-14
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, bullseye
  • size: 736 kB
  • sloc: ansic: 9,340; makefile: 426; sh: 32
file content (996 lines) | stat: -rw-r--r-- 39,652 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
||||||||||| ReadSeq supported formats   (revised 30Dec92)
--------------------------------------------------------

    -f[ormat=]Name Format name for output:
         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP
         9. Zuker (in-only)       18. Pretty (out-only)

In general, output supports only minimal subsets of each format
needed for sequence data exchanges.  Features, descriptions
and other format-unique information is discarded.

Users of Olsen multi sequence editor (VMS).  The Olsen format
here is produced with the print command:
  print/out=some.file
Use Genbank output from readseq to produce a format that this
editor can read, and use the command
  load/genbank some.file
Dan Davison has a VMS program that will convert to/from the
Olsen native binary data format.  E-mail davison@uh.edu

Warning: Phylip format input is now supported (30Dec92), however the
auto-detection of Phylip format is very probabilistic and messy,
especially distinguishing sequential from interleaved versions. It
is not recommended that one use readseq to convert files from Phylip
format to others unless essential.



||||||||||| ReadSeq usage             (revised 11Nov91)
--------------------------------------------------------

A. determine file format:

        short skiplines;  /* result: number of header lines to skip (or 0) */
        short error;      /* error result or 0 */
        short format;     /* resulting format code, see ureadseq.h */
        char  *filename   = "Mysequence.file"

        format = seqFileFormat( filename, &skiplines, &error);
        if (error!=0) fail;

B. read number and list of sequences (optional)
        short numseqs;    /* resulting number of sequences found in file */
        char  *seqlist;   /* list of sequence names, newline separated, 0 terminated */

        seqlist = listSeqs( filename, skiplines, format, &numseqs, &error);
        if (error!=0)  display (seqlist);
        free( seqlist);

C.  read individual sequences as desired
        short seqIndex;   /* sequence index #, or == kListSeqs for listSeqs equivalent */
        long  seqlen;     /* length of seq */
        char  seqid[256]; /* sequence name */
        char  *seq;       /* sequence, 0 terminated, free when done */

        seq = readSeq( seqIndex, filename, skiplines, format,
                      &seqlen, &numseqs, &error, seqid);
        if (error!=0) manipulate(seq);
        free(seq);

D. write sequences as desired
        int nlines;     /* number of lines of sequence written */
        FILE* fout;     /* open file pointer (stdout or other) */
        short outform;  /* output format, see ureadseq.h */

        nlines = writeSeq( fout, seq, seqlen, format, outform, seqid);


Note (30Dec92): There is various processing done by the main program (in readseq.c),
  rather than just in the subroutines (in ureadseq.c).  Especially for interleaved
  output formats, the writeSeq subroutine does not handle interleaving, nor some of
  the formatting at the top and end of output files.  While seqFileFormat, listSeqs,
  and readSeq subroutines are fairly self-contained, the writeSeq depends a lot on
  auxilliary processing.  At some point, this may be revised so writeSeq is self-
  contained.

Note 2: The NCBI toolkit (ftp from ncbi.nlm.nih.gov) is needed for the ASN.1 format
  reading (see ureadasn.c).  A bastard (but workable I hope) ASN.1 format is written
  by writeSeq alone.



|||||||||||  sequence formats....
---------------------------------------------------

stanford/IG
;comments
;...
seq1 info
abcd...
efgh1 (or 2 = terminator)
;another seq
;....
seq2 info
abcd...1
--- for e.g. ----
;     Dro5s-T.Seq  Length: 120  April 6, 1989  21:22  Check: 9487  ..
dro5stseq
GCCAACGACCAUACCACGCUGAAUACAUCGGUUCUCGUCCGAUCACCGAAAUUAAGCAGCGUCGCGGGCG
GUUAGUACUUAGAUGGGGGACCGCUUGGGAACACCGCGUGUUGUUGGCCU1

;  TOIG of: Dro5srna.Seq  check: 9487  from: 1  to: 120
---------------------------------------------------

Genbank:
LOCUS    seq1 ID..
...
ORIGIN ...
123456789abcdefg....(1st 9 columns are formatting)
     hijkl...
//         (end of sequence)
LOCUS     seq2 ID ..
...
ORIGIN
      abcd...
//
---------------------------------------------------

NBRF format: (from uwgcg ToNBRF)
>DL;DRO5SRNA
Iubio$Dua0:[Gilbertd.Gcg]Dro5srna.Seq;2 => DRO5SRNA

      51  AAUUAAGCAG CGUCGCGGGC GGUUAGUACU UAGAUGGGGG ACCGCUUGGG
     101  AACACCGCGU GUUGUUGGCC U

---------------------------------------------------

EMBL format
ID345 seq1 id   (the 345 are spaces)
... other info
SQ345Sequence   (the 3,4,5 are spaces)
abcd...
hijk...
//              (! this is proper end string: 12Oct90)
ID    seq2 id
...
SQ   Sequence
abcd...
...
//
---------------------------------------------------

UW GCG Format:
comments of any form, up to ".." signal
signal line has seq id, and " Check: ####   .."
only 1 seq/file

-- e.g. --- (GCG from GenBank)
LOCUS       DROEST6      1819 bp ss-mRNA            INV       31-AUG-1987
    ... much more ...
ORIGIN      1 bp upstream of EcoRI site; chromosome BK9 region 69A1.

INVERTEBRATE:DROEST6  Length: 1819  January 9, 1989  16:48  Check: 8008  ..

       1  GAATTCGCCG GAGTGAGGAG CAACATGAAC TACGTGGGAC TGGGACTTAT

      51  CATTGTGCTG AGCTGCCTTT GGCTCGGTTC GAACGCGAGT GATACAGATG


---------------------------------------------------

DNAStrider (Mac) = modified Stanford:
; ### from DNA Strider  Friday, April 7, 1989   11:04:24 PM
; DNA sequence  pBR322   4363  b.p. complete sequence
;
abcd...
efgh
//  (end of sequence)
---------------------------------------------------

Fitch format:
Dro5srna.Seq
 GCC AAC GAC CAU ACC ACG CUG AAU ACA UCG GUU CUC GUC CGA UCA CCG AAA UUA AGC AGC
 GUC GCG GGC GGU UAG UAC UUA GAU GGG GGA CCG CUU GGG AAC ACC GCG UGU UGU UGG CCU
Droest6.Seq
 GAA TTC GCC GGA GTG AGG AGC AAC ATG AAC TAC GTG GGA CTG GGA CTT ATC ATT GTG CTG
 AGC TGC CTT TGG CTC GGT TCG AAC GCG AGT GAT ACA GAT GAC CCT CTG TTG GTG CAG CTG
---------------------------------------------------

W.Pearson/Fasta format:
>BOVPRL GenBank entry BOVPRL from omam file.  907 nucleotides.
TGCTTGGCTGAGGAGCCATAGGACGAGAGCTTCCTGGTGAAGTGTGTTTCTTGAAATCAT

---------------------------------------------------
Phylip version 3.2 format (e.g., DNAML):

   5   13 YF                (# seqs, #bases, YF)
Alpha     AACGTGGCCAAAT
          aaaagggccc...  (continued sp. alpha)
Beta      AAGGTCGCCAAAC
          aaaagggccc...  (continued sp. beta)
Gamma     CATTTCGTCACAA
          aaaagggccc...  (continued sp. Gamma)
1234567890^-- bases must start in col 11, and run 'til #bases 
        (spaces & newlines are okay)
---------------------------------------------------
Phylip version 3.3 format (e.g., DNAML):

  5    42  YF             (# seqs, #bases, YF)
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
1234567890^-- bases must start in col 11
  !! this version interleaves the species -- contrary to
     all other output formats.

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

---------------------------------------------------
Phylip version 3.4 format (e.g., DNAML)
-- Both Interleaved and sequential are permitted

   5   13                (# seqs, #bases)
Alpha     AACGTGGCCAAAT
          aaaagggccc...  (continued sp. alpha)
Beta      AAGGTCGCCAAAC
          aaaagggccc...  (continued sp. beta)
Gamma     CATTTCGTCACAA
          aaaagggccc...  (continued sp. Gamma)
1234567890^-- bases must start in col 11, and run 'til #bases 
        (spaces, newlines and numbers are are ignored)

---------------------------------------------------
Gary Olsen (multiple) sequence editor /print format:

!---------------------
!17Oct91 -- error in original copy of olsen /print format, shifted right 1 space
! here is correct copy:
  301  40 Tb.thiop  CGCAGCGAAA----------GCUNUGCUAAUACCGCAUA-CGnCCUG-----------------------------------------------------  Tb.thiop
123456789012345678901
  301  42 Rhc.purp  CGUAGCGAAA----------GUUACGCUAAUACCGCAUA-UUCUGUG-----------------------------------------------------  Rhc.purp

  301  44 Rhc.gela  nnngnCGAAA----------GCCGGAUUAAUACCGCAUA-CGACCUA-----------------------------------------------------  Rhc.gela
!---------------------

 RNase P RNA components.  on 20-FEB-90 17:23:58

    1 (E.c. pr ):  Base pairing in Escherichia coli RNase P RNA.
    2 (chrom   ):  Chromatium
      :
   12 (B.brevis):  Bacillus brevis RNase P RNA, B. James.
   13 ( 90% con):   90% conserved
   14 (100% con):  100% conserved
   15 (gram+ pr):  pairing

1
 RNase P RNA components.  on 20-FEB-90 17:23:58

 Posi-   Sequence
 tion:   identity:   Data:

     1   1 E.c. pr      <<<<<<<<<< {{{{{{{{<<:<<<<<<<<<<^<<<<<<====>>>>  E.c. pr
     1   2 chrom        GGAGUCGGCCAGACAGUCGCUUCCGUCCU------------------  chrom
            :
     1  12 B.brevis  AUGCAGGAAAUGCGGGUAGCCGCUGCCGCAAUCGUCU-------------  B.brevis
1234567890123456789012 <! this should be 21 not 22,
! this example must be inset on left by 1 space from olsen /print files !
     1  13  90% con           G  C G  A  CGC GC               -    -      90% con
     1  14 100% con                G  A  CGC                             100% con
     1  15 gram+ pr     <<<<<<<<<< {{{{{{{{<<<<<<<<<<<<<===============  gram+ pr

    60   1 E.c. pr   >>>>>>^>>^>>>>:>>    <<<^<<<< {{{{{                 E.c. pr
    60   2 chrom     -----GGUG-ACGGGGGAGGAAAGUCCGG-GCUCCAU-------------  chrom
    :       :
    60  10 B.stearo  ----UU-CG-GCCGUAGAGGAAAGUCCAUGCUCGCACGGUGCUGAGAUGC  B.stearo


---------------------------------------------------
  GCG MSF format
Title line

picorna.msf  MSF: 100  Type: P  January 17, 1991  17:53  Check: 541
..
Name: Cb3              Len:   100  Check: 7009  Weight:  1.00
Name: E                Len:   100  Check:   60  Weight:  1.00

//

   1                                                   50
Cb3  ...gpvedai .......t.. aaigr..vad tvgtgptnse aipaltaaet
  E  gvenae.kgv tentna.tad fvaqpvylpe .nqt...... kv.affynrs

   51                                                 100

Cb3  ghtsqvvpgd tmqtrhvkny hsrsestien flcrsacvyf teykn.....
  E  ...spi.gaf tvks...... gs.lesgfap .fsngtc.pn sviltpgpqf

---------------------------------------------------
     PIR format
This is NBRF-PIR MAILSERVER version 1.45
Command-> get PIR3:A31391
\\\
ENTRY           A31391       #Type Protein
TITLE           *Esterase-6 - Fruit fly (Drosophila melanogaster)

DATE            03-Aug-1992 #Sequence 03-Aug-1992 #Text 03-Aug-1992
PLACEMENT          0.0    0.0    0.0    0.0    0.0
COMMENT         *This entry is not verified.
SOURCE          Drosophila melanogaster

REFERENCE
   #Authors     Cooke P.H., Oakeshott J.G.
   #Citation    submitted to GenBank, April 1989
   #Reference-number A31391
   #Accession   A31391
   #Cross-reference GB:J04167

SUMMARY       #Molecular-weight 61125  #Length 544  #Checksum  1679
SEQUENCE
                5        10        15        20        25        30
      1 M N Y V G L G L I I V L S C L W L G S N A S D T D D P L L V
     31 Q L P Q G K L R G R D N G S Y Y S Y E S I P Y A E P P T G D
     61 L R F E A P E P Y K Q K W S D I F D A T K T P V A C L Q W D
     91 Q F T P G A N K L V G E E D C L T V S V Y K P K N S K R N S
    121 F P V V A H I H G G A F M F G A A W Q N G H E N V M R E G K
    151 F I L V K I S Y R L G P L G F V S T G D R D L P G N Y G L K
    181 D Q R L A L K W I K Q N I A S F G G E P Q N V L L V G H S A
    211 G G A S V H L Q M L R E D F G Q L A R A A F S F S G N A L D
    241 P W V I Q K G A R G R A F E L G R N V G C E S A E D S T S L
    271 K K C L K S K P A S E L V T A V R K F L I F S Y V P F A P F
    301 S P V L E P S D A P D A I I T Q D P R D V I K S G K F G Q V
    331 P W A V S Y V T E D G G Y N A A L L L K E R K S G I V I D D
    361 L N E R W L E L A P Y L L F Y R D T K T K K D M D D Y S R K
    391 I K Q E Y I G N Q R F D I E S Y S E L Q R L F T D I L F K N
    421 S T Q E S L D L H R K Y G K S P A Y A Y V Y D N P A E K G I
    451 A Q V L A N R T D Y D F G T V H G D D Y F L I F E N F V R D
    481 V E M R P D E Q I I S R N F I N M L A D F A S S D N G S L K
    511 Y G E C D F K D N V G S E K F Q L L A I Y I D G C Q N R Q H
    541 V E F P
///
\\\
---------------------------------------------------
PAUP format:
The NEXUS Format

Every block starts with "BEGIN blockname;" and ends with "END;".
Each block is composed of one or more statements, each
terminated by a semicolon (;).

Comments may be included in NEXUS files by enclosing them within
square brackets, as in "[This is a comment]."

NEXUS-conforming files are identified by a "#NEXUS" directive at
the very beginning of the file (line 1, column 1).  If the
#NEXUS is omitted PAUP issues a warning but continues
processing.

NEXUS files are entirely free-format.  Blanks, tabs, and
newlines may be placed anywhere in the file.  Unless RESPECTCASE
is requested, commands and data may be entered in upper case,
lower case, or a mixture of upper and lower case.

The following conventions are used in the syntax descriptions of
the various blocks.  Upper-case items are entered exactly as
shown.  Lower-case items inside of angle brackets -- e.g., <x>
-- represent items to be substituted by the user.  Items inside
of square brackets -- e.g., [X] -- are optional.  Items inside
of curly braces and separated by vertical bars -- e.g.,  { X | Y
| Z } -- are mutually exclusive options.


The DATA Block

The DATA block contains the data matrix and other associated
information.  Its syntax is:

BEGIN DATA;
DIMENSIONS NTAX=<number of taxa> NCHAR=<number of characters>;
  [ FORMAT  [ MISSING=<missing-symbol> ]
        [ LABELPOS={ LEFT | RIGHT } ]
        [ SYMBOLS="<symbols-list>" ]
        [ INTERLEAVE ]
        [ MATCHCHAR=<match-symbol> ]
        [ EQUATE="<symbol>=<expansion> [<symbol>=<expansion>...]" ]
        [ TRANSPOSE ]
        [ RESPECTCASE ]
        [ DATATYPE = { STANDARD | DNA | RNA | PROTEIN } ]; ]
        [ OPTIONS [ IGNORE={ INVAR | UNINFORM } ]
        [ MSTAXA = { UNCERTAIN | POLYMORPH | VARIABLE } ]
        [ ZAP = "<list of zapped characters>" ] ; ]
  [ CHARLABELS <label_1> label_2> <label_NCHAR> ; ]
  [ TAXLABELS <label1_1> <label1_2> <label1_NTAX> ; ]
  [ STATELABELS <currently ignored by PAUP> ; ]
  MATRIX <data-matrix> ;
  END;

--- example PAUP file

#NEXUS

[!Brown et al. (1982) primate mitochondrial DNA]

begin data;
  dimensions ntax=5 nchar=896;
  format datatype=dna matchchar=. interleave missing='-';
  matrix
[                              2                    4                    6            8                    ]
[         1                    1                    1                    1            1                    ]
human     aagcttcaccggcgcagtca ttctcataatcgcccacggR cttacatcctcattactatt ctgcctagcaaactcaaact acgaacgcactcacagtcgc
chimp     ................a.t. .c.................a ...............t.... ..................t. .t........c.........
gorilla   ..................tg ....t.....t........a ........a......t.... .................... .......a..c.....c...
orang     ................ac.. cc.....g..t.....t..a ..c........cc....g.. .................... .......a..c.....c...
gibbon    ......t..a..t...ac.g .c.................a ..a..c..t..cc.g..... ......t............. .......a........c...

[         8                    8                    8                    8            8              8     ]
[         0                    2                    4                    6            8              9     ]
[         1                    1                    1                    1            1              6     ]
human     cttccccacaacaatattca tgtgcctagaccaagaagtt attatctcgaactgacactg agccacaacccaaacaaccc agctctccctaagctt
chimp     t................... .a................c. ........a.....g..... ...a................ ................
gorilla   ..................tc .a................c. ........a.g......... ...a.............tt. .a..............
orang     ta....a...........t. .c.......ga......acc ..cg..a.a......tg... .a.a..c.....g...cta. .a.....a........
gibbon    a..t.......t........ ....ac...........acc .....t..a........... .a.tg..........gctag .a..............
  ;
end;
---------------------------------------------------






|||||||||||  Sample SMTP mail header
---------------------------------------------------

- - - - - - - - -
From GenBank-Retrieval-System@genbank.bio.net Sun Nov 10 17:28:56 1991
Received: from genbank.bio.net by sunflower.bio.indiana.edu
        (4.1/9.5jsm) id AA19328; Sun, 10 Nov 91 17:28:55 EST
Received: by genbank.bio.net (5.65/IG-2.0)
        id AA14458; Sun, 10 Nov 91 14:30:03 -0800
Date: Sun, 10 Nov 91 14:30:03 -0800
Message-Id: <9111102230.AA14458@genbank.bio.net>
From: Database Server <GenBank-Retrieval-System@genbank.bio.net>
To: gilbertd@sunflower.bio.indiana.edu
Subject: Results of Query for drorna
Status: R

No matches on drorna.
- - - - - -
From GenBank-Retrieval-System@genbank.bio.net Sun Nov 10 17:28:49 1991
Received: from genbank.bio.net by sunflower.bio.indiana.edu
        (4.1/9.5jsm) id AA19323; Sun, 10 Nov 91 17:28:47 EST
Received: by genbank.bio.net (5.65/IG-2.0)
        id AA14461; Sun, 10 Nov 91 14:30:03 -0800
Date: Sun, 10 Nov 91 14:30:03 -0800
Message-Id: <9111102230.AA14461@genbank.bio.net>
From: Database Server <GenBank-Retrieval-System@genbank.bio.net>
To: gilbertd@sunflower.bio.indiana.edu
Subject: Results of Query for droest6
Status: R

LOCUS       DROEST6      1819 bp ss-mRNA            INV       31-AUG-1987
DEFINITION  D.melanogaster esterase-6 mRNA, complete cds.
ACCESSION   M15961












|||||||||||  GCG manual discussion of sequence symbols:
---------------------------------------------------

III_SEQUENCE_SYMBOLS


     GCG programs allow all upper and lower  case  letters,  periods  (.),
asterisks  (*),  pluses  (+),  ampersands  (&),  and ats (@) as symbols in
biological sequences.  Nucleotide  symbols,  their  complements,  and  the
standard  one-letter amino acid symbols are shown below in separate lists.
The meanings of the symbols +, &, and @ have not  been  assigned  at  this
writing (March, 1989).

     GCG uses the  letter  codes  for  amino  acid  codes  and  nucleotide
ambiguity    proposed    by    IUB    (Nomenclature    Committee,    1985,
Eur. J. Biochem. 150; 1-5).  These codes are  compatible  with  the  codes
used by the EMBL, GenBank, and NBRF data libraries.


                               NUCLEOTIDES

     The meaning of each symbol, its complement,  and  the  Cambridge  and
Stanford  equivalents  are  shown below.  Cambridge files can be converted
into GCG files and vice versa with the programs FROMSTADEN  and  TOSTADEN.
IntelliGenetics  sequence  files  can  be interconverted with the programs
FROMIG and TOIG.

IUB/GCG      Meaning     Complement   Staden/Sanger  Stanford

   A             A             T             A            A
   C             C             G             C            C
   G             G             C             G            G
  T/U            T             A             T           T/U
   M           A or C          K             5            J
   R           A or G          Y             R            R
   W           A or T          W             7            L
   S           C or G          S             8            M
   Y           C or T          R             Y            Y
   K           G or T          M             6            K
   V        A or C or G        B       not supported      N
   H        A or C or T        D       not supported      N
   D        A or G or T        H       not supported      N
   B        C or G or T        V       not supported      N
  X/N     G or A or T or C     X            -/X           N
   .    not G or A or T or C   .       not supported      ?


  The frame ambiguity codes used by Staden are not  supported  by  GCG
and   are  translated  by  FROMSTADEN  as  the  lower  case  single  base
equivalent.

     Staden Code          Meaning              GCG

         D                C or CC                c
         V                T or TT                t
         B                A or AA                a
         H                G or GG                g
         K                C or CX                c
         L                T or TX                t
         M                A or AX                a
         N                G or GX                g


                        AMINO ACIDS

  Here is a list of the standard one-letter amino acid codes and their
three-letter  equivalents.   The synonymous codons and their depiction in
the IUB codes are shown.  You should recognize that the codons  following
semicolons  (;)  are  not  sufficiently specific to define a single amino
acid even though they represent the best possible back  translation  into
the IUB codes!  All of the relationships in this list can be redefined by
the user in a local data file described below.

                                                      IUB
Symbol 3-letter  Meaning      Codons                Depiction
 A    Ala       Alanine      GCT,GCC,GCA,GCG         !GCX
 B    Asp,Asn   Aspartic,
                Asparagine   GAT,GAC,AAT,AAC         !RAY
 C    Cys       Cysteine     TGT,TGC                 !TGY
 D    Asp       Aspartic     GAT,GAC                 !GAY
 E    Glu       Glutamic     GAA,GAG                 !GAR
 F    Phe     Phenylalanine  TTT,TTC                 !TTY
 G    Gly       Glycine      GGT,GGC,GGA,GGG         !GGX
 H    His       Histidine    CAT,CAC                 !CAY
 I    Ile       Isoleucine   ATT,ATC,ATA             !ATH
 K    Lys       Lysine       AAA,AAG                 !AAR
 L    Leu       Leucine      TTG,TTA,CTT,CTC,CTA,CTG
!TTR,CTX,YTR;YTX
 M    Met       Methionine   ATG                     !ATG
 N    Asn       Asparagine   AAT,AAC                 !AAY
 P    Pro       Proline      CCT,CCC,CCA,CCG         !CCX
 Q    Gln       Glutamine    CAA,CAG                 !CAR
 R    Arg       Arginine     CGT,CGC,CGA,CGG,AGA,AGG
!CGX,AGR,MGR;MGX
 S    Ser       Serine       TCT,TCC,TCA,TCG,AGT,AGC !TCX,AGY;WSX
 T    Thr       Threonine    ACT,ACC,ACA,ACG         !ACX
 V    Val       Valine       GTT,GTC,GTA,GTG         !GTX
 W    Trp       Tryptophan   TGG                     !TGG
 X    Xxx       Unknown                              !XXX
 Y    Tyr       Tyrosine     TAT, TAC                !TAY
 Z    Glu,Gln   Glutamic,
                Glutamine    GAA,GAG,CAA,CAG         !SAR
 *    End       Terminator   TAA, TAG, TGA           !TAR,TRA;TRR








|||||||||||  docs from PSC on sequence formats:
---------------------------------------------------


          Nucleic Acid and Protein Sequence File Formats


It will probably save you some time if you have your data in a usable
format before you send it to us.  However, we do have the University of
Wisconsin Genetics Computing Group programs running on our VAXen and
this package includes several reformatting utilities.  Our programs
usually recognize any of several standard formats, including GenBank,
EMBL, NBRF, and MolGen/Stanford.  For the purposes of annotating an
analysis we find the GenBank and EMBL formats most useful, particularly
if you have already received an accession number from one of these
organizations for your sequence.

Our programs do not require that all of the line types available in
GenBank, EMBL, or NBRF file formats be present for the file format to
be recognized and processed.  The following pages outline the essential
details required for correct processing of files by our programs.
Additional information may be present but will generally be ignored.


                      GenBank File Format

File Header

1.  The first line in the file must have "GENETIC SEQUENCE DATA BANK"
    in spaces 20 through 46 (see LINE  1, below).
2.  The next 8 lines may contain arbitrary text.  They are ignored but
    are required to maintain the GenBank format (see LINE 2 - LINE 9).

Sequence Data Entries

3.  Each sequence entry in the file should have the following format.
    a) first line:   Must have LOCUS in the first 5 spaces.  The
                     genetic locus name or identifier must be in spaces
                     13 - 22.  The length of the sequences is right
                     justified in spaces 23 through 29 (see LINE  10).
    b) second line:  Must have DEFINITION in the first 10 spaces.
                     Spaces 13 - 80 are free form text to identify the
                     sequence (see LINE  11).
    c) third line:   Must have ACCESSION in the first 9 spaces.  Spaces
                     13 - 18 must hold the primary accession number
                     (see LINE  12).
    d) fourth line:  Must have ORIGIN in the first 6 spaces.  Nothing
                     else is required on this line, it indicates that
                     the nucleic acid sequence begins on the next line
                     (see LINE  13).
    e) fifth line:   Begins the nucleotide sequence.  The first 9
                     spaces of each sequence line may either be blank
                     or may contain the position in the sequence of the
                     first nucleotide on the line.  The next 66 spaces
                     hold the nucleotide sequence in six blocks of ten
                     nucleotides.  Each of the six blocks begins with a
                     blank space followed by ten nucleotides.  Thus the
                     first nucleotide is in space eleven of the line while
                     the last is in space 75 (see LINE  14, LINE  15).
    f) last line:    Must have // in the first 2 spaces to indicate
                     termination of the sequence (see LINE  16).

NOTE:  Multiple sequences may appear in each file.  To begin another
       sequence go back to a) and start again.


                         Example GenBank file


LINE  1  :                   GENETIC SEQUENCE DATA BANK
LINE  2  :
LINE  3  :
LINE  4  :
LINE  5  :
LINE  6  :
LINE  7  :
LINE  8  :
LINE  9  :
LINE 10  :LOCUS       L_Name     Length BP
LINE 11  :DEFINITION  Describe the sequence any way you want
LINE 12  :ACCESSION   Accession Number
LINE 13  :ORIGIN
LINE 14  :        1 acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt a...
LINE 15  :       61 acgt...
LINE 16  ://



                         EMBL File Format

Unlike the GenBank file format the EMBL file format does not require
a series of header lines.  Thus the first line in the file begins
the first sequence entry of the file.

1.  The first line of each sequence entry contains the two letters ID
    in the first two spaces.  This is followed by the EMBL identifier
    in spaces 6 through 14.  (See LINE  1).

2.  The second line of each sequence entry has the two letters AC in
    the first two spaces.  This is followed by the accession number in
    spaces 6 through 11.  (See LINE  2).

3.  The third line of each sequence entry has the two letters DE in the
    first two spaces.  This is followed by a free form text definition
    in spaces 6 through 72.  (See LINE  3).

4.  The fourth line in each sequence entry has the two letters SQ in
    the first two spaces.  This is followed by the length of the
    sequence beginning at or after space 13.  After the sequence length
    there is a blank space and the two letters BP.  (See LINE  4).

5.  The nucleotide sequence begins on the fifth line of the sequence
    entry.  Each line of sequence begins with four blank spaces. The
    next 66 spaces hold the nucleotide sequence in six blocks of ten
    nucleotides.  Each of the six blocks begins with a blank space
    followed by ten nucleotides.  Thus the first nucleotide is in space
    6 of the line while the last is in space 70.  (See LINE  5 -
    LINE  6).

6.  The last line of each sequence entry in the file is a terminator
    line which has the two characters // in the first two spaces.
    (See LINE  7).

7.  Multiple sequences may appear in each file.  To begin another
    sequence go back to item 1 and start again.


                          Example EMBL file

LINE  1  :ID   ID_name
LINE  2  :AC   Accession number
LINE  3  :DE   Describe the sequence any way you want
LINE  4  :SQ          Length BP
LINE  5  :     ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTA...
LINE  6  :     ACGT...
LINE  7  ://



            NBRF (protein or nucleic acid) File Format

1.  The first line of each sequence entry begins with a greater than
  symbol, >.  This is immediately followed by the two character
  sequence type specifier.  Space four must contain a semi-colon.
  Beginning in space five is the sequence name or identification code
  for the NBRF database.  The code is from four to six letters and
  numbers.  (See LINE  1).

!!!! >> add these to readseq
          Specifier             Sequence type

              P1                protein, complete
              F1                protein, fragment
              DL                DNA, linear
              DC                DNA, circular
              RL                RNA, linear
              RC                RNA, circular
              N1                functional RNA, other than tRNA
              N3                tRNA

2.  The second line of each sequence entry contains two kinds of
  information.  First is the sequence name which is separated from
  the organism or organelle name by the three character sequence
  blank space, dash, blank space, " - ".  There is no special
  character marking the beginning of this line.  (See LINE  2).

3.  Either the amino acid or nucleic acid sequence begins on line three
  and can begin in any space, including the first.  The sequence is
  free format and may be interrupted by blanks for ease of reading.
  Protein sequences man contain special punctuation to indicate
  various indeterminacies in the sequence.  In the NBRF data files
  all lines may be up to 500 characters long.  However some PSC
  programs currently have a limit of 130 characters per line
  (including blanks), and BitNet will not accept lines of over eighty
  characters.  (See LINE  3, LINE  4, and LINE  5).

  The last character in the sequence must be an asterisks, *.

                       Example NBRF file

 LINE  1  :>P1;CBRT
 LINE  2  :Cytochrome b - Rat mitochondrion (SGC1)
 LINE  3  :M T N I R K S H P L F K I I N H S F I D L P A P S
 LINE  4  : VTHICRDVN Y GWL IRY
 LINE  5  :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*



                MolGen/Stanford File Format

1.  The first line in a sequence file is a comment line.  This line
  begins with a semi-colon in the first space.  This line need
  not be present.  If it is present it holds descriptive text.
  There may be as many comment lines as desired at the first of
  sequence file.  (See LINE  1).

2.  The second line must be present and contains an identifier or
  name for the sequence in the first ten spaces.  (See LINE  2).

3.  The sequence begins on the third line and occupies up to eighty
  spaces.  Spaces may be included in the sequence for ease of
  reading.  The sequence continues for as many line as needed
  and is terminated with a 1 or 2.  1 indicates a linear sequence
  while 2 marks a circular sequence.  (See LINE  3 and LINE  4).

                          Example MolGen/Stanford file

LINE  1  :;  Describe the sequence any way you want
LINE  2  :ECTRNAGLY2
LINE  3  :ACGCACGTAC ACGTACGTAC   A C G T C C G T ACG TAC GTA CGT
LINE  4  :  GCTTA   GG G C T A1




|||||||||||  Phylip file format
---------------------------------------------------

        Phylip 3.3 File Format (DNA sequences)


     The input and output formats for PROTPARS and for RESTML are described  in
their  document  files.   In  general  their input formats are similar to those
described here, except that the one-letter codes for data are specific to those
programs  and  are  described in those document files.  Since the input formats
for the eight DNA sequence programs apply to  all  eight,  they  are  described
here.   Their  input  formats are standard: the data have A's, G's, C's and T's
(or U's).  The first line of the input file contains the number of species  and
the  number  of  sites.   As  with  the other programs, options information may
follow this.  In the case of DNAML, DNAMLK,  and  DNADIST  an  additional  line
(described  in  the  document file for these pograms) may follow the first one.
Following this, each species starts on a new line.  The first 10 characters  of
that  line  are the species name.  There then follows the base sequence of that
species, each character being one of the letters A, B, C, D, G, H, K, M, N,  O,
R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is
no longer allowed, because it sometimes is used to in aligned sequences to mean
"the  same  as  the  sequence  above").   Blanks  will  be ignored, and so will
numerical digits.  This allows GENBANK and EMBL sequence  entries  to  be  read
with minimum editing.

     These characters can be  either  upper  or  lower  case.   The  algorithms
convert  all  input  characters  to upper case (which is how they are treated).
The characters constitute the IUPAC (IUB) nucleic acid code  plus  some  slight
extensions.  They enable input of nucleic acid sequences taking full account of
any ambiguities in the sequence.

The sequences can continue over multiple lines; when this is done the sequences
must  be  either  in  "interleaved"  format, similar to the output of alignment
programs, or "sequential" format.  These are described  in  the  main  document
file.   In sequential format all of one sequence is given, possibly on multiple
lines, before the next starts.  In interleaved format the  first  part  of  the
file  should  contain  the first part of each of the sequences, then possibly a
line containing nothing but a carriage-return character, then the  second  part
of  each  sequence, and so on.  Only the first parts of the sequences should be
preceded by names.  Here is a hypothetical example of interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA


Note, of course, that a portion of a sequence like this:

   300   AAGCGTGAAC GTTGTACTAA TRCAG

is perfectly legal, assuming that the species name  has  gone  before,  and  is
filled  out  to  full  length  by  blanks.  The above digits and blanks will be
ignored, the sequence being taken as starting at the first base symbol (in this
case an A).

     The present versions of the programs may sometimes have difficulties  with
the  blank  lines  between  groups of lines, and if so you might want to retype
those lines, making sure that they have only a  carriage-return  and  no  blank
characters on them, or you may perhaps have to eliminate them.  The symptoms of
this problem are that the programs complain that the sequences are not properly
aligned, and you can find no other cause for this complaint.

------------------------------------------------


|||||||||||  ASN.1 file format
---------------------------------------------------


ASN.1 -- see NCBI toolkit docs, source and examples (ncbi.nlm.nih.gov)

Example asn.1 sequence file----

Bioseq-set ::= {
seq-set {
  seq {
    id { local id 1 } ,                 -- id essential
    descr {  title "Dummy sequence data from nowhere"  } ,  -- optional
    inst {                              -- inst essential
      repr raw ,
      mol dna ,
      length 156 ,
      topology linear ,
      seq-data
        iupacna "GAATTCATTTTTGAAACAAATCGACCTGACGACGGAATGGTACTCGAATTA
TGGGCCAAAGGGTTTTATGGGACAAATTAATAGGTGTTCATTATATGCCACTTTCGGAGATTAGATACAGCAATGCAG
TGGATTCAAAGCAATAGAGTTGTTCTT" 
      } } ,

        seq {
          id { local id 2 } ,
          descr {  title "Dummy sequence 2 data from somewhere else"  } ,
          inst {
                repr raw ,
                mol dna ,
                length 150 ,
                topology linear ,
                seq-data
                  iupacna "TTTTTTTTTTTTGAAACAAATCGACCTGACGACGGAATGGTACTCGAATTA
TGGGCCAAAGGGTTTTATGGGACAAATTAATAGGTGTTCATTATATGCCACTTTCGGAGATTAGATACAGCAATGCAG
TGGATTCAAAGCAATAGAGTT" 
            }
          }
        }
      }


partial ASN.1 description from toolkit

Bioseq ::= SEQUENCE {
    id SET OF Seq-id ,            -- equivalent identifiers
    descr Seq-descr OPTIONAL , -- descriptors
    inst Seq-inst ,            -- the sequence data
    annot SET OF Seq-annot OPTIONAL }

Seq-inst ::= SEQUENCE {            -- the sequence data itself
    repr ENUMERATED {              -- representation class
        not-set (0) ,              -- empty
        virtual (1) ,              -- no seq data
        raw (2) ,                  -- continuous sequence
        seg (3) ,                  -- segmented sequence
        const (4) ,                -- constructed sequence
        ref (5) ,                  -- reference to another sequence
        consen (6) ,               -- consensus sequence or pattern
        map (7) ,                  -- ordered map (genetic, restriction)
        other (255) } ,
    mol ENUMERATED {               -- molecule class in living organism
        not-set (0) ,              --   > cdna = rna
        dna (1) ,
        rna (2) ,
        aa (3) ,
        na (4) ,                   -- just a nucleic acid
        other (255) } ,
    length INTEGER OPTIONAL ,      -- length of sequence in residues
    fuzz Int-fuzz OPTIONAL ,       -- length uncertainty
    topology ENUMERATED {          -- topology of molecule
        not-set (0) ,
        linear (1) ,
        circular (2) ,
        tandem (3) ,               -- some part of tandem repeat
        other (255) } DEFAULT linear ,
    strand ENUMERATED {            -- strandedness in living organism
        not-set (0) ,
        ss (1) ,                   -- single strand
        ds (2) ,                   -- double strand
        mixed (3) ,
        other (255) } OPTIONAL ,   -- default ds for DNA, ss for RNA, pept
    seq-data Seq-data OPTIONAL ,   -- the sequence
    ext Seq-ext OPTIONAL ,         -- extensions for special types
  hist Seq-hist OPTIONAL }       -- sequence history

------------------------------------------------

|||||||||||  LinAll sequence file format
----------------------------------------

1234 seq1-id (1234 is sequence length, right justified)
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzab
cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
(80 characters per line, one sequence per file)

|||||||||||  Vienna RNA sequence file format
--------------------------------------------

> seq1-id
jsdhkasjlhsdlkjcbsd ... (one single line)
> seq2-id
odjgoirhggonavdskgj ... (one single line)