#!/usr/bin/env python
"""Read GenBank flat files.

Currently only reads sequence data and not annotations.
"""

from typing import Iterator, Optional, TextIO

from ..seq import Alphabet, Seq, SeqList
from ..utils import isblank

names = ("genbank",)
extensions = ("gb", "genbank", "gbk")

def read(fin: TextIO, alphabet: Optional[Alphabet] = None) -> SeqList:
    """Read and parse a file of genbank records.

    Args:
        fin -- A stream or file to read
        alphabet -- The expected alphabet of the data, if given
    Returns:
        SeqList -- A list of sequences
    Raises:
        ValueError -- If the file is unparsable
    """
    seqs = list(iterseq(fin, alphabet))
    return SeqList(seqs)

def iterseq(fin: TextIO, alphabet: Optional[Alphabet] = None) -> Iterator[Seq]:
    """Iterate over GenBank records.

    Args:
        fin -- A stream or file to read
        alphabet -- The expected alphabet of the data, if given
    Yields:
        Seq -- One alphabetic sequence at a time.
    Raises:
        ValueError -- If the file is unparsable
    """
    alphabet = Alphabet(alphabet)

    # Parser states: expecting a LOCUS line, skipping annotations, reading sequence.
    header, block, data = range(3)
    state = header
    seq: list = []

    # L is the zero-based index of the current line.
    for L, line in enumerate(fin):
        if isblank(line):
            continue

        if state == header:
            # The file must open with a LOCUS line.
            if not line.startswith("LOCUS"):
                raise ValueError("Cannot find start of record at line %d" % L)
            state = block

        elif state == block:
            # Skip annotation lines until the sequence section begins.
            if line.startswith("ORIGIN") or line.startswith("//"):
                state = data

        elif state == data:
            if line.startswith("//"):
                # End of record: emit the accumulated sequence and reset.
                yield Seq("".join(seq), alphabet)
                seq = []
                state = block
            else:
                # Drop the leading position number; keep the sequence chunks.
                seq.extend(line.split()[1:])