Parser.txt
==========

Design documentation for Biopython parsers.


Design Overview
---------------

Parsers are built around an event-oriented design that includes
Scanner and Consumer objects.

Scanners take input from a data source and analyze it line by line,
sending off an event whenever they recognize some information in the
data.  For example, if the data includes information about an organism
name, the scanner may generate an "organism_name" event whenever it
encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
"organism_name" event, and then processes it in whatever manner the
current application requires.


Events
------

There are two types of events: info events that tag the location of
information within a data stream, and section events that mark
sections within a stream.  Info events are associated with specific
lines within the data, while section events are not.

Section event names must be in the format start_EVENTNAME and
end_EVENTNAME, where EVENTNAME is the name of the event.

For example, a FASTA-formatted sequence scanner may generate the
following events:

EVENT NAME      ORIGINAL INPUT
start_sequence
title           >gi|132871|sp|P19947|RL30_BACSU 50S RIBOSOMAL PROTEIN L30 (BL27
sequence        MAKLEITLKRSVIGRPEDQRVTVRTLGLKKTNQTVVHEDNAAIRGMINKVSHLVSVKEQ
end_sequence
start_sequence
title           >gi|132679|sp|P19946|RL15_BACSU 50S RIBOSOMAL PROTEIN L15
sequence        MKLHELKPSEGSRKTRNRVGRGIGSGNGKTAGKGHKGQNARSGGGVRPGFEGGQMPLFQRLPK
sequence        RKEYAVVNLDKLNGFAEGTEVTPELLLETGVISKLNAGVKILGNGKLEKKLTVKANKFSASAK
sequence        GTAEVI
end_sequence
[...]

(I cut the lines shorter so they'd look nicer in my editor.)

The FASTA scanner generated the following events: 'title', 'sequence',
'start_sequence', and 'end_sequence'.  Note that the 'start_sequence'
and 'end_sequence' events are not associated with any line in the
original input.  They are used to delineate separate sequences within
the file.

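The event stream described above can be sketched as a small Python
generator.  This is only an illustration, not Biopython's own code:
the function name fasta_events and the representation of an event as
an (event_name, line) tuple are assumptions made for this example.
Section events follow the start_EVENTNAME / end_EVENTNAME rule and
carry None in place of a line.  Lines that are blank, or that appear
before the first title, are simply skipped in this sketch.

```python
def fasta_events(lines):
    """Yield (event_name, line) pairs for FASTA-formatted input.

    Hypothetical helper for illustration only.  Info events carry the
    line they were found on; section events carry None.
    """
    in_sequence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A title line closes any open record and opens a new one.
            if in_sequence:
                yield ("end_sequence", None)
            yield ("start_sequence", None)
            in_sequence = True
            yield ("title", line)
        elif line.strip() and in_sequence:
            # Any non-blank line inside a record is sequence data.
            yield ("sequence", line)
    if in_sequence:
        # Close the last record once the input is exhausted.
        yield ("end_sequence", None)


if __name__ == "__main__":
    data = [
        ">seq1 first record\n",
        "MAKLEI\n",
        ">seq2 second record\n",
        "MKLHEL\n",
        "GTAEVI\n",
    ]
    for name, line in fasta_events(data):
        print(f"{name:16}{line or ''}")
```

Running the module prints one row per event, in the same two-column
layout as the listing above: the section events appear with an empty
second column because they are not tied to any input line.
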
The events a scanner can send must be specifically defined for each
data format.


'noevent' EVENT
---------------

A data file can contain lines that have no meaningful information,
such as blank lines.  By convention, a scanner should generate the
"noevent" event for these lines.


Scanners
--------

    class Scanner:
        def feed(self, handle, consumer):
            # Implementation
            pass

Scanners should implement a method named 'feed' that takes a file
handle and a consumer.  The scanner should read data from the file
handle and generate appropriate events for the consumer.


Consumers
---------

    class Consumer:
        # event handlers
        pass

Consumers contain methods that handle events.  The name of each method
is the name of the event that it handles.  Info events are passed the
line of the data containing the information, and section events are
passed nothing.

You are free to ignore events that are not interesting for your
application: simply do not implement methods for those events.

All consumers should be derived from the base Consumer class.

An example:

    class FASTAConsumer(Consumer):
        def title(self, line):
            # do something with the title
            pass
        def sequence(self, line):
            # do something with the sequence
            pass
        def start_sequence(self):
            # a new sequence starts
            pass
        def end_sequence(self):
            # a sequence ends
            pass
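
To show how a scanner and a consumer might fit together, here is a
minimal runnable sketch.  It is not Biopython's implementation, and
several details are assumptions made for illustration: the class names
FASTAScanner and RecordCollector, the use of __getattr__ in the base
Consumer to ignore unhandled events, and passing the line to
'noevent'.  Section events use the start_EVENTNAME / end_EVENTNAME
naming rule.

```python
import io


class Consumer:
    """Base consumer: any event without a handler is silently ignored."""

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. when a
        # subclass has not implemented a handler for this event.
        if name.startswith("_"):
            raise AttributeError(name)
        return lambda *args: None


class FASTAScanner:
    """Line-by-line FASTA scanner (illustrative sketch only)."""

    def feed(self, handle, consumer):
        in_sequence = False
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith(">"):
                # A title line ends the previous record, if any.
                if in_sequence:
                    consumer.end_sequence()
                consumer.start_sequence()
                in_sequence = True
                consumer.title(line)
            elif line.strip() and in_sequence:
                consumer.sequence(line)
            else:
                # Blank or otherwise uninformative line.
                consumer.noevent(line)
        if in_sequence:
            consumer.end_sequence()


class RecordCollector(Consumer):
    """Hypothetical consumer that collects (title, sequence) pairs."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def start_sequence(self):
        self._title = None
        self._chunks = []

    def title(self, line):
        self._title = line[1:]  # drop the leading '>'

    def sequence(self, line):
        self._chunks.append(line)

    def end_sequence(self):
        self.records.append((self._title, "".join(self._chunks)))


if __name__ == "__main__":
    handle = io.StringIO(">seq1 demo\nMAKL\nEITL\n\n>seq2 demo\nGTAEVI\n")
    collector = RecordCollector()
    FASTAScanner().feed(handle, collector)
    print(collector.records)
```

RecordCollector does not define a 'noevent' method, so the blank line
in the sample input is dropped by the base class.  Keeping the parsing
logic in the scanner means a different application only has to supply
a different consumer.
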