Samfile -> AlignmentFile
AlignedRead -> AlignedSegment
Tabixfile -> TabixFile
Fastafile -> FastaFile
Fastqfile -> FastqFile

Changes to the AlignedSegment API:

Basic attributes
================

qname -> query_name
tid -> reference_id
pos -> reference_start
mapq -> mapping_quality
rnext -> next_reference_id
pnext -> next_reference_start
cigar = alignment (now returns CigarAlignment object)
cigarstring = cigarstring
tlen -> template_length
seq -> query_sequence
qual -> query_qualities, now returns array
tags = tags (now returns Tags object)

Derived simple attributes
=========================

alen -> reference_length, reference is always "alignment", so removed
aend -> reference_end
rlen -> query_length
query -> query_alignment_sequence
qqual -> query_alignment_qualities, now returns array
qstart -> query_alignment_start
qend -> query_alignment_end
qlen -> query_alignment_length, kept, because can be computed without
        fetching the sequence
mrnm -> next_reference_id
mpos -> next_reference_start
rname -> reference_id
isize -> template_length

Complex attributes - functions
===============================

blocks -> getBlocks()
aligned_pairs -> getAlignedPairs()
inferred_length -> inferQueryLength()
positions -> getReferencePositions()
overlap() -> getOverlap()

Backwards incompatible changes:
================================

1. Empty cigarstring now returns None (instead of '')
2. Empty cigar now returns None (instead of [])
3. When using the extension classes in cython modules, AlignedRead
   needs to be substituted with AlignedSegment. Automatic casting of
   the base class to the derived class seems not to work?
4. fancy_str() has been removed

=====================================================

Kevin's suggestions:

* smarter file opener
* CRAM support
* CSI index support (create and use)
* better attribute and property names
* remove deprecated names
* add object oriented CIGAR and tag handling
* fetch query that recruits mate pairs that overlap query region
* other commonly re-invented functionality

pos -> reference_start
aend -> reference_stop
alen -> reference_length
tid -> ref_id
qname -> template_name (not a high priority, but would be clearer)
qstart -> query_start (really segment_align_start)
qend -> query_stop (really segment_align_stop)
qlen -> query_length (really segment_align_length)
qqual -> query_qual (really segment_align_qual)
rlen -> drop in favor of len(align.seq)
inferred_length -> inferred_query_length
is_* -> replace with flag object that is a subclass of int
cigarstring -> str(align.cigar)
tags -> opts object with a mapping-like interface

Non-backward compatible:

rname, mrnm, mpos, isize -> remove
cigar -> CigarSequence object (or something similar)
All coordinate and length queries return None when a value is not
available (currently some return None, some 0)
Store qualities as an array or bytes type

Marcel's suggestions:

I recently sent in a pull request (which was merged) that improves the
pysam doc a bit. While preparing that, I also wrote down some ways in
which the API itself could be improved. I think the API is ok, but for
someone like me who does not use pysam every day, some of the details
are hard to remember, and every time I do very basic things I end up
looking them up again in the docs. I can recommend this article,
written for C++, but many points still apply:
http://qt-project.org/wiki/API_Design_Principles .

Originally, I wanted to convert at least some of these suggestions to
pull requests, but I'm not sure I have the time for that in the near
future.
So before I forget about this completely, I thought it best to at least
send this to the list. I'm concentrating on issues I found in Samfile
and AlignedRead here.

- The terminology is inconsistent: often two words are used to describe
  the same thing (opt vs. tag, query vs. read, reference vs. target).
  My suggestion is to consistently use the same terms as in the SAM
  spec: tag, query, reference. This applies both to documentation and
  to function/property names.

- In line with the document linked to above (see the section "The Art
  of Naming"): do not abbreviate function and property names. For
  example, tlen -> template_length, pos -> position,
  mapq -> mapping_quality etc.

- Be consistent with multiword function and variable names. I suggest
  using the PEP8 convention. This isn't so visible to the user in
  Samfile and AlignedRead, it seems, but there are things like
  convertBinaryTagToList which could be renamed to
  convert_binary_tag_to_list.

- Don't make functions that are not setters or getters a property.
  This applies to AlignedRead.positions/.inferred_length/
  .aligned_pairs/.blocks and currently also Samfile.references, for
  example. Making these a property implies to the user that access
  takes constant time. This is important in code like this:

      for i in range(n):
          do_something_with(read.positions[i])

  This looks inconspicuous, but is possibly inefficient since
  .positions actually builds up the result it returns each time it's
  accessed.

- Update the examples in the docs to use context manager syntax.

Particular to Samfile:

- Deprecate Samfile.nreferences: propose to use
  len(samfile.references) instead. Cache Samfile.references so it's
  not re-created every time.

- Move Samfile.mapped/.unmapped/.nocoordinate into a .stats attribute
  (Samfile.stats.mapped/.unmapped/.nocoordinate).

Particular to AlignedRead:

- When read is an AlignedRead, print(read) should print out a properly
  formatted SAM line.

- Force assignment to .qual/.seq to go through a function that sets
  both at the same time. This avoids the problem that they must be set
  in a particular order, which is easy to forget and only 'enforced'
  through documentation.

- Deprecate rlen, use len(read.seq) instead.

- Handling of tags is a little awkward. Would be cool if this worked:

      read.tags['XK'] = 'hello'  # add a new tag if it doesn't exist
      read.tags['AS'] += 17      # update the value of a tag
      del read.tags['AS']
      if 'AS' in read.tags: ...

- Add a property AlignedRead.qualities that behaves like a Py3-bytes
  object in both Py2 and Py3. That is, accessing read.qualities[0]
  gives you an int value that represents the quality. The fact that
  qualities are encoded as ASCII(q+33) is an implementation detail
  that can be hidden.

  Done

- And finally: add a Cigar class. I guess this is already what
  'improved CIGAR string handling' refers to in the roadmap.
  AlignedRead.cigar would then return a Cigar object that can be
  printed, compared and concatenated to others, whose length can be
  measured etc. The .unclipped and .inferred_length properties can
  also be moved here.

  Or perhaps: make this an Alignment class that even knows about the
  two strings it is aligning. One could then also iterate over all
  columns of the alignment. But I guess this goes too far since that's
  what the AlignedRead itself should be.

Regards,
Marcel
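
Note on the tag handling suggestion: the mapping-style interface asked
for above (read.tags['XK'] = 'hello' and so on) could be prototyped
roughly as follows. This is only a sketch under assumptions. TagMap is
a hypothetical name and not part of pysam, and it keeps the tags in a
plain Python list of (tag, value) pairs instead of reading or writing
the underlying BAM record.

```python
from collections.abc import MutableMapping


class TagMap(MutableMapping):
    """Hypothetical mapping view over a list of (tag, value) pairs."""

    def __init__(self, pairs=()):
        # A list (not a dict) keeps the tags in their original order.
        self._pairs = list(pairs)

    def _index(self, tag):
        # Linear search is acceptable: a read carries only a few tags.
        for i, (name, _) in enumerate(self._pairs):
            if name == tag:
                return i
        raise KeyError(tag)

    def __getitem__(self, tag):
        return self._pairs[self._index(tag)][1]

    def __setitem__(self, tag, value):
        try:
            # Update in place if the tag already exists ...
            self._pairs[self._index(tag)] = (tag, value)
        except KeyError:
            # ... otherwise append it as a new tag.
            self._pairs.append((tag, value))

    def __delitem__(self, tag):
        del self._pairs[self._index(tag)]

    def __iter__(self):
        return (name for name, _ in self._pairs)

    def __len__(self):
        return len(self._pairs)


# The four operations from the suggestion, on a stand-alone TagMap.
tags = TagMap([("NM", 1), ("AS", 40)])
tags["XK"] = "hello"      # add a new tag if it doesn't exist
tags["AS"] += 17          # update the value of a tag
has_as = "AS" in tags     # membership test, provided by MutableMapping
del tags["AS"]
```

Deriving from MutableMapping means that only the five methods above
have to be written; "in", .get(), .items(), .update() and the like come
for free, which keeps such an object close to what users already know
from dict.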