File: refactoring.txt

package info (click to toggle)
python-pysam 0.23.0%2Bds-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 18,468 kB
  • sloc: ansic: 158,936; python: 8,604; sh: 338; makefile: 264; perl: 41
file content (198 lines) | stat: -rw-r--r-- 7,635 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
;; This buffer is for notes you don't want to save, and for Lisp evaluation.
;; If you want to create a file, visit that file with C-x C-f,
;; then enter the text in that file's own buffer.

Samfile -> AlignmentFile
AlignedRead -> AlignedSegment
Tabixfile -> TabixFile
Fastafile -> FastaFile
Fastqfile -> FastqFile

Changes to the AlignedSegment.API:

Basic attributes
================

qname -> query_name
tid -> reference_id
pos -> reference_start
mapq -> mapping_quality
rnext -> next_reference_id
pnext -> next_reference_start
cigar = alignment (now returns CigarAlignment object)
cigarstring = cigarstring
tlen -> query_length
seq -> query_sequence
qual -> query_qualities, now returns array
tags = tags (now returns Tags object)


Derived simple attributes
=========================

alen -> reference_length, reference is always "alignment", so removed
aend -> reference_end
rlen -> query_length 
query -> query_alignment_sequence
qqual -> query_alignment_qualities, now returns array
qstart -> query_alignment_start
qend -> query_alignment_end
qlen -> query_alignment_length, kept, because can be computed without fetching the sequence
mrnm -> next_reference_id   
mpos -> next_reference_start
rname -> reference_id
isize -> query_length


Complex attributes - functions
===============================

blocks -> getBlocks()
aligned_pairs -> getAlignedPairs()
inferred_length -> inferQueryLength()
positions -> getReferencePositions()
overlap() -> getOverlap()


Backwards incompatible changes:
================================

1. Empty cigarstring now returns None (instead of '')

2. Empty cigar now returns None (instead of [])

3. When using the extension classes in cython modules, AlignedRead
   needs to be substituted with AlignedSegment. Automatic casting
   of the base class to the derived class seems not to work?

4. fancy_str() has been removed


=====================================================

Kevin's suggestions:


* smarter file opener
* CRAM support
* CSI index support (create and use)
* better attribute and property names
* remove deprecated names
* add object oriented CIGAR and tag handling
* fetch query that recruits mate pairs that overlap query region
* other commonly re-invented functionality

        qname -> template_name
        pos -> reference_start
        aend -> reference_stop
        alen -> reference_length
        tid -> ref_id
        qname -> template_name (not a high priority, but would be clearer)
        qstart -> query_start (really segment_align_start)
        qend -> query_stop (really segment_align_stop)
        qlen -> query_length (really segment_align_length)
        qqual -> query_qual (really segment_align_qual)
        rlen -> drop in favor of len(align.seq)
        inferred_length -> inferred_query_length
        is_* -> replace with flag object that is a subclass of int
        cigarstring -> str(align.cigar)
        tags -> opts object with a mapping-like interface
    Non-backward compatible:
        rname, mrnm, mpos, isize -> remove
        cigar -> CigarSequence object (or something similar)
        All coordinate and length queries return None when a value is not available (currently some return None, some 0)
        Store qualities as an array or bytes type


Marcel's suggestions:

    I recently sent in a pull request (which was merged) that improves the pysam doc a bit. While preparing that, I also wrote down some ways in which the API itself could be improved. I think the API is ok, but for someone like me who uses pysam not every day, some of the details are hard to remember and every time I do very basic things I end up looking them up again in the docs. I can recommend this article, written for C++, but many points still apply:
    http://qt-project.org/wiki/API_Design_Principles .

    Originally, I wanted to convert at least some of these suggestions to pull requests, but I'm not sure I have the time for that in the near future. So before I forget about this completely, I thought it's best to at least send this to the list. I'm concentrating on issues I found in Samfile and AlignedRead here.

    - The terminology is inconsistent - often two words are used
      to describe the same thing: opt vs. tag, query vs. read,
      reference vs. target. My suggestion is to consistently use
      the same terms as in the SAM spec: tag, query,
      reference. This applies both to documentation and
      function/property names.

    - In line with the document linked to above (see the
      section "The Art of Naming"): Do not abbreviate function
      and property names. For example, tlen -> template_length,
      pos -> position, mapq -> mapping_quality etc.

    - Be consistent with multiword function and variable names. I
      suggest to use the PEP8 convention. This isn't so visible
      to the user in Samfile and AlignedRead, it seems, but there
      are things like convertBinaryTagToList which could be
      renamed to convert_binary_tag_to_list.

    - Don't make functions that are not setters or getters a
      property. This applies to
      AlignedRead.positions/.inferred_length/.aligned_pairs/.blocks
      and currently also Samfile.references, for example. Making
      these a property implies to the user that access requires
      constant time. This is important in code like this:

    for i in range(n):
     do_something_with(read.positions[i])

    This looks inconspicuous, but is possibly inefficient since
    .positions actually builds up the result it returns each time
    it's accessed.

    - Update the examples in the docs to use context manager
      syntax.

    Particular to Samfile:

    - Deprecate Samfile.nreferences: propose to use
      len(samfile.references) instead. Cache Samfile.references
      so it's not re-created every time.

    - Move Samfile.mapped/.unmapped/.nocoordinate into a .stats
      attribute (Samfile.stats.mapped/.unmapped/.noocordinate).

    Particular to AlignedRead:

    - When read is an AlignedRead, print(read) should print out a
      properly formatted SAM line.

    - Force assignment to .qual/.seq to go through a function
      that sets both at the same time. This avoids the problem
      that they must be set in a particular order, which is easy
      to forget and only 'enforced' through documentation.

    - Deprecate rlen, use len(read.seq) instead.

    - Handling of tags is a little awkward. Would be cool if this
      worked:

    read.tags['XK'] = 'hello'  # add a new tag if it doesn't exist
    read.tags['AS'] += 17      # update the value of a tag
    del read.tags['AS']
    if 'AS' in read.tags: ...

    - Add a property AlignedRead.qualities that behaves like a
      Py3-bytes object in both Py2 and Py3. That is, accessing
      read.qualities[0] gives you an int value that represents
      the quality. The fact that qualities are encoded as
      ASCII(q+33) is an implementation detail that can be hidden.

      Done      

    - And finally: Add a Cigar class. I guess this is already
      what 'improved CIGAR string handling' refers to in the
      roadmap. AlignedRead.cigar would then return a Cigar object
      that can be printed, compared and concatenated to others,
      whose length can be measured etc. The .unclipped and
      .inferred length properties can also be moved here. Or
      perhaps: Make this an Alignment class that even knows about
      the two strings it is aligning. One could then also iterate
      over all columns of the alignment. But I guess this goes
      too far since that's what the AlignedRead itself should be.

    Regards,
    Marcel