Metadata-Version: 2.1
Name: airr
Version: 1.5.1
Summary: AIRR Community Data Representation Standard reference library for antibody and TCR sequencing data.
Home-page: http://docs.airr-community.org
Author: AIRR Community
Author-email: 
License: CC BY 4.0
Keywords: AIRR,bioinformatics,sequencing,immunoglobulin,antibody,adaptive immunity,T cell,B cell,BCR,TCR
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics

Installation
------------------------------------------------------------------------------

Install in the usual manner from PyPI::

    > pip3 install airr --user

Or from the `downloaded <https://github.com/airr-community/airr-standards>`__
source code directory::

    > python3 setup.py install --user


Quick Start
------------------------------------------------------------------------------

Deprecation Notice
^^^^^^^^^^^^^^^^^^^^

The ``load_repertoire``, ``write_repertoire``, and ``validate_repertoire`` functions
have been deprecated in favor of the new generic ``read_airr``, ``write_airr``, and
``validate_airr`` functions. These new functions are backwards compatible with
the Repertoire metadata format but also support the new AIRR objects such as GermlineSet,
RepertoireGroup, GenotypeSet, Cell and Clone. This new format is defined by the DataFile
Schema, which describes a standard set of objects included in a file containing
AIRR Data Model representations. Currently, the AIRR DataFile does not completely support
Rearrangement, so users should continue using AIRR TSV files and their specific functions.
Also, the ``repertoire_template`` function has been deprecated in favor of the ``Schema.template``
method, which can now be called on any AIRR Schema to create a blank object.

Reading AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Data
Model files. The file format is either YAML or JSON, and the package provides a
light wrapper over the standard parsers. The file needs a ``json``, ``yaml``, or ``yml``
file extension so that the proper parser is utilized. All of the AIRR objects
are loaded into memory at once and no streaming interface is provided::

    import airr

    # Load the AIRR data
    data = airr.read_airr('input.airr.json')
    # loop through the repertoires
    for rep in data['Repertoire']:
        print(rep)

Why are the AIRR objects, such as Repertoire and GermlineSet, stored in a list rather than in a
dictionary keyed by their identifier (e.g., ``repertoire_id``)? There are two primary reasons.
First, the identifier might not have been assigned yet. Some systems allow MiAIRR
metadata to be entered before the identifier is assigned by another process. Without
the identifier, the data could not be stored in a dictionary. Second, the list gives the data
a default ordering. If you know that each object has a unique identifier, you can quickly
create a dictionary using a comprehension. For example, with repertoires::

    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

Another example, with germline sets::

    germline_dict = { obj['germline_set_id'] : obj for obj in data['GermlineSet'] }
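
The comprehensions above assume every object already has an identifier. As a
minimal sketch of how the unassigned case might be handled, the following
separates objects with an identifier from those without one. The helper name
``index_by_id`` and the sample data are hypothetical and not part of the
``airr`` package.

```python
def index_by_id(objects, id_field):
    """Split objects into a dict keyed by id_field and a list lacking an id."""
    indexed = {}
    unassigned = []
    for obj in objects:
        identifier = obj.get(id_field)
        if identifier is None:
            # Identifier not assigned yet, so it cannot be used as a key
            unassigned.append(obj)
        else:
            indexed[identifier] = obj
    return indexed, unassigned


# Hypothetical data in the same shape as the loaded AIRR data
data = {
    "Repertoire": [
        {"repertoire_id": "rep-1"},
        {"repertoire_id": None},
        {"repertoire_id": "rep-2"},
    ]
}
rep_dict, pending = index_by_id(data["Repertoire"], "repertoire_id")
```

Because Python dictionaries preserve insertion order, ``rep_dict`` keeps the
list's default ordering for the objects that do have identifiers.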

Writing AIRR Data Files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Writing an AIRR Data File is also a light wrapper over standard YAML or JSON
parsers. Multiple AIRR objects, such as Repertoire, GermlineSet, and etc., can be
written together into the same file. In this example, we use the ``airr`` library ``template``
method to create some blank Repertoire objects, and write them to a file.
As with the read function, the complete list of repertoires is written at once;
there is no streaming interface::

    import airr

    # Create some blank repertoire objects in a list
    data = { 'Repertoire': [] }
    for i in range(5):
        data['Repertoire'].append(airr.schema.RepertoireSchema.template())

    # Write the AIRR Data
    airr.write_airr('output.airr.json', data)

Reading AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package contains functions to read and write AIRR Rearrangement
TSV files as either iterables or pandas data frames. The usage is straightforward,
as the file format is a typical tab delimited file, but the package
performs some additional validation and type conversion beyond using a
standard CSV reader::

    import airr

    # Create an iterable that returns a dictionary for each row
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        print(row)

    # Load the entire file into a pandas data frame
    df = airr.load_rearrangement('input.tsv')
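
To illustrate what type conversion beyond a standard CSV reader can mean, here
is a rough sketch using only the standard library. It is not the package's
implementation: the sample rows, the ``convert_row`` helper, and the hard-coded
field types are assumptions for illustration, whereas the library derives types
from the AIRR Schema. The sketch relies on the AIRR TSV conventions that
booleans are encoded as ``T``/``F`` and nulls as empty strings.

```python
import csv
import io

# Hypothetical two-row AIRR-style TSV held in memory for illustration
TSV_TEXT = (
    "sequence_id\tproductive\tjunction_length\tv_call\n"
    "seq1\tT\t45\tIGHV1-2*02\n"
    "seq2\tF\t\tIGHV3-23*01\n"
)

# Assumed field types for this sketch only
BOOLEAN_FIELDS = {"productive"}
INTEGER_FIELDS = {"junction_length"}


def convert_row(row):
    """Convert the string values of one TSV row to Python types."""
    converted = {}
    for name, value in row.items():
        if value == "":
            converted[name] = None  # an empty cell is a missing value
        elif name in BOOLEAN_FIELDS:
            converted[name] = {"T": True, "F": False}.get(value)
        elif name in INTEGER_FIELDS:
            converted[name] = int(value)
        else:
            converted[name] = value
    return converted


# A plain CSV reader yields strings only; convert each row afterwards
reader = csv.DictReader(io.StringIO(TSV_TEXT), delimiter="\t")
rows = [convert_row(row) for row in reader]
```

A plain ``csv.DictReader`` would return ``'T'`` and ``'45'`` as strings, so
downstream code would have to repeat this conversion everywhere.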

Writing AIRR Rearrangement TSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to the read operations, write functions are provided for either creating
a writer class to perform row-wise output or writing the entire contents of
a pandas data frame to a file. Again, usage is straightforward with the ``airr``
output functions simply performing some type conversion and field ordering
operations::

    import airr

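    # Create a writer class for iterative row output,
    # copying rows from an existing file
    reader = airr.read_rearrangement('input.tsv')
    writer = airr.create_rearrangement('output.tsv')
    for row in reader:
        writer.write(row)

    # Write an entire pandas data frame to a file
    df = airr.load_rearrangement('input.tsv')
    airr.dump_rearrangement(df, 'file.tsv')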

By default, ``create_rearrangement`` writes only the ``required`` fields
to the output file. Additional fields can be included in the output file by
passing a list of additional field names as the ``fields`` parameter::

    # Specify additional fields in the output
    fields = ['new_calc', 'another_field']
    writer = airr.create_rearrangement('output.tsv', fields=fields)

A common operation is to read an AIRR rearrangement file, and then
write an AIRR rearrangement file with additional fields in it while
keeping all of the existing fields from the original file. The
``derive_rearrangement`` function provides this capability::

    import airr

    # Read rearrangement data and write new file with additional fields
    reader = airr.read_rearrangement('input.tsv')
    fields = ['new_calc']
    writer = airr.derive_rearrangement('output.tsv', 'input.tsv', fields=fields)
    for row in reader:
        row['new_calc'] = 'a value'
        writer.write(row)


Validating AIRR data files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package can validate AIRR Data Model JSON/YAML files and Rearrangement
TSV files to ensure that they contain all required fields and that the field types
match the AIRR Schema. This can be done with the ``airr-tools`` command
line program or by calling the validate functions in the library::

    # Validate a rearrangement TSV file
    airr-tools validate rearrangement -a input.tsv

    # Validate an AIRR DataFile
    airr-tools validate airr -a input.airr.json

Combining Repertoire metadata and Rearrangement files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``airr`` package does not currently keep track of which AIRR Data Model files
are associated with which Rearrangement TSV files, though there is ongoing work to define
a standardized manifest, so users will need to handle those
associations themselves. However, in the data, AIRR identifier fields, such as ``repertoire_id``,
form the link between objects in the AIRR Data Model.
The typical usage is that a program performs some
computation on the Rearrangements and needs access to the Repertoire metadata
as part of the computation logic. This example code shows the basic framework
for doing that, in this case a sex-specific computation::

    import airr

    # Load AIRR data containing repertoires
    data = airr.read_airr('input.airr.json')

    # Put repertoires in dictionary keyed by repertoire_id
    rep_dict = { obj['repertoire_id'] : obj for obj in data['Repertoire'] }

    # Create an iterable for rearrangement data
    reader = airr.read_rearrangement('input.tsv')
    for row in reader:
        # Get the repertoire metadata for this rearrangement
        rep = rep_dict[row['repertoire_id']]

        # Check the subject's sex
        if rep['subject']['sex'] == 'male':
            # do male specific computation
            pass
        elif rep['subject']['sex'] == 'female':
            # do female specific computation
            pass
        else:
            # do other specific computation
            pass
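
As a concrete, minimal sketch of this pattern, the following counts
rearrangements per subject sex using in-memory stand-ins for the loaded data.
The helper name ``count_by_sex`` and the sample records are hypothetical. With
real files, ``repertoires`` would be ``data['Repertoire']`` and
``rearrangements`` the reader returned by ``airr.read_rearrangement``.

```python
from collections import Counter


def count_by_sex(repertoires, rearrangements):
    """Count rearrangement rows per subject sex (hypothetical helper)."""
    rep_dict = {rep["repertoire_id"]: rep for rep in repertoires}
    counts = Counter()
    for row in rearrangements:
        rep = rep_dict.get(row["repertoire_id"])
        # Fall back to 'unknown' when the repertoire or its sex is missing
        sex = rep["subject"].get("sex") if rep else None
        counts[sex or "unknown"] += 1
    return counts


# Hypothetical in-memory stand-ins for the loaded metadata and rows
repertoires = [
    {"repertoire_id": "rep-1", "subject": {"sex": "male"}},
    {"repertoire_id": "rep-2", "subject": {"sex": "female"}},
]
rearrangements = [
    {"sequence_id": "s1", "repertoire_id": "rep-1"},
    {"sequence_id": "s2", "repertoire_id": "rep-2"},
    {"sequence_id": "s3", "repertoire_id": "rep-2"},
    {"sequence_id": "s4", "repertoire_id": "rep-9"},
]
counts = count_by_sex(repertoires, rearrangements)
```

Using ``dict.get`` for the lookup means a rearrangement whose ``repertoire_id``
has no matching metadata is counted as ``unknown`` rather than raising a
``KeyError``. Whether that is the right behavior depends on the analysis.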