File: basics.rst

package info (click to toggle)
libcifpp 9.0.5-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 15,528 kB
  • sloc: cpp: 33,979; sh: 105; makefile: 12
file content (400 lines) | stat: -rw-r--r-- 14,011 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
Basic usage
===========

This library, *libcifpp*, is a generic *CIF* library with some specific additions to work with *mmCIF* files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid *and* their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.

Reading a file is as simple as:

.. code-block:: cpp

    #include <cif++.hpp>

    cif::file f("/path/to/file.cif");

The file may also be compressed using *gzip* which is detected automatically.

Writing out the file again is also simple, to write out the terminal you can do:

.. code-block:: cpp

    std::cout << f;

    // or
    f.save(std::cout);

    // or write a compressed file using gzip compression:
    f.save("/tmp/f.cif.gz");

CIF files contain one or more datablocks. To print out the names of all datablocks in our file:

.. code-block:: cpp

    for (auto &db : f)
        std::cout << db.name() << '\n';

Most often *libcifpp* is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:

.. code-block:: cpp

    // get a reference to the first datablock in f
    auto &db = f.front();

But if you know the name of the datablock, this also works:

.. code-block:: cpp

    // get a reference to the datablock name '1CBS'
    auto &db = f["1CBS"];

Now, each datablock contains categories. To print out all their names:

.. code-block:: cpp

    for (auto &cat : db)
        std::cout << cat.name() << '\n';

But you probably know what category you need to use, so lets fetch it by name:

.. _atom_site-label:
.. code-block:: cpp

    // get a reference to the atom_site category in db
    auto &atom_site = db["atom_site"];

    // and make sure there's some data in it:
    assert(not atom_site.empty());

.. note::
    
    Note that we omit the leading underscore in the name of the category here.

Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a :cpp:class:`cif::row_handle` object which you can use to request or manipulate item data.

.. code-block:: cpp

    // Get the first row in atom_site
    auto rh = atom_site.front();

    // Get the label_atom_id value from this row handle as a std::string
    std::string atom_id = rh["label_atom_id"].as<std::string>();

    // Get the x, y and z coordinates using structered binding
    const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");

    // Assign a new value to the x coordinate or our atom
    rh["Cartn_x"] = x + 1;

Querying
--------

Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function :cpp:func:`cif::category::find` and friends are here to help.

What these functions have in common is that they return data based on a query implemented by :cpp:class:`cif::condition`. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:

.. code-block:: cpp

    cif::condition c = cif::key("id") == 1;

Here the condition is that all rows returned should have a value of 1 in there item named *id*. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:

.. code-block:: cpp

    // condition for C-alpha atoms having an occupancy less than 1.0
    cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";

Using the namespace *cif::literals* that code becomes a little less verbose:

.. code-block:: cpp

    using namespace cif::literals;
    cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";

Conditions can also be combined:

.. code-block:: cpp

    cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";

    // extend the condition by requiring the compound ID to be unequal to PRO
    c = std::move(c) and "label_comp_id"_key != "PRO";

.. note::

    Note the use of std::move here. 

Using queries constructed in this way is simple:

.. code-block:: cpp

    cif::condition c = ...
    auto result = atom_site.find(std::move(c));

    // or construct a condition inline:
    auto result = atom_site.find("label_atom_id"_key == "CA");

In the example above the result is a range of :cpp:class:`cif::row_handle` objects. Often, using individual field values is more useful:

.. code-block:: cpp

    // Requesting a single item:
    for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
        std::cout << "ID for CA: " << id << '\n';

    // Requesting multiple items:
    for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
            "id", "Cartn_x", "Cartn_y", "Cartn_z"))
    {
        std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
    }

Returning a complete set if often not required, if you only want to have the first you can use :cpp:func:`cif::category::find_first` as shown here:

.. code-block:: cpp

    // return the ID item for the first C-alpha atom
    std::string v1 = atom_site.find_first<std::string>("label_atom_id"_key == "CA", "id");

    // If you're not sure the row exists, use std::optional
    auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
    if (v2.has_value())
        ...

There are cases when you really need exactly one result. The :cpp:func:`cif::category::find1` can be used in that case, it will throw an exception if the query does not result in exactly one row.

NULL and ANY
------------

Sometimes items may be empty. The trouble is a bit that empty comes in two flavors: unknown and null. Null in *CIF* parlance means the item should not contain a value since it makes no sense in this case, the value stored in the file is a single dot character: ``'.'``. E.g. *atom_site* records may have a NULL value for label_seq_id for atoms that are part of a *non-polymer*.

The other empty value is indicated by a question mark character: ``'?'``. This means the value is simply unknown.

Both these are NULL in *libcifpp* conditions and can be searched for using :cpp:var:`cif::null`.

So you can search for:

.. code-block:: cpp

    cif::condition c = "label_seq_id"_key == cif::null;

You might also want to look for a certain value and don't care in which item it is stored, in that case you can use :cpp:var:`cif::any`.

.. code-block:: cpp

    cif::condition c = cif::any == "foo";

And in linked record you might have the items that have a value in both parent and child or both should be NULL. For that, you can request the value to return by find to be of type std::optional and then use that value to build the query. An example to explain this, let's find the location of the atom that is referenced as the first atom in a struct_conn record:

.. code-block:: cpp

    // Take references to the two categories we need
    auto struct_conn = db["struct_conn"];
    auto atom_site = db["atom_site"];

    // Loop over all rows in struct_conn taking only the values we need
    // Note that the label_seq_id is returned as a std::optional<int>
    // That means it may contain an integer or may be empty
    for (const auto &[asym1, seqid1, authseqid1, atomid1] :
        struct_conn.rows<std::string,std::optional<int>,std::string,std::string,std::string>(
            "ptnr1_label_asym_id", "ptnr1_label_seq_id", "ptnr1_auth_seq_id", "ptnr1_label_atom_id"
        ))
    {
        // Find the location of the first atom
        cif::point p1 = atom_site.find1<float,float,float>(
            "label_asym_id"_key == asym1 and "label_seq_id"_key == seqid1 and "auth_seq_id"_key == authseqid1 and "label_atom_id"_key == atomid1,
            "cartn_x", "cartn_y", "cartn_z");
    }
    

Validation
----------

CIF files can have a dictionary attached. And based on such a dictionary a :cpp:class:`cif::validator` object can be constructed which in turn can be used to validate the content of the file.

A simple case:

.. code-block:: cpp

    #include <cif++.hpp>

    cif::file f("1cbs.cif.gz");
    f.load_dictionary("mmcif_pdbx.dic");

    if (not f.is_valid())
        std::cout << "This file is not valid\n";

If you want to know why it is not valid, you should set the global variable :cpp:var:`cif::VERBOSE` to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.

In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that's a bit more work:

.. code-block:: cpp

    std::filesystem::ifstream dictFile("/tmp/my-dictionary.dic");
    auto &validator = cif::parse_dictionary("my-dictionary", dictFile);

    cif::file f("1cbs.cif.gz");

    // assign the validator
    f.set_validator(&validator);

    // alternatively, load it by name
    f.load_dictionary("my-dictionary");

    if (not f.is_valid())
        std::cout << "This file is not valid\n";

Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):

.. code-block:: cpp

    // db is a cif::datablock reference containing an mmCIF file with DSSP annotations
    auto &validator = const_cast<cif::validator &>(*db.get_validator());
    if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
    {
        auto dssp_extension = cif::load_resource("dssp-extension.dic");
        if (dssp_extension)
            cif::extend_dictionary(validator, *dssp_extension);
    }

.. note::

    In the example above we're loading the data using :doc:`/resources`. See the documentation on that for more information.

If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: :ref:`_atom_site-label`):

.. code-block:: cpp
    
    auto rh = atom_site.front();
    rh["Cartn_x"] = "foo";

Linking
-------

Based on information recorded in dictionary files (see :ref:`Validation`) you can locate linked records in parent or child categories.

To make this example not too complex, lets assume the following example file:

.. code-block:: cif

    data_test
    loop_
    _cat_1.id
    _cat_1.name
    _cat_1.desc
    1 aap  Aap
    2 noot Noot
    3 mies Mies

    loop_
    _cat_2.id
    _cat_2.name
    _cat_2.num
    _cat_2.desc
    1 aap  1 'Een dier'
    2 aap  2 'Een andere aap'
    3 noot 1 'walnoot bijvoorbeeld'

And we have a dictionary containing the following link definition:

.. code-block:: cif

    loop_
    _pdbx_item_linked_group_list.parent_category_id
    _pdbx_item_linked_group_list.link_group_id
    _pdbx_item_linked_group_list.parent_name
    _pdbx_item_linked_group_list.child_name
    _pdbx_item_linked_group_list.child_category_id
    cat_1 1 '_cat_1.name' '_cat_2.name' cat_2

So, there are links between *cat_1* and *cat_2* based on the value in items named *name*. Using this information, we can now locate children and parents:

.. code-block:: cpp

    // Assuming the file was loaded in f:
    auto &cat1 = f.front()["cat_1"];
    auto &cat2 = f.front()["cat_2"];
    auto &cat3 = f.front()["cat_3"];

    // Loop over all ape's in cat2
    for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
        std::cout << r.get<std::string>("desc") << '\n';

Updating a value in an item in a parent category will update the corresponding value in all related children:

.. code-block:: cpp

    auto r1 = cat1.find1("id"_key == 1);
    r1["name"] = "aapje";

    auto rs1 = cat2.find("name"_key == "aapje");
    assert(rs1.size() == 2);

However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:

.. code-block:: cpp

    auto r2 = cat2.find1("id"_key == 3);
    r2["name"] = "wim";

    assert(f.is_valid() == false);

So you have to fix this yourself by inserting a new item in cat1 with the new value.

.. _splitting-rows:
Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:

.. code-block:: cif

    data_test
    loop_
    _cat_1.id
    _cat_1.name
    _cat_1.desc
    1 aap  Aap
    2 noot Noot
    3 mies Mies

    loop_
    _cat_2.id
    _cat_2.name
    _cat_2.num
    _cat_2.desc
    1 aap  1 'Een dier'
    2 aap  2 'Een andere aap'
    3 noot 1 'walnoot bijvoorbeeld'

    loop_
    _cat_3.id
    _cat_3.name
    _cat_3.num
    1 aap 1
    2 aap 2

And we have a dictionary containing the following link definition (reversed compared to the previous example):

.. code-block:: cif

    loop_
    _pdbx_item_linked_group_list.parent_category_id
    _pdbx_item_linked_group_list.link_group_id
    _pdbx_item_linked_group_list.parent_name
    _pdbx_item_linked_group_list.child_name
    _pdbx_item_linked_group_list.child_category_id
    cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
    cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
    cat_3 1 '_cat_3.num'  '_cat_2.num'  cat_2

So *cat3* is a parent of *cat2* and *cat2* is a parent of *cat1*. Now, if you change the *name* value of the first row of *cat3* to 'aapje', the corresponding row in *cat2* is updated as well. But when you update *cat2* you have to update *cat1* too. And simply changing the name field in row 1 of *cat1* is wrong. The default behaviour in libcifpp is to split the record in *cat1* and have a new child with the new name whereas the other remains as is.

The new *cat1* will thus be like:

.. code-block:: cif

    loop_
    _cat_1.id
    _cat_1.name
    _cat_1.desc
    1 aapje Aap
    2 noot  Noot
    3 mies  Mies
    5 aap   Aap