1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
|
Basic usage
===========
This library, *libcifpp*, is a generic *CIF* library with some specific additions to work with *mmCIF* files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid *and* their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.
Reading a file is as simple as:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("/path/to/file.cif");
The file may also be compressed using *gzip* which is detected automatically.
Writing out the file again is also simple, to write out the terminal you can do:
.. code-block:: cpp
std::cout << f;
// or
f.save(std::cout);
// or write a compressed file using gzip compression:
f.save("/tmp/f.cif.gz");
CIF files contain one or more datablocks. To print out the names of all datablocks in our file:
.. code-block:: cpp
for (auto &db : f)
std::cout << db.name() << '\n';
Most often *libcifpp* is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:
.. code-block:: cpp
// get a reference to the first datablock in f
auto &db = f.front();
But if you know the name of the datablock, this also works:
.. code-block:: cpp
// get a reference to the datablock name '1CBS'
auto &db = f["1CBS"];
Now, each datablock contains categories. To print out all their names:
.. code-block:: cpp
for (auto &cat : db)
std::cout << cat.name() << '\n';
But you probably know what category you need to use, so lets fetch it by name:
.. _atom_site-label:
.. code-block:: cpp
// get a reference to the atom_site category in db
auto &atom_site = db["atom_site"];
// and make sure there's some data in it:
assert(not atom_site.empty());
.. note::
Note that we omit the leading underscore in the name of the category here.
Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a :cpp:class:`cif::row_handle` object which you can use to request or manipulate item data.
.. code-block:: cpp
// Get the first row in atom_site
auto rh = atom_site.front();
// Get the label_atom_id value from this row handle as a std::string
std::string atom_id = rh["label_atom_id"].as<std::string>();
// Get the x, y and z coordinates using structered binding
const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");
// Assign a new value to the x coordinate or our atom
rh["Cartn_x"] = x + 1;
Querying
--------
Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function :cpp:func:`cif::category::find` and friends are here to help.
What these functions have in common is that they return data based on a query implemented by :cpp:class:`cif::condition`. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:
.. code-block:: cpp
cif::condition c = cif::key("id") == 1;
Here the condition is that all rows returned should have a value of 1 in there item named *id*. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:
.. code-block:: cpp
// condition for C-alpha atoms having an occupancy less than 1.0
cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";
Using the namespace *cif::literals* that code becomes a little less verbose:
.. code-block:: cpp
using namespace cif::literals;
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
Conditions can also be combined:
.. code-block:: cpp
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
// extend the condition by requiring the compound ID to be unequal to PRO
c = std::move(c) and "label_comp_id"_key != "PRO";
.. note::
Note the use of std::move here.
Using queries constructed in this way is simple:
.. code-block:: cpp
cif::condition c = ...
auto result = atom_site.find(std::move(c));
// or construct a condition inline:
auto result = atom_site.find("label_atom_id"_key == "CA");
In the example above the result is a range of :cpp:class:`cif::row_handle` objects. Often, using individual field values is more useful:
.. code-block:: cpp
// Requesting a single item:
for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
std::cout << "ID for CA: " << id << '\n';
// Requesting multiple items:
for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
"id", "Cartn_x", "Cartn_y", "Cartn_z"))
{
std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
}
Returning a complete set if often not required, if you only want to have the first you can use :cpp:func:`cif::category::find_first` as shown here:
.. code-block:: cpp
// return the ID item for the first C-alpha atom
std::string v1 = atom_site.find_first<std::string>("label_atom_id"_key == "CA", "id");
// If you're not sure the row exists, use std::optional
auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
if (v2.has_value())
...
There are cases when you really need exactly one result. The :cpp:func:`cif::category::find1` can be used in that case, it will throw an exception if the query does not result in exactly one row.
NULL and ANY
------------
Sometimes items may be empty. The trouble is a bit that empty comes in two flavors: unknown and null. Null in *CIF* parlance means the item should not contain a value since it makes no sense in this case, the value stored in the file is a single dot character: ``'.'``. E.g. *atom_site* records may have a NULL value for label_seq_id for atoms that are part of a *non-polymer*.
The other empty value is indicated by a question mark character: ``'?'``. This means the value is simply unknown.
Both these are NULL in *libcifpp* conditions and can be searched for using :cpp:var:`cif::null`.
So you can search for:
.. code-block:: cpp
cif::condition c = "label_seq_id"_key == cif::null;
You might also want to look for a certain value and don't care in which item it is stored, in that case you can use :cpp:var:`cif::any`.
.. code-block:: cpp
cif::condition c = cif::any == "foo";
And in linked record you might have the items that have a value in both parent and child or both should be NULL. For that, you can request the value to return by find to be of type std::optional and then use that value to build the query. An example to explain this, let's find the location of the atom that is referenced as the first atom in a struct_conn record:
.. code-block:: cpp
// Take references to the two categories we need
auto struct_conn = db["struct_conn"];
auto atom_site = db["atom_site"];
// Loop over all rows in struct_conn taking only the values we need
// Note that the label_seq_id is returned as a std::optional<int>
// That means it may contain an integer or may be empty
for (const auto &[asym1, seqid1, authseqid1, atomid1] :
struct_conn.rows<std::string,std::optional<int>,std::string,std::string,std::string>(
"ptnr1_label_asym_id", "ptnr1_label_seq_id", "ptnr1_auth_seq_id", "ptnr1_label_atom_id"
))
{
// Find the location of the first atom
cif::point p1 = atom_site.find1<float,float,float>(
"label_asym_id"_key == asym1 and "label_seq_id"_key == seqid1 and "auth_seq_id"_key == authseqid1 and "label_atom_id"_key == atomid1,
"cartn_x", "cartn_y", "cartn_z");
}
Validation
----------
CIF files can have a dictionary attached. And based on such a dictionary a :cpp:class:`cif::validator` object can be constructed which in turn can be used to validate the content of the file.
A simple case:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("1cbs.cif.gz");
f.load_dictionary("mmcif_pdbx.dic");
if (not f.is_valid())
std::cout << "This file is not valid\n";
If you want to know why it is not valid, you should set the global variable :cpp:var:`cif::VERBOSE` to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.
In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that's a bit more work:
.. code-block:: cpp
std::filesystem::ifstream dictFile("/tmp/my-dictionary.dic");
auto &validator = cif::parse_dictionary("my-dictionary", dictFile);
cif::file f("1cbs.cif.gz");
// assign the validator
f.set_validator(&validator);
// alternatively, load it by name
f.load_dictionary("my-dictionary");
if (not f.is_valid())
std::cout << "This file is not valid\n";
Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):
.. code-block:: cpp
// db is a cif::datablock reference containing an mmCIF file with DSSP annotations
auto &validator = const_cast<cif::validator &>(*db.get_validator());
if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
{
auto dssp_extension = cif::load_resource("dssp-extension.dic");
if (dssp_extension)
cif::extend_dictionary(validator, *dssp_extension);
}
.. note::
In the example above we're loading the data using :doc:`/resources`. See the documentation on that for more information.
If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: :ref:`_atom_site-label`):
.. code-block:: cpp
auto rh = atom_site.front();
rh["Cartn_x"] = "foo";
Linking
-------
Based on information recorded in dictionary files (see :ref:`Validation`) you can locate linked records in parent or child categories.
To make this example not too complex, lets assume the following example file:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
And we have a dictionary containing the following link definition:
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_1 1 '_cat_1.name' '_cat_2.name' cat_2
So, there are links between *cat_1* and *cat_2* based on the value in items named *name*. Using this information, we can now locate children and parents:
.. code-block:: cpp
// Assuming the file was loaded in f:
auto &cat1 = f.front()["cat_1"];
auto &cat2 = f.front()["cat_2"];
auto &cat3 = f.front()["cat_3"];
// Loop over all ape's in cat2
for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
std::cout << r.get<std::string>("desc") << '\n';
Updating a value in an item in a parent category will update the corresponding value in all related children:
.. code-block:: cpp
auto r1 = cat1.find1("id"_key == 1);
r1["name"] = "aapje";
auto rs1 = cat2.find("name"_key == "aapje");
assert(rs1.size() == 2);
However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:
.. code-block:: cpp
auto r2 = cat2.find1("id"_key == 3);
r2["name"] = "wim";
assert(f.is_valid() == false);
So you have to fix this yourself by inserting a new item in cat1 with the new value.
.. _splitting-rows:
Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
loop_
_cat_3.id
_cat_3.name
_cat_3.num
1 aap 1
2 aap 2
And we have a dictionary containing the following link definition (reversed compared to the previous example):
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
cat_3 1 '_cat_3.num' '_cat_2.num' cat_2
So *cat3* is a parent of *cat2* and *cat2* is a parent of *cat1*. Now, if you change the *name* value of the first row of *cat3* to 'aapje', the corresponding row in *cat2* is updated as well. But when you update *cat2* you have to update *cat1* too. And simply changing the name field in row 1 of *cat1* is wrong. The default behaviour in libcifpp is to split the record in *cat1* and have a new child with the new name whereas the other remains as is.
The new *cat1* will thus be like:
.. code-block:: cif
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aapje Aap
2 noot Noot
3 mies Mies
5 aap Aap
|