File: tutorial.rst

package info (click to toggle)
python-curies 0.9.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 804 kB
  • sloc: python: 3,617; makefile: 14
file content (882 lines) | stat: -rw-r--r-- 32,683 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
Getting Started
===============
Loading a Context
-----------------
There are several ways to load a context with this package, including:

1. pre-defined contexts
2. contexts encoded in the standard prefix map format
3. contexts encoded in the standard JSON-LD context format
4. contexts encoded in the extended prefix map format

Loading a pre-defined context
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There exist many registries of semantic spaces that include CURIE
prefixes, URI prefixes, sometimes synonyms, and other associated
metadata. The Bioregistry provides a
`detailed overview <https://bioregistry.io/related>`_ of the registries available.
This package exposes a few high quality registries that are internally consistent
(i.e., are bijective).

==============  ==========================================
Name            Function
==============  ==========================================
Bioregistry     :func:`curies.get_bioregistry_converter`
OBO Foundry     :func:`curies.get_obo_converter`
Prefix Commons  :func:`curies.get_prefixcommons_converter`
Monarch         :func:`curies.get_monarch_converter`
Gene Ontology   :func:`curies.get_go_converter`
==============  ==========================================

These functions can be called directly to instantiate the :class:`curies.Converter`
class, which is used for compression, expansion, standardization, and other operations
below.

.. code-block:: python

    import curies

    # Uses the Bioregistry, an integrative, comprehensive registry
    bioregistry_converter = curies.get_bioregistry_converter()

    # Uses the OBO Foundry, a registry of ontologies
    obo_converter = curies.get_obo_converter()

    # Uses the Monarch Initiative project-specific context
    monarch_converter = curies.get_monarch_converter()

Loading Prefix Maps
~~~~~~~~~~~~~~~~~~~
A prefix map is a dictionary whose keys are CURIE prefixes and values are URI prefixes. An abridged example
using OBO Foundry preferred CURIE prefixes and URI prefixes is

.. code-block:: json

   {
       "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
       "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
       "GO": "http://purl.obolibrary.org/obo/GO_"
   }

Prefix maps can be loaded using the :func:`curies.load_prefix_map`. First,
a prefix map can be loaded directly from a Python data structure like in

.. code-block:: python

    import curies

    prefix_map = {
        "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_"
    }
    converter = curies.load_prefix_map(data)

This function also accepts a string with a HTTP, HTTPS, or FTP path to a remote file as well as a local file path.

.. warning::

    Ideally, prefix maps are *bijective*, meaning that both the keys and values are unique.
    The Python dictionary data structure ensures that keys are unique, but sometimes values are repeated. For example,
    the CURIE prefixes ``DC`` and ``DCTERMS`` are often used interchangeably with the URI prefix for
    the `Dublin Core Metadata Initiative Terms <https://www.dublincore.org/specifications/dublin-core/dcmi-terms>`_.
    Therefore, many prefix maps are not bijective like

    .. code-block:: json

       {
           "DC": "http://purl.org/dc/terms/",
           "DCTERMS": "http://purl.org/dc/terms/"
       }

    If you load a prefix map that is not bijective, it can have unintended consequences. Therefore,
    an error is thrown. You can pass ``strict=False`` if you don't mind having unsafe data. A better data
    structure for situations when there can be CURIE synonyms or even URI prefix synonyms is
    the *extended prefix map* (see below).

    If you're not in a position where you can fix data issues upstream, you can try using the
    :func:`curies.upgrade_prefix_map` to extract a canonical extended prefix map from a non-bijective
    prefix map.

Loading Extended Prefix Maps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Extended prefix maps (EPMs) address the issues with prefix maps by including explicit
fields for CURIE prefix synonyms and URI prefix synonyms while maintaining an explicit
field for the preferred CURIE prefix and URI prefix. An abbreviated example (just
containing an entry for ChEBI) looks like:

.. code-block:: json

   [
       {
           "prefix": "CHEBI",
           "uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
           "prefix_synonyms": ["chebi"],
           "uri_prefix_synonyms": [
               "https://identifiers.org/chebi:"
           ]
       }
   ]

Extended prefix maps can be loaded with :func:`curies.load_extended_prefix_map`. First,
a prefix map can be loaded directly from a Python data structure like in

.. code-block:: python

    import curies

    epm = [
        {
            "prefix": "CHEBI",
            "uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
            "prefix_synonyms": ["chebi"],
            "uri_prefix_synonyms": [
                "https://identifiers.org/chebi:"
            ]
        }
    ]
    converter = curies.load_extended_prefix_map(data)

An extended prefix map can be loaded from a remote file via HTTP, HTTPS, or FTP with

.. code-block:: python

    import curies

    url = "https://raw.githubusercontent.com/mapping-commons/sssom-py/master/src/sssom/obo.epm.json"
    converter = curies.load_extended_prefix_map(url)

Similarly, an extended prefix map stored in a local file can be loaded with the following.
This works with both :class:`pathlib.Path` and vanilla strings.

.. code-block:: python

    from pathlib import Path
    from urllib.request import urlretrieve

    import curies

    url = "https://raw.githubusercontent.com/mapping-commons/sssom-py/master/src/sssom/obo.epm.json"
    path = Path.home().joinpath("Downloads", "obo.epm.json")
    urlretrieve(url, path)
    converter = curies.load_extended_prefix_map(path)

Loading JSON-LD Contexts
~~~~~~~~~~~~~~~~~~~~~~~~
A `JSON-LD context <https://niem.github.io/json/reference/json-ld/context/>`_
allows for embedding of a simple prefix map within a linked data document.
They can be identified hiding in all sorts of JSON (or JSON-like) content
with the key ``@context``. JSON-LD contexts can be loaded using :meth:`curies.Converter.from_jsonld`.

First, a JSON-LD context can be loaded directly from a Python data structure like in

.. code-block:: python

    import curies

    data = {
        "@context": {
            "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_"
        }
    }
    converter = curies.load_jsonld_context(data)

.. note::

    This correctly handles the more complex data structures including ``@prefix`` noted in
    `here <https://github.com/OBOFoundry/OBOFoundry.github.io/issues/2410>`_.

A JSON-LD context can be loaded from a remote file via HTTP, HTTPS, or FTP with

.. code-block:: python

    import curies

    url = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"
    converter = curies.load_jsonld_context(url)

A JSON-LD context stored in a local file can be loaded with the following.
This works with both :class:`pathlib.Path` and vanilla strings.

.. code-block:: python

    from pathlib import Path
    from urllib.request import urlretrieve

    import curies

    url = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"
    path = Path.home().joinpath("Downloads", "semweb.context.jsonld")
    urlretrieve(url, path)
    converter = curies.load_jsonld_context(path)

Loading SHACL
~~~~~~~~~~~~~
The `shapes constraint language (SHACL) <https://bioregistry.io/sh>`_ can be used to represent
prefix maps directly in RDF using the `sh:prefix` and `sh:namespace` predicates. Therefore, the
simple ChEBI example from before can be represented using

.. code-block:: turtle

    @prefix sh: <http://www.w3.org/ns/shacl#> .

    [
        sh:declare
        [
            sh:prefix "CHEBI" ;
            sh:namespace "http://purl.obolibrary.org/obo/CHEBI_" .
        ] .
    ]

A SHACL context can be loaded from a remote file via HTTP, HTTPS, or FTP with

.. code-block:: python

    import curies

    url = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.ttl"
    converter = curies.load_shacl(url)


A SHACL context stored in a local file can be loaded with the following.
This works with both :class:`pathlib.Path` and vanilla strings.

.. code-block:: python

    from pathlib import Path
    from urllib.request import urlretrieve

    import curies

    url = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.ttl"
    path = Path.home().joinpath("Downloads", "semweb.context.ttl")
    urlretrieve(url, path)
    converter = curies.load_shacl(path)

Introspecting on a Context
--------------------------
After loading a context, it's possible to get certain information out of the converter. For example, if you want to
get all of the CURIE prefixes from the converter, you can use :meth:`Converter.get_prefixes`:

.. code-block:: python

    import curies

    converter = curies.get_bioregistry_converter()
    prefixes = converter.get_prefixes()
    assert 'chebi' in prefixes
    assert 'CHEBIID' not in prefixes, "No synonyms are included by default"

    prefixes = converter.get_prefixes(include_synonyms=True)
    assert 'chebi' in prefixes
    assert 'CHEBIID' in prefixes

Similarly, the URI prefixes can be extracted with :meth:`Converter.get_uri_prefixes` like in:

.. code-block:: python

    import curies

    converter = curies.get_bioregistry_converter()
    uri_prefixes = converter.get_uri_prefixes()
    assert 'http://purl.obolibrary.org/obo/CHEBI_'' in prefixes
    assert 'https://bioregistry.io/chebi:' not in prefixes, "No synonyms are included by default"

    uri_prefixes = converter.get_uri_prefixes(include_synonyms=True)
    assert 'http://purl.obolibrary.org/obo/CHEBI_'' in prefixes
    assert 'https://bioregistry.io/chebi:' in prefixes

It's also possible to get a bijective prefix map, i.e., a dictionary from primary CURIE prefixes
to primary URI prefixes. This is useful for compatibility with legacy systems which assume simple prefix maps.
This can be done with the ``bimap`` property like in the following:

.. code-block:: python

    import curies

    converter = curies.get_bioregistry_converter()
    prefix_map = converter.bimap
    >>> prefix_map['chebi']
    'http://purl.obolibrary.org/obo/CHEBI_'

Modifying a Context
-------------------
Incremental Converters
~~~~~~~~~~~~~~~~~~~~~~
As suggested in `#13 <https://github.com/cthoyt/curies/issues/33>`_, new data
can be added to an existing converter with either
:meth:`curies.Converter.add_prefix` or :meth:`curies.Converter.add_record`.
For example, a CURIE and URI prefix for HGNC can be added to the OBO Foundry
converter with the following:

.. code-block::

    import curies

    converter = curies.get_obo_converter()
    converter.add_prefix("hgnc", "https://bioregistry.io/hgnc:")

Similarly, an empty converter can be instantiated using an empty list
for the `records` argument and prefixes can be added one at a time
(note this currently does not allow for adding synonyms separately):

.. code-block::

    import curies

    converter = curies.Converter(records=[])
    converter.add_prefix("hgnc", "https://bioregistry.io/hgnc:")

A more flexible version of this operation first involves constructing
a :class:`curies.Record` object:

.. code-block::

    import curies

    converter = curies.get_obo_converter()
    record = curies.Record(prefix="hgnc", uri_prefix="https://bioregistry.io/hgnc:")
    converter.add_record(record)

By default, both of these operations will fail if the new content conflicts with existing content.
If desired, the ``merge`` argument can be set to true to enable merging. Further, checking
for conflicts and merging can be made to be case insensitive by setting ``case_sensitive`` to false.

Such a merging strategy is the basis for wholesale merging of converters, described below.

Chaining and Merging
~~~~~~~~~~~~~~~~~~~~
This package implements a faultless chain operation :func:`curies.chain` that is configurable for case
sensitivity and fully considers all synonyms.

:func:`curies.chain` prioritizes based on the order given. Therefore, if two prefix maps
having the same prefix but different URI prefixes are given, the first is retained. The second
is retained as a synonym

.. code-block:: python

    import curies

    c1 = curies.load_prefix_map({"GO": "http://purl.obolibrary.org/obo/GO_"})
    c2 = curies.load_prefix_map({"GO": "https://identifiers.org/go:"})
    converter = curies.chain([c1, c2])

    >>> converter.expand("GO:1234567")
    'http://purl.obolibrary.org/obo/GO_1234567'
    >>> converter.compress("http://purl.obolibrary.org/obo/GO_1234567")
    'GO:1234567'
    >>> converter.compress("https://identifiers.org/go:1234567")
    'GO:1234567'

Chain is the perfect tool if you want to override parts of an existing extended
prefix map. For example, if you want to use most of the Bioregistry, but you
would like to specify a custom URI prefix (e.g., using Identifiers.org), you
can do the following

.. code-block:: python

    import curies

    overrides = curies.load_prefix_map({"pubmed": "https://identifiers.org/pubmed:"})
    bioregistry_converter = curies.get_bioregistry_converter()
    converter = curies.chain([overrides, bioregistry_converter])

    >>> converter.expand("pubmed:1234")
    'https://identifiers.org/pubmed:1234'

Subsetting
~~~~~~~~~~
A subset of a converter can be extracted using :meth:`curies.Converter.get_subconverter`.
This functionality is useful for downstream applications like the following:

1. You load a comprehensive extended prefix map, e.g., from the Bioregistry using
   :func:`curies.get_bioregistry_converter()`.
2. You load some data that conforms to this prefix map by convention. This
   is often the case for semantic mappings stored in the
   `SSSOM format <https://github.com/mapping-commons/sssom>`_.
3. You extract the list of prefixes *actually* used within your data
4. You subset the detailed extended prefix map to only include prefixes
   relevant for your data
5. You make some kind of output of the subsetted extended prefix map to
   go with your data. Effectively, this is a way of reconciling data. This
   is especially effective when using the Bioregistry or other comprehensive
   extended prefix maps.

Here's a concrete example of doing this (which also includes a bit of data science)
to do this on the SSSOM mappings from the `Disease Ontology <https://disease-ontology.org/>`_
project.

>>> import curies
>>> import pandas as pd
>>> import itertools as itt
>>> commit = "faca4fc335f9a61902b9c47a1facd52a0d3d2f8b"
>>> url = f"https://raw.githubusercontent.com/mapping-commons/disease-mappings/{commit}/mappings/doid.sssom.tsv"
>>> df = pd.read_csv(url, sep="\t", comment='#')
>>> prefixes = {
...     curies.Reference.from_curie(curie).prefix
...     for column in ["subject_id", "predicate_id", "object_id"]
...     for curie in df[column]
... }
>>> converter = curies.get_bioregistry_converter()
>>> slim_converter = converter.get_subconverter(prefixes)

Writing a Context
-----------------
After loading and modifying a context, there are several functions for writing
a context to a file:

- :func:`curies.write_extended_prefix_map`
- :func:`curies.write_jsonld_context`
- :func:`curies.write_shacl`
- :func:`curies.write_tsv`

Here's a self-contained example on how this works:

.. code-block:: python

        import curies
        converter = curies.load_prefix_map({
            "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
        })
        curies.write_shacl(converter, "example_shacl.ttl")

which outputs the following file:

.. code-block::

    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    [
      sh:declare
        [ sh:prefix "CHEBI" ; sh:namespace "http://purl.obolibrary.org/obo/CHEBI_"^^xsd:anyURI  ]
    ] .

Faultless handling of overlapping URI prefixes
----------------------------------------------
Most implementations of URI parsing iterate through the CURIE prefix/URI prefix pairs
in a prefix map, check if the given URI starts with the URI prefix, then returns the
CURIE prefix if does. This becomes an issue when a given URI can match multiple
overlapping URI prefixes in the prefix map. For example, the ChEBI URI prefix is
``http://purl.obolibrary.org/obo/CHEBI_`` and the more generic OBO URI prefix
is ``http://purl.obolibrary.org/obo/``. Therefore, it is possible that a URI could be
compressed two different ways, depending on the order of iteration.

:mod:`curies` addresses this by using the `trie <https://en.wikipedia.org/wiki/Trie>`_
data structure, which indexes potentially overlapping strings and allows for efficient
lookup of the longest matching string (e.g., the URI prefix) in the tree to a given target string
(e.g., the URI).

.. image:: img/trie.png
   :width: 200px
   :alt: A graphical depiction of a trie. Reused under the CC0 license from Wikipedia.

This has two benefits. First, it is correct. Second, searching the trie data structure can be done
in sublinear time while iterating over a prefix map can only be done in linear time. When processing
a lot of data, this makes a meaningful difference!

The following code demonstrates that the scenario above. It will always return the correct
CURIE ``CHEBI:1`` instead of the incorrect CURIE ``OBO:CHEBI_1``, regardless of the order of
the dictionary, iteration, or any other factors.

.. code-block::

    import curies

    converter = curies.load_prefix_map({
        "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
        "OBO": "http://purl.obolibrary.org/obo/
    })

    >>> converter.compress("http://purl.obolibrary.org/obo/CHEBI_1")
    'CHEBI:1'

Standardization
---------------
The :class:`curies.Converter` data structure supports prefix and URI prefix synonyms.
The following example demonstrates
using these synonyms to support standardizing prefixes, CURIEs, and URIs. Note below,
the colloquial prefix `gomf`, sometimes used to represent the subspace in the
`Gene Ontology (GO) <https://obofoundry.org/ontology/go>`_ corresponding to molecular
functions, is upgraded to the preferred prefix, ``GO``.

.. code-block::

    from curies import Converter, Record

    converter = Converter([
        Record(
            prefix="GO",
            prefix_synonyms=["gomf", "gocc", "gobp", "go", ...],
            uri_prefix="http://purl.obolibrary.org/obo/GO_",
            uri_prefix_synonyms=[
                "http://amigo.geneontology.org/amigo/term/GO:",
                "https://identifiers.org/GO:",
                ...
            ],
        ),
        # And so on
        ...
    ])

    >>> converter.standardize_prefix("gomf")
    'GO'
    >>> converter.standardize_curie('gomf:0032571')
    'GO:0032571'
    >>> converter.standardize_uri('http://amigo.geneontology.org/amigo/term/GO:0032571')
    'http://purl.obolibrary.org/obo/GO_0032571'

Note: non-standard URIs (i.e., ones based on URI prefix synonyms) can still be parsed with
:meth:`curies.Converter.parse_uri` and compressed
into CURIEs with :meth:`curies.Converter.compress`.

Bulk Operations
---------------
Expansion, compression, and standardization operations can be done in bulk to all rows
in a :class:`pandas.DataFrame` using the following examples.

Bulk Compress URIs
~~~~~~~~~~~~~~~~~~
In order to demonstrate bulk operations using :meth:`curies.Converter.pd_compress`,
we construct a small dataframe:

.. code-block:: python

    import curies
    import pandas as pd

    df = pd.DataFrame({"uri": [
        "http://purl.obolibrary.org/obo/GO_0000010",
        "http://purl.obolibrary.org/obo/GO_0000011",
        "http://gudt.org/schema/gudt/baseCGSUnitDimensions",
        "http://qudt.org/schema/qudt/conversionMultiplier",
    ]})

    converter = curies.get_obo_converter()
    converter.pd_compress(df, column="uri", target_column="curie")

Results will look like:

=================================================  ==========
uri                                                curie
=================================================  ==========
http://purl.obolibrary.org/obo/GO_0000010          GO:0000010
http://purl.obolibrary.org/obo/GO_0000011          GO:0000011
http://gudt.org/schema/gudt/baseCGSUnitDimensions
http://qudt.org/schema/qudt/conversionMultiplier
=================================================  ==========

Note that some URIs are not handled by the extended prefix map inside the converter, so if you want
to pass those through, use ``passthrough=True`` like in

.. code-block:: python

    converter.pd_compress(df, column="uri", target_column="curie", passthrough=True)

=================================================  =================================================
uri                                                curie
=================================================  =================================================
http://purl.obolibrary.org/obo/GO_0000010          GO:0000010
http://purl.obolibrary.org/obo/GO_0000011          GO:0000011
http://gudt.org/schema/gudt/baseCGSUnitDimensions  http://gudt.org/schema/gudt/baseCGSUnitDimensions
http://qudt.org/schema/qudt/conversionMultiplier   http://qudt.org/schema/qudt/conversionMultiplier
=================================================  =================================================

The keyword ``ambiguous=True`` can be passed if the source column can either be a CURIE
or URI. Then, the semantics of compression are used from :meth:`curies.Converter.compress_or_standardize`.

Bulk Expand CURIEs
~~~~~~~~~~~~~~~~~~
In order to demonstrate bulk operations using :meth:`curies.Converter.pd_expand`,
we construct a small dataframe used in conjunction with the OBO converter (which
only includes OBO Foundry ontology URI prefix expansions):

.. code-block:: python

    import curies
    import pandas as pd

    df = pd.DataFrame({"curie": [
        "GO:0000001",
        "skos:exactMatch",
    ]})

    converter = curies.get_obo_converter()
    converter.pd_expand(df, column="curie", target_column="uri")

===============  =========================================
curie            uri
===============  =========================================
GO:0000001       http://purl.obolibrary.org/obo/GO_0000001
skos:exactMatch
===============  =========================================

Note that since ``skos`` is not in the OBO Foundry extended prefix map, no results are placed in
the ``uri`` column. If you wan to pass through elements that can't be expanded, you can use
``passthrough=True`` like in:

.. code-block:: python

    converter.pd_expand(df, column="curie", target_column="uri", passthrough=True)

===============  =========================================
curie            uri
===============  =========================================
GO:0000001       http://purl.obolibrary.org/obo/GO_0000001
skos:exactMatch  skos:exactMatch
===============  =========================================

Alternatively, chaining together multiple converters (such as the Bioregistry) will yield better results

.. code-block:: python

    import curies
    import pandas as pd

    df = pd.DataFrame({"curie":  [
        "GO:0000001",
        "skos:exactMatch",
    ]})

    converter = curies.chain([
        curies.get_obo_converter(),
        curies.get_bioregistry_converter(),
    ])
    converter.pd_expand(df, column="curie", target_column="uri")

===============  ==============================================
curie            uri
===============  ==============================================
GO:0000001       http://purl.obolibrary.org/obo/GO_0000001
skos:exactMatch  http://www.w3.org/2004/02/skos/core#exactMatch
===============  ==============================================

The keyword ``ambiguous=True`` can be passed if the source column can either be a CURIE
or URI. Then, the semantics of compression are used from :meth:`curies.Converter.compress_or_standardize`.

Bulk Standardizing Prefixes
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `Gene Ontology (GO) Annotations Database <https://geneontology.org/docs/go-annotations/>`_
distributes its file where references to proteins from the `Universal Protein Resource (UniProt)
<https://www.uniprot.org/>`_ use the prefix ``UniProtKB``. When using the Bioregistry's extended prefix map,
these prefixes should be standardized to ``uniprot`` with :meth:`curies.Converter.pd_standardize_prefix`.
This can be done in-place with the following:

.. code-block:: python

    import pandas
    import curies

    # the first column represents the prefix for the protein,
    # called "DB" in the schema. This is where we want to upgrade
    # `UniProtKB` to `uniprot`
    df = pd.read_csv(
        "http://geneontology.org/gene-associations/goa_human.gaf.gz",
        sep="\t",
        comment="!",
        header=None,
    )

    converter = curies.get_bioregistry_converter()
    converter.pd_standardize_prefix(df, column=0)

The ``target_column`` keyword can be given if you don't want to overwrite the original.

Bulk Standardizing CURIEs
~~~~~~~~~~~~~~~~~~~~~~~~~
Using the same example data from GO, the sixth column contains CURIE for references such as
`GO_REF:0000043 <https://bioregistry.io/go.ref:0000043>`_. When using the Bioregistry's extended prefix map,
these CURIEs' prefixes should be standardized to ``go.ref`` with :meth:`curies.Converter.pd_standardize_curie`.
This can be done in-place with the following:

.. code-block:: python

    import pandas
    import curies

    df = pd.read_csv(
        "http://geneontology.org/gene-associations/goa_human.gaf.gz",
        sep="\t",
        comment="!",
        header=None,
    )

    converter = curies.get_bioregistry_converter()
    converter.pd_standardize_curie(df, column=5)

The ``target_column`` keyword can be given if you don't want to overwrite the original.

File Operations
~~~~~~~~~~~~~~~
Apply in bulk to a CSV file with :meth:`curies.Converter.file_expand` and
:meth:`curies.Converter.file_compress` (defaults to using tab separator):

.. code-block:: python

    import curies

    path = ...
    converter = curies.get_obo_converter()
    # modifies file in place
    converter.file_compress(path, column=0)
    # modifies file in place
    converter.file_expand(path, column=0)

Like with the Pandas operations, the keyword ``ambiguous=True``` can be set
when entries can either be CURIEs or URIs.

Tools for Developers and Semantic Engineers
-------------------------------------------

Working with strings that might be a URI or a CURIE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes, it's not clear if a string is a CURIE or a URI. While
the `SafeCURIE syntax <https://www.w3.org/TR/2010/NOTE-curie-20101216/#P_safe_curie>`_
is intended to address this, it's often overlooked.

CURIE and URI Checks
********************
The first way to handle this ambiguity is to be able to check if the string is a CURIE
or a URI. Therefore, each :class:`curies.Converter`
comes with functions for checking if a string is a CURIE (:meth:`curies.Converter.is_curie`)
or a URI (:meth:`curies.Converter.is_uri`) under its definition.

.. code-block:: python

    import curies

    converter = curies.get_obo_converter()

    >>> converter.is_curie("GO:1234567")
    True
    >>> converter.is_curie("http://purl.obolibrary.org/obo/GO_1234567")
    False
    # This is a valid CURIE, but not under this converter's definition
    >>> converter.is_curie("pdb:2gc4")
    False

    >>> converter.is_uri("http://purl.obolibrary.org/obo/GO_1234567")
    True
    >>> converter.is_uri("GO:1234567")
    False
    # This is a valid URI, but not under this converter's definition
    >>> converter.is_uri("http://proteopedia.org/wiki/index.php/2gc4")
    False

Extended Expansion and Compression
**********************************
The :meth:`curies.Converter.expand_or_standardize` extends the CURIE expansion function to handle the situation where
you might get passed a CURIE or a URI. If it's a CURIE, expansions happen with the normal
rules. If it's a URI, it tries to standardize it.

.. code-block:: python

    from curies import Converter, Record
    converter = Converter.from_extended_prefix_map([
        Record(
            prefix="CHEBI",
            prefix_synonyms=["chebi"],
            uri_prefix="http://purl.obolibrary.org/obo/CHEBI_",
            uri_prefix_synonyms=["https://identifiers.org/chebi:"],
        ),
    ])

    # Expand CURIEs
    >>> converter.expand_or_standardize("CHEBI:138488")
    'http://purl.obolibrary.org/obo/CHEBI_138488'
    >>> converter.expand_or_standardize("chebi:138488")
    'http://purl.obolibrary.org/obo/CHEBI_138488'

    # standardize URIs
    >>> converter.expand_or_standardize("http://purl.obolibrary.org/obo/CHEBI_138488")
    'http://purl.obolibrary.org/obo/CHEBI_138488'
    >>> converter.expand_or_standardize("https://identifiers.org/chebi:138488")
    'http://purl.obolibrary.org/obo/CHEBI_138488'

    # Handle cases that aren't valid w.r.t. the converter
    >>> converter.expand_or_standardize("missing:0000000")
    >>> converter.expand_or_standardize("https://example.com/missing:0000000")

A similar workflow is implemented in :meth:`curies.Converter.compress_or_standardize` for compressing URIs
where a CURIE might get passed.

.. code-block:: python

    from curies import Converter, Record
    converter = Converter.from_extended_prefix_map([
        Record(
            prefix="CHEBI",
            prefix_synonyms=["chebi"],
            uri_prefix="http://purl.obolibrary.org/obo/CHEBI_",
            uri_prefix_synonyms=["https://identifiers.org/chebi:"],
        ),
    ])

    # Compress URIs
    >>> converter.compress_or_standardize("http://purl.obolibrary.org/obo/CHEBI_138488")
    'CHEBI:138488'
    >>> converter.compress_or_standardize("https://identifiers.org/chebi:138488")
    'CHEBI:138488'

    # standardize CURIEs
    >>> converter.compress_or_standardize("CHEBI:138488")
    'CHEBI:138488'
    >>> converter.compress_or_standardize("chebi:138488")
    'CHEBI:138488'

    # Handle cases that aren't valid w.r.t. the converter
    >>> converter.compress_or_standardize("missing:0000000")
    >>> converter.compress_or_standardize("https://example.com/missing:0000000")

Reusable data structures for references
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While URIs and CURIEs are often represented as strings, for many programmatic applications,
it is preferable to pre-parse them into a pair of prefix corresponding to a semantic space
and local unique identifier from that semantic space. ``curies`` provides two complementary
data structures for representing these pairs:

1. :mod:`curies.ReferenceTuple` - a native Python :class:`typing.NamedTuple` that is
   storage efficient, can be hashed, can be accessed by slicing, unpacking, or via attributes.
2. :mod:`curies.Reference` - a :class:`pydantic.BaseModel` that can be used directly
   with other Pydantic models, FastAPI, SQLModel, and other JSON-schemata

Internally, :mod:`curies.ReferenceTuple` is used, but there is a big benefit to standardizing
this data type and providing utilities to flip-flop back and forth to :mod:`curies.Reference`,
which is preferable in data validation (such as when parsing OBO ontologies)

Integrating with :mod:`rdflib`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RDFlib is a pure Python package for manipulating RDF data. The following example shows how to bind the
extended prefix map from a :class:`curies.Converter` to a graph (:class:`rdflib.Graph`).

.. code-block::

    import curies, rdflib, rdflib.namespace

    converter = curies.get_obo_converter()
    graph = rdflib.Graph()

    for prefix, uri_prefix in converter.bimap.items():
        graph.bind(prefix, rdflib.Namespace(uri_prefix))

A more flexible approach is to instantiate a namespace manager (:class:`rdflib.namespace.NamespaceManager`)
and bind directly to that.

.. code-block::

    import curies, rdflib

    converter = curies.get_obo_converter()
    namespace_manager = rdflib.namespace.NamespaceManager(rdflib.Graph())

    for prefix, uri_prefix in converter.bimap.items():
        namespace_manager.bind(prefix, rdflib.Namespace(uri_prefix))

URI references for use in RDFLib's graph class can be constructed from
CURIEs using a combination of :meth:`curies.Converter.expand` and :class:`rdflib.URIRef`.

.. code-block::

    import curies, rdflib

    converter = curies.get_obo_converter()

    uri_ref = rdflib.URIRef(converter.expand("CHEBI:138488", strict=True))