Overview
========
Background on Systems Biology Modeling
--------------------------------------
Biological Expression Language (BEL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Biological Expression Language (BEL) is a domain-specific language that enables the expression of complex molecular
relationships and their context in a machine-readable form. Its simple grammar and expressive power have led to its
successful use to describe complex disease networks with several thousands of relationships. For a detailed
explanation, see the BEL `1.0 <https://github.com/OpenBEL/language/raw/master/docs/version_1.0/bel_specification_version_1.0.pdf>`_,
`2.0 <https://github.com/OpenBEL/language/raw/master/docs/version_2.0/bel_specification_version_2.0.pdf>`_,
and `2.0+ <https://biological-expression-language.github.io>`_ specifications.
BEL Community Links
~~~~~~~~~~~~~~~~~~~
- BEL `Community Portal <https://biological-expression-language.github.io/>`_
- BEL `Google Group <https://groups.google.com/forum/#!forum/openbel-discuss>`_
Design Considerations
---------------------
Missing Namespaces and Improper Names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The use of openly shared controlled vocabularies (namespaces) within BEL facilitates the exchange and consistency of
information. Finding the correct :code:`namespace:name` pair is often a difficult part of the curation process.
Outdated Namespaces
~~~~~~~~~~~~~~~~~~~
BEL provides a variety of `namespaces <https://biological-expression-language.github.io/identifiers/>`_
covering each of the BEL function types. Selventa used to provide BEL namespace files generated by the deprecated
project at ``https://github.com/OpenBEL/resource-generator`` and hosted at the abandoned website
``http://www.belframework.org/``. Newer versions of these namespaces can be found at
https://github.com/pharmacome/conso/tree/master/external.
Generating New Namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~
In some cases, it is appropriate to design a new namespace, using the
`custom namespace specification <http://openbel-framework.readthedocs.io/en/latest/tutorials/building_custom_namespaces.html>`_
provided by the OpenBEL Framework. Packages for generating namespace, annotation, and knowledge resources have
been grouped in the `Bio2BEL <https://github.com/bio2bel>`_ organization on GitHub.
Synonym Issues
~~~~~~~~~~~~~~
Due to the huge number of terms across many namespaces, it's difficult for curators to know the domain-specific
synonyms that obscure the controlled/preferred term. However, the issue of synonym resolution and semantic searching
has already been generally solved by the use of ontologies. Besides a controlled vocabulary, they also provide a
hierarchical model of knowledge, synonyms with cross-references to databases and other ontologies, and other
information for semantic reasoning. Ontologies in the biomedical domain can be found at `OBO <http://obofoundry.org>`_ and
`EMBL-EBI OLS <http://www.ebi.ac.uk/ols/index>`_.
Additionally, as a tool for curators, the EMBL-EBI Ontology Lookup Service (OLS) allows for semantic searching. Simple
queries for the terms 'mitochondrial dysfunction' and 'amyloid beta-peptides' immediately returned results from
relevant ontologies, and ended a long debate over how to represent these objects within BEL. EMBL-EBI also provides a
programmatic API to the OLS service, for searching terms (http://www.ebi.ac.uk/ols/api/search?q=folic%20acid) and
suggesting resolutions (http://www.ebi.ac.uk/ols/api/suggest?q=folic+acid).
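
As a minimal sketch, the two URLs above can be built with Python's standard library. The helper name ``ols_url`` and its ``plus_spaces`` flag are hypothetical, and only the two endpoints mentioned above are assumed to exist. No request is sent here, and the shape of the JSON response is not assumed.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the two example links above
OLS_API = "http://www.ebi.ac.uk/ols/api"


def ols_url(endpoint: str, query: str, plus_spaces: bool = False) -> str:
    """Build an OLS URL for the ``search`` or ``suggest`` endpoint (hypothetical helper)."""
    if plus_spaces:
        # urlencode's default quoting writes spaces as "+"
        params = urlencode({"q": query})
    else:
        # quote writes spaces as "%20"
        params = urlencode({"q": query}, quote_via=quote)
    return f"{OLS_API}/{endpoint}?{params}"


search_url = ols_url("search", "folic acid")
suggest_url = ols_url("suggest", "folic acid", plus_spaces=True)
print(search_url)
print(suggest_url)
```

Either URL could then be passed to ``urllib.request.urlopen`` or a similar HTTP client to retrieve results.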
Implementation
--------------
PyBEL is implemented using the PyParsing module, which provides flexibility and speed in parsing compared
to an implementation based on regular expressions. It also allows for the addition of parsing action hooks, which allow
the graph to be checked semantically at compile-time.
It uses SQLite to provide a consistent and lightweight caching system for external data, such as
namespaces, annotations, and ontologies, and SQLAlchemy to provide a cross-platform interface. The same data management
system is used to store graphs for high-performance querying.
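
The caching idea can be illustrated with a small sketch that uses only the standard-library ``sqlite3`` module rather than SQLAlchemy. The table layout and the function names are hypothetical and are not PyBEL's actual schema or API. The sketch only shows how a local SQLite store can answer whether a ``namespace:name`` pair is known without fetching the namespace again.

```python
import sqlite3

# Hypothetical schema for illustration only; PyBEL defines its own tables through SQLAlchemy
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE namespace_entry (keyword TEXT, name TEXT, PRIMARY KEY (keyword, name))"
)


def cache_namespace(keyword, names):
    """Store the names of a namespace locally so it need not be fetched again."""
    conn.executemany(
        "INSERT OR IGNORE INTO namespace_entry (keyword, name) VALUES (?, ?)",
        [(keyword, name) for name in names],
    )
    conn.commit()


def is_cached_name(keyword, name):
    """Check a namespace:name pair against the local cache."""
    row = conn.execute(
        "SELECT 1 FROM namespace_entry WHERE keyword = ? AND name = ?",
        (keyword, name),
    ).fetchone()
    return row is not None


cache_namespace("HGNC", ["NDUFB6", "GATA1", "ZBTB16", "MPL"])
print(is_cached_name("HGNC", "GATA1"))
print(is_cached_name("HGNC", "YFG1"))
```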
Extensions to BEL
-----------------
The PyBEL compiler is fully compliant with both BEL v1.0 and v2.0 and automatically upgrades legacy statements.
Additionally, PyBEL includes several additions to the BEL specification to enable expression of important concepts
in molecular biology that were previously missing and to facilitate integrating new data types. A short example is the
inclusion of protein oxidation in the default BEL namespace for protein modifications. Other, more elaborate additions
are outlined below.
Syntax for Epigenetics
~~~~~~~~~~~~~~~~~~~~~~
PyBEL introduces the gene modification function, gmod(), as a syntax for encoding epigenetic modifications. Its usage
mirrors the pmod() function for proteins and includes arguments for methylation.
For example, the methylation of NDUFB6 was found to be negatively correlated with its expression in a study of insulin
resistance and Type II diabetes. This can now be expressed in BEL as in the following statement:
``g(HGNC:NDUFB6, gmod(Me)) negativeCorrelation r(HGNC:NDUFB6)``
References:
- https://www.ncbi.nlm.nih.gov/pubmed/17948130
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4655260/
.. note::

    This syntax is currently under consideration as `BEP-0006 <https://github.com/belbio/bep/blob/bep-0006/docs/drafts/BEP-0006.md>`_.

Definition of Namespaces as Regular Expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BEL imposes the constraint that each identifier must be qualified with an enumerated namespace to enable semantic
interoperability and data integration. However, enumerating a namespace with potentially billions of names, such as
dbSNP, poses a computational issue. PyBEL introduces syntax for defining namespaces with a consistent pattern using a
regular expression to overcome this issue. For these namespaces, semantic validation can be performed in post-processing
against the underlying database. The dbSNP namespace can be defined with a syntax familiar to BEL annotation
definitions with regular expressions as follows:
``DEFINE NAMESPACE dbSNP AS PATTERN "rs[0-9]+"``
.. note::

    This syntax was proposed with `BEP-0005 <https://github.com/belbio/bep/blob/master/docs/published/BEP-0005.md>`_
    and has been officially accepted as part of the BEL 2.1 specification.

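
A minimal sketch of how a name could be checked against such a pattern is shown below. The function name is hypothetical, and the use of a full match (rather than a prefix match) is an assumption about the intended semantics, not a description of PyBEL's internals.

```python
import re

# Pattern copied from the DEFINE NAMESPACE example for dbSNP
DBSNP_PATTERN = re.compile(r"rs[0-9]+")


def matches_namespace_pattern(name: str, pattern: "re.Pattern" = DBSNP_PATTERN) -> bool:
    """Return True if the whole name matches the namespace pattern."""
    # Assumption: the entire name must match, so "rs12a" is rejected
    return pattern.fullmatch(name) is not None


print(matches_namespace_pattern("rs123456"))
print(matches_namespace_pattern("rs12a"))
```

A match only shows that the name is well-formed. Whether the identifier actually exists in dbSNP would still have to be checked against the database in post-processing.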
Definition of Resources using OWL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Versions of PyBEL up to 0.11.2 supported an alternative namespace definition using OWL. It is now recommended to either
generate namespace files with reproducible build scripts following the Bio2BEL framework, or to directly add them to
the database with the Bio2BEL :class:`bio2bel.manager.namespace_manager.NamespaceManagerMixin` extension.
Things to Consider
------------------
Do All Statements Need Supporting Text?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes! All statements must be minimally qualified with a citation and evidence (now called SupportingText in BEL 2.0) to
maintain provenance. Statements without evidence can't be traced to their source or evaluated independently from the
curator, so they are excluded.
Multiple Annotations
~~~~~~~~~~~~~~~~~~~~
All single annotations are considered as single-element sets. When multiple annotations are present, all are unioned
and attached to a given edge.
.. code::

    SET Citation = {"PubMed","Example Article","12345"}
    SET ExampleAnnotation1 = {"Example Value 11", "Example Value 12"}
    SET ExampleAnnotation2 = {"Example Value 21", "Example Value 22"}
    p(HGNC:YFG1) -> p(HGNC:YFG2)

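
One plausible reading of this rule can be sketched in plain Python. The dictionary layout below is an assumption made for illustration and is not PyBEL's internal edge representation. ``ExampleAnnotation3`` is a hypothetical single-valued annotation added to show the single-element case.

```python
def as_value_set(value):
    """Treat a single annotation value as a one-element set."""
    return {value} if isinstance(value, str) else set(value)


# Annotations that are active when the statement is parsed
active = {
    "ExampleAnnotation1": as_value_set(["Example Value 11", "Example Value 12"]),
    "ExampleAnnotation2": as_value_set(["Example Value 21", "Example Value 22"]),
    # Hypothetical single-valued annotation, not part of the BEL example
    "ExampleAnnotation3": as_value_set("Example Value 3"),
}

# The edge carries every value of every active annotation
edge = {
    "source": "p(HGNC:YFG1)",
    "relation": "->",
    "target": "p(HGNC:YFG2)",
    "annotations": {key: set(values) for key, values in active.items()},
}
print(sorted(edge["annotations"]))
```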
Namespace and Annotation Name Choices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:code:`*.belns` and :code:`*.belanno` configuration files include an entry called "Keyword" in their respective
[Namespace] and [AnnotationDefinition] sections. To maintain understandability between BEL documents, PyBEL
warns when the names given in :code:`*.bel` documents do not match their respective resources. For now, capitalization
is not considered, but in the future, PyBEL will also warn when capitalization is not properly stylized, like forgetting
the lowercase 'h' in "ChEMBL".
Why Not Nested Statements?
~~~~~~~~~~~~~~~~~~~~~~~~~~
BEL has different relationships for modeling direct and indirect causal relations.
Direct
******
- :code:`A => B` means that `A` directly increases `B` through a physical process.
- :code:`A =| B` means that `A` directly decreases `B` through a physical process.
Indirect
********
The relationship between two entities can be coded in BEL, even if the process is not well understood.
- :code:`A -> B` means that `A` indirectly increases `B`. There are hidden elements in `X` that mediate this interaction
through a pathway of direct interactions :code:`A (=> or =|) X_1 (=> or =|) ... X_n (=> or =|) B`, or through a set of
multiple pathways that constitute a network.
- :code:`A -| B` means that `A` indirectly decreases `B`. Like for :code:`A -> B`, this process involves hidden
components with varying activities.
Increasing Nested Relationships
*******************************
BEL also allows the object of a relationship to be another statement.
- :code:`A => (B => C)` means that `A` increases the process by which `B` increases `C`. The example in the BEL Spec
:code:`p(HGNC:GATA1) => (act(p(HGNC:ZBTB16)) => r(HGNC:MPL))` represents GATA1 directly increasing the process by
which ZBTB16 directly increases MPL. Earlier, a direct increase was taken to imply physical contact, so it's
reasonable to conclude that :code:`p(HGNC:GATA1) => act(p(HGNC:ZBTB16))`. The specification cites examples when `B`
is an activity that only is affected in the context of `A` and `C`. This is complicated enough that it is both
impractical to standardize during curation, and impractical to represent in a network.
- :code:`A -> (B => C)` can be interpreted by assuming that `A` indirectly increases `B`, and because of monotonicity,
conclude that :code:`A -> C` as well.
- :code:`A => (B -> C)` is more difficult to interpret, because it does not describe which part of process
:code:`B -> C` is affected by `A` or how. Is it that :code:`A => B`, and :code:`B => C`, so we conclude
:code:`A -> C`, or does it mean something else? Perhaps `A` impacts a different portion of the hidden process in
:code:`B -> C`. These statements are ambiguous enough that they should be written as just :code:`A => B`, and
:code:`B -> C`. If there is no literature evidence for the statement :code:`A -> C`, then it is not the job of the
curator to make this inference. Identifying statements of this kind might be the goal of a bioinformatics analysis of the
BEL network after compilation.
- :code:`A -> (B -> C)` introduces even more ambiguity, and it should not be used.
- :code:`A => (B =| C)` states `A` increases the process by which `B` decreases `C`. One interpretation of this
statement might be that :code:`A => B` and :code:`B =| C`. An analysis could infer :code:`A -| C`. Statements in the
form of :code:`A -> (B =| C)` can also be resolved this way, but with added ambiguity.
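
The inferences mentioned in this list follow a simple sign rule, sketched below. This is a naive illustration and not part of PyBEL: the function name is hypothetical, the result is always reported as indirect because the effect passes through `B`, and the rule says nothing about magnitude or precedence.

```python
# Sign of each causal relation symbol used in this section
SIGN = {"=>": 1, "->": 1, "=|": -1, "-|": -1}


def infer_relation(rel_ab: str, rel_bc: str) -> str:
    """Infer an indirect A-to-C relation from A rel B and B rel C (naive sign product)."""
    product = SIGN[rel_ab] * SIGN[rel_bc]
    # The inferred relation is indirect, since B mediates the effect
    return "->" if product > 0 else "-|"


print(infer_relation("=>", "=>"))
print(infer_relation("=>", "=|"))
```

An analysis step of this kind would run on the compiled network, in line with the point that drawing such inferences is not the curator's job.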
Decreasing Nested Relationships
*******************************
While we could agree on usage for the previous examples, the decrease of a nested statement introduces an unreasonable
amount of ambiguity.
- :code:`A =| (B => C)` could mean `A` decreases `B`, and `B` also increases `C`. Does this mean `A` decreases `C`, or does
it mean that `C` is still increased, but just not as much? Which of these statements takes precedence? Or do their
effects cancel? The same can be said about :code:`A -| (B => C)`, and with added ambiguity for indirect increases
:code:`A -| (B -> C)`.
- :code:`A =| (B =| C)` could mean that `A` decreases `B` and `B` decreases `C`. We could conclude that `A` increases
`C`, or could we again run into the problem of not knowing the precedence? The same is true for the indirect versions.
Recommendations for Use in PyBEL
********************************
Because the ambiguity of nested statements poses a great risk to clarity, PyBEL disables the usage of
nested statements by default. See the Input and Output section for different parser settings. At Fraunhofer
SCAI, curators resolved these statements to single statements to improve the precision and readability of our BEL
documents.
While most statements in the form :code:`A rel1 (B rel2 C)` can be reasonably expanded to :code:`A rel1 B` and
:code:`B rel2 C`, the few that cannot are the difficult-to-interpret cases that we need to be careful about in our
curation and later analyses.
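
The expansion described above can be sketched as follows. Statements are modeled as plain ``(subject, relation, object)`` tuples, which is an assumption made for illustration. The function name is hypothetical and does not correspond to a PyBEL parser setting.

```python
from typing import List, Tuple

Statement = Tuple[str, str, str]


def expand_nested(subject: str, rel1: str, nested: Statement) -> List[Statement]:
    """Expand ``A rel1 (B rel2 C)`` into ``A rel1 B`` and ``B rel2 C``."""
    b, rel2, c = nested
    return [(subject, rel1, b), (b, rel2, c)]


# The nested example from the BEL specification quoted earlier in this section
flat = expand_nested(
    "p(HGNC:GATA1)", "=>", ("act(p(HGNC:ZBTB16))", "=>", "r(HGNC:MPL)")
)
for statement in flat:
    print(" ".join(statement))
```

The sketch applies the expansion mechanically. Deciding whether a given nested statement is one of the cases that can be reasonably expanded remains a curation judgment.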
Why Not RDF?
~~~~~~~~~~~~
Current bel2rdf serialization tools build URLs with the OpenBEL Framework domain as a namespace, rather than respecting
the original namespaces of the original entities. This does not follow the best
practices of the semantic web, where URLs representing an object point to a real page with additional information.
For example, UniProt does an exemplary job of this. Ultimately, using non-standard URLs makes
harmonization and data integration difficult.
Additionally, the RDF format does not easily allow for the annotation of edges. A simple statement in BEL that one
protein up-regulates another can be easily represented in a triple in RDF, but when the annotations and citation from
the BEL document need to be included, this forces RDF serialization to use approaches like representing the statement
itself as a node. RDF was not designed to represent this type of information, but rather to describe resources
(hence its name). Furthermore, many blank nodes are introduced throughout the process. This makes RDF incredibly
difficult to understand or work with. Later, writing queries in SPARQL becomes very difficult because the data format
is complicated and the language is limited. For example, it would be incredibly complicated to write a query in SPARQL
to get the objects of statements from publications by a certain author.
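
The edge annotation problem can be made concrete with a small sketch that models triples as plain Python tuples, without any RDF library. The ``rdf:Statement``, ``rdf:subject``, ``rdf:predicate``, and ``rdf:object`` terms come from the standard RDF reification vocabulary. The ``citation`` and ``evidence`` predicates, the blank node label, and the example values are hypothetical.

```python
# A bare triple holds the relation but has no place for a citation or evidence
triple = ("p(HGNC:YFG1)", "increases", "p(HGNC:YFG2)")


def reify(statement_id, statement, annotations):
    """Represent a statement as a node so that annotations can be attached to it."""
    subject, predicate, obj = statement
    triples = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Each annotation becomes one more triple about the statement node
    triples.extend((statement_id, key, value) for key, value in annotations.items())
    return triples


reified = reify(
    "_:stmt1",
    triple,
    {"citation": "PubMed:12345", "evidence": "Example supporting text"},
)
print(len(reified))
```

One annotated edge turns into six triples around a blank node, and any query over citations must first traverse the statement node to reach the subject and object.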