.. testsetup:: *

    import gfapy
    gfa = gfapy.Gfa()

.. _header:

The Header
----------
GFA files may contain one or more header lines (record type: "H"). These
lines may appear anywhere in the file, not necessarily at the beginning.
Although the header may consist of multiple lines, its content refers to the
whole file. Therefore, in Gfapy the header is accessed through a single line
instance (available as the :attr:`~gfapy.lines.headers.Headers.header`
property). Header lines contain only tags. If no header line is present in the
Gfa, then the header line object is empty (i.e. it contains no tags).
Note that header lines cannot be connected to the Gfa like other lines (i.e.
calling :meth:`~gfapy.line.common.connection.Connection.connect` on them raises
an exception). Instead, they must be merged into the existing Gfa header, using
``add_line`` on the Gfa instance.

.. doctest::

    >>> gfa.add_line("H\tnn:f:1.0") #doctest: +ELLIPSIS
    >>> gfa.header.nn
    1.0
    >>> gfapy.Line("H\tnn:f:1.0").connect(gfa)
    Traceback (most recent call last):
    ...
    gfapy.error.RuntimeError: ...

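The idea that header lines scattered through a file describe one logical header can be sketched in plain Python. This is only an illustration that does not use Gfapy; the function name ``collect_header_tags`` and the tuple representation of a tag are assumptions made for this example.

```python
def collect_header_tags(gfa_text):
    """Gather the tags of every H line, wherever it appears in the text."""
    tags = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "H":
            # Each remaining field has the form NAME:TYPE:VALUE.
            tags.extend(tuple(field.split(":", 2)) for field in fields[1:])
    return tags


# The second H line comes after a segment line, yet both end up together.
example = "H\tVN:Z:1.0\nS\t1\tACGT\nH\tnn:f:1.0"
print(collect_header_tags(example))
# [('VN', 'Z', '1.0'), ('nn', 'f', '1.0')]
```

Values are kept as strings here; converting them according to their datatype is left out of the sketch.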
Multiple definitions of the predefined header tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the predefined tags (``VN`` and ``TS``), the presence of multiple
values in different lines is an error, unless the value is the same in
each instance (in which case the repeated definitions are ignored).

.. doctest::

    >>> gfa.add_line("H\tVN:Z:1.0") #doctest: +ELLIPSIS
    >>> gfa.add_line("H\tVN:Z:1.0") # ignored #doctest: +ELLIPSIS
    >>> gfa.add_line("H\tVN:Z:2.0")
    Traceback (most recent call last):
    ...
    gfapy.error.VersionError: ...

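The rule for predefined tags (a repeated identical value is tolerated, a different value is an error) can be modelled with a small dictionary-based sketch. It does not use Gfapy: ``merge_predefined`` is a hypothetical helper, and it raises a plain ``ValueError`` where Gfapy raises its own error classes.

```python
PREDEFINED = {"VN", "TS"}


def merge_predefined(header, name, value):
    """Store a predefined tag; reject a second, different value."""
    if name not in PREDEFINED:
        raise KeyError(name)
    if name in header and header[name] != value:
        raise ValueError(
            f"conflicting values for {name}: {header[name]!r} vs {value!r}"
        )
    header[name] = value  # a repeated identical value changes nothing


header = {}
merge_predefined(header, "VN", "1.0")
merge_predefined(header, "VN", "1.0")  # same value: accepted
try:
    merge_predefined(header, "VN", "2.0")
except ValueError as err:
    print("rejected:", err)
```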
Multiple definitions of custom header tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If each tag is present only once in the whole header, the tags are accessed
in the same way as for any other line (see the :ref:`tags` chapter).
However, the specification does not forbid defining a custom tag with
different values in different header lines (we call these "multi-definition
tags"). This particular case is handled in the next sections.
Reading multi-definition tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading, validating and setting the datatype of multi-definition tags is done
using the same methods as for all other lines (see the :ref:`tags` chapter).
However, if a tag is defined multiple times on multiple H lines, reading the
tag returns a list of the values found on those lines. This list is an
instance of ``gfapy.FieldArray``, a subclass of ``list``.

.. doctest::

    >>> gfa.add_line("H\txx:i:1") #doctest: +ELLIPSIS
    >>> gfa.add_line("H\txx:i:2") #doctest: +ELLIPSIS
    >>> gfa.add_line("H\txx:i:3") #doctest: +ELLIPSIS
    >>> gfa.header.xx
    gfapy.FieldArray('i',[1, 2, 3])

Setting tags
~~~~~~~~~~~~
There are two ways to set a tag for the header. The first is
the normal tag interface (using ``set`` or the tag name property), which
overwrites any existing value. The second is to use ``add``. The latter
supports multi-definition tags, i.e. it adds the value to the previous
ones (if any), instead of overwriting them.

.. doctest::

    >>> gfa = gfapy.Gfa()
    >>> gfa.header.xx
    >>> gfa.header.add("xx", 1)
    >>> gfa.header.xx
    1
    >>> gfa.header.add("xx", 2)
    >>> gfa.header.xx
    gfapy.FieldArray('i',[1, 2])
    >>> gfa.header.set("xx", 3)
    >>> gfa.header.xx
    3

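The difference between overwriting and accumulating can be summarised with a toy model. It mirrors only the behaviour shown above and is not Gfapy's implementation: the class ``HeaderTags`` is hypothetical, and a plain ``list`` stands in for ``gfapy.FieldArray``.

```python
class HeaderTags:
    """Toy model of header tag storage (not Gfapy's implementation)."""

    def __init__(self):
        self._tags = {}

    def get(self, name):
        # An undefined tag reads as None.
        return self._tags.get(name)

    def set(self, name, value):
        # Overwrites whatever was stored before.
        self._tags[name] = value

    def add(self, name, value):
        # Accumulates: the first value is stored as a scalar,
        # later values turn the entry into a list.
        if name not in self._tags:
            self._tags[name] = value
        elif isinstance(self._tags[name], list):
            self._tags[name].append(value)
        else:
            self._tags[name] = [self._tags[name], value]


tags = HeaderTags()
tags.add("xx", 1)
print(tags.get("xx"))  # 1
tags.add("xx", 2)
print(tags.get("xx"))  # [1, 2]
tags.set("xx", 3)
print(tags.get("xx"))  # 3
```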
Modifying field array values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Field arrays can be modified directly (e.g. by adding or removing
values). After modification, the user may check whether the
array values remain compatible with the datatype of the tag using the
:meth:`~gfapy.line.common.validate.Validate.validate_field` method.

.. doctest::

    >>> gfa = gfapy.Gfa()
    >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
    >>> gfa.header.xx
    gfapy.FieldArray('i',[1, 2, 3])
    >>> gfa.header.validate_field("xx")
    >>> gfa.header.xx.append("X")
    >>> gfa.header.validate_field("xx")
    Traceback (most recent call last):
    ...
    gfapy.error.FormatError: ...

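The reason validation is needed after an in-place change is that a Python list accepts any element. The sketch below shows the kind of check involved for an integer-typed array. The helper ``check_integer_values`` is hypothetical and says nothing about how Gfapy itself validates fields.

```python
def check_integer_values(values):
    """Raise ValueError if any element is not a plain int (hypothetical check)."""
    for index, value in enumerate(values):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"element {index} is not an integer: {value!r}")


values = [1, 2, 3]
check_integer_values(values)  # passes silently
values.append("X")            # the list itself does not object
try:
    check_integer_values(values)
except ValueError as err:
    print(err)  # element 3 is not an integer: 'X'
```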
Methods and functions that return a plain list, or data of any other type,
do not produce a field array. In that case a new field array must be
constructed, setting its datatype to the value returned by calling
:meth:`~gfapy.line.common.field_datatype.FieldDatatype.get_datatype`
on the header.

.. doctest::

    >>> gfa = gfapy.Gfa()
    >>> gfa.header.xx = gfapy.FieldArray('i',[1,2,3])
    >>> gfa.header.xx
    gfapy.FieldArray('i',[1, 2, 3])
    >>> gfa.header.xx = gfapy.FieldArray(gfa.header.get_datatype("xx"),
    ...     list(map(lambda x: x+1, gfa.header.xx)))
    >>> gfa.header.xx
    gfapy.FieldArray('i',[2, 3, 4])

String representation of the header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For consistency with other line types, the string representation of the header
is a single-line string, which may not be standard-compliant if it contains
multiple instances of the same tag. Likewise, when calling
:meth:`~gfapy.line.common.writer.Writer.field_to_s` for a tag present multiple
times, the output string contains all instances of the tag, separated by
tabs.
However, when the Gfa is output to file or string, the header is split into
multiple H lines with single tags, so that standard-compliant GFA is output.
The split header can be retrieved using the
:attr:`~gfapy.lines.headers.Headers.headers` property of the Gfa instance.

.. doctest::

    >>> gfa = gfapy.Gfa()
    >>> gfa.header.VN = "1.0"
    >>> gfa.header.xx = gfapy.FieldArray('i',[1,2])
    >>> gfa.header.field_to_s("xx")
    '1\t2'
    >>> gfa.header.field_to_s("xx", tag=True)
    'xx:i:1\txx:i:2'
    >>> str(gfa.header)
    'H\tVN:Z:1.0\txx:i:1\txx:i:2'
    >>> [str(h) for h in gfa.headers]
    ['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
    >>> str(gfa)
    'H\tVN:Z:1.0\nH\txx:i:1\nH\txx:i:2'

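Splitting a merged header into one H line per tag instance can be sketched independently of Gfapy. The function ``split_header`` and the dictionary layout (tag name mapped to a datatype and a value or list of values) are assumptions made for this illustration only.

```python
def split_header(tags):
    """Return one 'H' line per tag instance.

    tags maps each tag name to (datatype, value), where value is either
    a single value or a list of values for a multi-definition tag.
    """
    lines = []
    for name, (datatype, value) in tags.items():
        values = value if isinstance(value, list) else [value]
        for single in values:
            lines.append(f"H\t{name}:{datatype}:{single}")
    return lines


merged = {"VN": ("Z", "1.0"), "xx": ("i", [1, 2])}
print(split_header(merged))
# ['H\tVN:Z:1.0', 'H\txx:i:1', 'H\txx:i:2']
```

Each resulting line carries exactly one tag, so no line repeats a tag name.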
Count the input header lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because header lines are stored differently from other lines, the number of
header elements is not equal to the number of header lines in the input. This
is inconvenient if an application needs to count the input lines of a file. To
make that possible, the number of input header lines is counted and can be
retrieved using the :attr:`~gfapy.lines.headers.Headers.n_input_header_lines`
property of the Gfa instance.

.. doctest::

    >>> gfa = gfapy.Gfa()
    >>> gfa.add_line("H\txx:i:1\tyy:Z:ABC") #doctest: +ELLIPSIS
    >>> gfa.add_line("H\txy:i:2") #doctest: +ELLIPSIS
    >>> gfa.add_line("H\tyz:i:3\tab:A:A") #doctest: +ELLIPSIS
    >>> len(gfa.headers)
    5
    >>> gfa.n_input_header_lines
    3

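The two counts differ because one counts lines and the other counts tags. A plain-Python sketch over the same three input lines makes this concrete. It does not use Gfapy, and ``count_header_lines_and_tags`` is a hypothetical helper.

```python
def count_header_lines_and_tags(gfa_text):
    """Return (number of H lines, total number of tags on those lines)."""
    n_lines = 0
    n_tags = 0
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "H":
            n_lines += 1
            n_tags += len(fields) - 1  # every field after "H" is one tag
    return n_lines, n_tags


text = "H\txx:i:1\tyy:Z:ABC\nH\txy:i:2\nH\tyz:i:3\tab:A:A"
print(count_header_lines_and_tags(text))  # (3, 5)
```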