Resources
=========

Genome assembly metadata
------------------------

Bioframe provides a collection of genome assembly metadata for commonly used
genomes. These are accessible through a convenient dataclass interface via :func:`bioframe.assembly_info`.

The assemblies are listed in a manifest YAML file, and each assembly
has a mandatory companion file called `seqinfo` that contains the sequence
names, lengths, and other information. The records in the manifest file contain
the following fields:

- ``organism``: the organism name
- ``provider``: the genome assembly provider (e.g., ucsc, ncbi)
- ``provider_build``: the genome assembly build name (e.g., hg19, GRCh37)
- ``release_year``: the year of the assembly release
- ``seqinfo``: path to the seqinfo file
- ``cytobands``: path to the cytoband file, if available
- ``default_roles``: default molecular roles to include from the seqinfo file
- ``default_units``: default assembly units to include from the seqinfo file
- ``url``: URL to where the corresponding sequence files can be downloaded
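
As a rough illustration of the record layout above, the fields can be modelled with a small dataclass. This is only a sketch: ``ManifestRecord`` is a hypothetical name, the field types are assumptions, and the values are made up for the example rather than copied from the real manifest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ManifestRecord:
    """Hypothetical container mirroring the manifest fields listed above."""

    organism: str
    provider: str
    provider_build: str
    release_year: str
    seqinfo: str
    default_roles: List[str]
    default_units: List[str]
    url: str
    # The cytoband file is optional, so this field defaults to None.
    cytobands: Optional[str] = None


# Illustrative values only; paths and URL are placeholders.
raw = {
    "organism": "homo sapiens",
    "provider": "ucsc",
    "provider_build": "hg19",
    "release_year": "2009",
    "seqinfo": "hg19.seqinfo.tsv",
    "cytobands": "hg19.cytoband.tsv",
    "default_roles": ["assembled"],
    "default_units": ["primary", "non-nuclear"],
    "url": "https://example.org/hg19/",
}

record = ManifestRecord(**raw)
print(record.provider_build, record.default_units)
```

Keeping ``cytobands`` as the only field with a default reflects that it is the one entry described as present only "if available".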

The `seqinfo` file is a TSV file with the following columns (with header):

- ``name``: canonical sequence name
- ``length``: sequence length
- ``role``: role of the sequence or scaffold (e.g., "assembled", "unlocalized", "unplaced")
- ``molecule``: name of the molecule that the sequence belongs to, if placed
- ``unit``: assembly unit of the chromosome (e.g., "primary", "non-nuclear", "decoy")
- ``aliases``: comma-separated list of aliases for the sequence name
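
A minimal sketch of reading a file with these columns, using only the standard library. The helper ``read_seqinfo`` and the toy rows are hypothetical, and representing an empty ``molecule`` as ``None`` is an assumption made for the example, not bioframe's own loader behaviour.

```python
import csv
import io

# Toy data with the documented header; names and lengths are invented.
SEQINFO_TSV = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t1000\tassembled\tchr1\tprimary\t1,NC_000001\n"
    "chrM\t200\tassembled\tchrM\tnon-nuclear\tMT\n"
    "chrUn_x\t50\tunplaced\t\tprimary\t\n"
)


def read_seqinfo(handle):
    """Parse seqinfo rows into dicts with typed length and alias fields."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        # An unplaced sequence has no molecule; store that as None.
        row["molecule"] = row["molecule"] or None
        # Split the comma-separated aliases, dropping empty entries.
        row["aliases"] = [a for a in row["aliases"].split(",") if a]
        rows.append(row)
    return rows


seqinfo = read_seqinfo(io.StringIO(SEQINFO_TSV))
for row in seqinfo:
    print(row["name"], row["length"], row["aliases"])
```

Because the rows are appended in file order, the resulting list preserves the ordering of the sequences in the file.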

We currently do not include sequences with "alt" or "patch" roles in `seqinfo` files, but we
do support the inclusion of additional decoy sequences (as used by so-called NGS *analysis
sets* for human genome assemblies) by marking them as members of a "decoy" assembly unit.
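
To make the role and unit vocabulary concrete, the sketch below selects sequences by both attributes, in the spirit of the ``default_roles`` and ``default_units`` manifest fields. The function ``select_sequences`` is hypothetical, the records are toy data, and the role given to the decoy record is a guess; the exact filtering that bioframe applies is not shown here.

```python
def select_sequences(records, roles, units):
    """Return names of records whose role and unit are both allowed."""
    return [
        rec["name"]
        for rec in records
        if rec["role"] in roles and rec["unit"] in units
    ]


# Toy records; "decoy_1" stands in for a decoy sequence of an analysis set.
records = [
    {"name": "chr1", "role": "assembled", "unit": "primary"},
    {"name": "chrM", "role": "assembled", "unit": "non-nuclear"},
    {"name": "chrUn_x", "role": "unplaced", "unit": "primary"},
    {"name": "decoy_1", "role": "unplaced", "unit": "decoy"},
]

# Assembled molecules from the primary and non-nuclear units only.
core = select_sequences(records, {"assembled"}, {"primary", "non-nuclear"})

# Widening both sets brings in unplaced scaffolds and decoys as well.
everything = select_sequences(
    records,
    {"assembled", "unplaced"},
    {"primary", "non-nuclear", "decoy"},
)
print(core)
print(everything)
```

Marking decoys with their own assembly unit means they can be included or excluded by changing the unit set alone, without touching the role set.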

The `cytobands` file is an optional TSV file with the following columns (with header):

- ``chrom``: chromosome name
- ``start``: start position
- ``end``: end position
- ``band``: cytogenetic coordinate (name of the band)
- ``stain``: Giemsa stain result

The order of the sequences in the `seqinfo` file is treated as canonical.
The ordering of the chromosomes in the `cytobands` file should match the order
of the chromosomes in the `seqinfo` file.
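
The ordering requirement can be checked with a few lines of code. This is a sketch under stated assumptions: ``cytoband_order_matches`` is a hypothetical helper, and it assumes the cytoband file may cover only a subset of the `seqinfo` sequences, so it compares relative order rather than demanding identical lists.

```python
from itertools import groupby


def cytoband_order_matches(seqinfo_names, cytoband_chroms):
    """Check that chromosomes appear in cytobands in seqinfo order.

    seqinfo_names: sequence names in canonical (seqinfo) order.
    cytoband_chroms: the ``chrom`` column of the cytoband file, row by row.
    """
    # Collapse consecutive rows of the same chromosome into one entry.
    band_order = [chrom for chrom, _ in groupby(cytoband_chroms)]
    # A chromosome split across non-adjacent runs is out of order.
    if len(band_order) != len(set(band_order)):
        return False
    rank = {name: i for i, name in enumerate(seqinfo_names)}
    # Every banded chromosome must be a known sequence.
    if any(chrom not in rank for chrom in band_order):
        return False
    ranks = [rank[chrom] for chrom in band_order]
    return ranks == sorted(ranks)


names = ["chr1", "chr2", "chrX", "chrM"]
print(cytoband_order_matches(names, ["chr1", "chr1", "chr2", "chrX"]))
print(cytoband_order_matches(names, ["chr2", "chr1"]))
```

The first call follows the canonical order and the second does not, so they report ``True`` and ``False`` respectively.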

The manifest and companion files are stored in the ``bioframe/io/data`` directory.
New assemblies can be requested by opening an issue on GitHub or by submitting a pull request.

.. automodule:: bioframe.io.assembly
   :autosummary:
   :members:

.. autoclass:: bioframe.io.assembly.GenomeAssembly
   :members:
   :undoc-members:


Remote resources
----------------

These functions now default to the local data store. To obtain chromsizes or
centromere positions from UCSC instead, pass ``provider="ucsc"``.

.. automodule:: bioframe.io.resources
   :autosummary:
   :members: