<a href="https://app.travis-ci.com/github/openvax/pyensembl">
    <img src="https://app.travis-ci.com/openvax/pyensembl.svg?branch=master" alt="Build Status" />
</a>
<a href="https://coveralls.io/github/openvax/pyensembl?branch=master">
    <img src="https://coveralls.io/repos/openvax/pyensembl/badge.svg?branch=master&service=github" alt="Coverage Status" />
</a>
<a href="https://pypi.python.org/pypi/pyensembl/">
    <img src="https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000" alt="PyPI" />
</a>


PyEnsembl
=======
PyEnsembl is a Python interface to [Ensembl](http://www.ensembl.org) reference genome metadata such as exons and transcripts. PyEnsembl downloads [GTF](https://en.wikipedia.org/wiki/Gene_transfer_format) and [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files from the [Ensembl FTP server](ftp://ftp.ensembl.org) and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files. 

# Example Usage

```python
from pyensembl import EnsemblRelease

# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)

# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
```

# Installation

You can install PyEnsembl using [pip](https://pip.pypa.io/en/latest/quickstart.html):

```sh
pip install pyensembl
```

This should also install any required packages, such as [datacache](https://github.com/openvax/datacache) and
[BioPython](http://biopython.org/).

Before using PyEnsembl, run the following command to download and install
Ensembl data:

```sh
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
```

For example, `pyensembl install --release 75 76 --species human` will download and install all
human reference data from Ensembl releases 75 and 76.

Alternatively, you can create the `EnsemblRelease` object from inside a Python
process and call `ensembl_object.download()` followed by `ensembl_object.index()`.
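
The in-process alternative might look like the following minimal sketch. The helper name `prepare_genome` is hypothetical; it relies only on the `download()` and `index()` methods mentioned above and accepts any object that provides them.

```python
def prepare_genome(genome):
    """Download reference files, then build the local database.

    `genome` is expected to be an `EnsemblRelease` (or any object
    exposing `download()` and `index()`, as described above).
    `prepare_genome` itself is a hypothetical helper, not part of PyEnsembl.
    """
    genome.download()  # fetch the GTF and FASTA files
    genome.index()     # load them into the local database
    return genome
```

With PyEnsembl installed, `data = prepare_genome(EnsemblRelease(77))` should be roughly equivalent to running `pyensembl install --release 77 --species human` beforehand.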

## Cache Location
By default, PyEnsembl caches files in the `pyensembl` sub-directory
of the platform-specific `Cache` folder.
You can override this default by setting the environment variable `PYENSEMBL_CACHE_DIR`
to your preferred cache location:

```sh
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
```

or

```python
import os

os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
```
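
If the custom location may not exist yet, a small wrapper can create it before setting the variable. This is only a sketch: `use_cache_dir` is a hypothetical helper name, and creating the directory up front is a precaution rather than a documented requirement. It also assumes the variable is read when a genome object is created, so the safe ordering is to call the helper first.

```python
import os


def use_cache_dir(path):
    """Create `path` if needed and point PYENSEMBL_CACHE_DIR at it.

    Hypothetical helper, not part of PyEnsembl. Call it before
    creating any genome objects (assumed to be when the variable is read).
    """
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    os.environ["PYENSEMBL_CACHE_DIR"] = path
    return path
```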

# Non-Ensembl Data

PyEnsembl also supports arbitrary genomes: you can specify local file
paths or remote URLs for Ensembl or non-Ensembl GTF and FASTA files.
(Warning: GTF formats can vary, and handling of
non-Ensembl data is still very much in development.)

For example:

```python
from pyensembl import Genome

data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf')
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
```

# API

The `EnsemblRelease` object has methods to let you access all possible
combinations of the annotation features *gene\_name*, *gene\_id*,
*transcript\_name*, *transcript\_id*, *exon\_id* as well as the location of
these genomic elements (contig, start position, end position, strand).

## Genes

<dl>
<dt>genes(contig=None, strand=None)</dt>
<dd>Returns a list of Gene objects, optionally restricted to a particular contig
or strand.</dd>

<dt>genes_at_locus(contig, position, end=None, strand=None)</dt>
<dd>Returns a list of Gene objects overlapping a particular position on a contig,
optionally extended into a range with the end parameter and restricted to the
forward or backward strand by passing strand='+' or strand='-'.</dd>

<dt>gene_by_id(gene_id)</dt>
<dd>Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").</dd>

<dt>gene_names(contig=None, strand=None)</dt>
<dd>Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.</dd>

<dt>genes_by_name(gene_name)</dt>
<dd>Returns a list containing a Gene object for each distinct gene ID with the
given name (there might be multiple due to copies in the genome).</dd>

<dt>gene_by_protein_id(protein_id)</dt>
<dd>Find Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283")</dd>

<dt>gene_names_at_locus(contig, position, end=None, strand=None)
</dt>
<dd>Names of genes overlapping with the given locus, optionally restricted by strand.
(returns a list to account for overlapping genes)</dd>

<dt>gene_name_of_gene_id(gene_id)
</dt>
<dd>Returns name of gene with given gene ID.</dd>

<dt>gene_name_of_transcript_id(transcript_id)
</dt><dd>Returns name of gene associated with given transcript ID.</dd>

<dt>gene_name_of_transcript_name(transcript_name)
</dt>
<dd>Returns name of gene associated with given transcript name.</dd>

<dt>gene_name_of_exon_id(exon_id)
</dt><dd>Returns name of gene associated with given exon ID.</dd>

<dt>gene_ids(contig=None, strand=None)
</dt>
<dd>Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.</dd>

<dt>gene_ids_of_gene_name(gene_name)
</dt>
<dd>Returns all Ensembl gene IDs with the given name.</dd>

</dl>
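
As an illustration of how the gene lookups compose, the sketch below combines `gene_names_at_locus` and `gene_ids_of_gene_name` from the list above. The function `gene_ids_at_locus` is a hypothetical helper, and `genome` stands for an `EnsemblRelease` (or any object offering the same two methods).

```python
def gene_ids_at_locus(genome, contig, position):
    """Map each gene name overlapping a locus to its Ensembl gene IDs.

    Hypothetical helper built only from the documented methods
    `gene_names_at_locus` and `gene_ids_of_gene_name`.
    """
    return {
        name: genome.gene_ids_of_gene_name(name)
        for name in genome.gene_names_at_locus(contig=contig, position=position)
    }
```

For the locus used in the usage example earlier (`contig=6, position=29945884` on release 77), the resulting dictionary would be keyed by `'HLA-A'`.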

## Transcripts

<dl>
<dt>transcripts(contig=None, strand=None)</dt>
<dd>Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or strand.</dd>

<dt>transcript_by_id(transcript_id)</dt>
<dd>Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985")</dd>

<dt>transcripts_by_name(transcript_name)</dt>
<dd>Returns a list of Transcript objects for every transcript matching the given name.</dd>

<dt>transcript_names(contig=None, strand=None)</dt>
<dd>Returns all transcript names in the annotation database.</dd>

<dt>transcript_ids(contig=None, strand=None)</dt>
<dd>Returns all transcript IDs in the annotation database.</dd>

<dt>transcript_ids_of_gene_id(gene_id)</dt>
<dd>Return IDs of all transcripts associated with given gene ID.</dd>

<dt>transcript_ids_of_gene_name(gene_name)</dt>
<dd>Return IDs of all transcripts associated with given gene name.</dd>

<dt>transcript_ids_of_transcript_name(transcript_name)</dt>
<dd>Find all Ensembl transcript IDs with the given name.</dd>

<dt>transcript_ids_of_exon_id(exon_id)</dt>
<dd>Return IDs of all transcripts associated with given exon ID.</dd>
</dl>
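
The ID-returning and object-returning transcript methods can be chained. The following sketch (with a hypothetical helper name, `transcripts_of_gene_name`) looks up transcript IDs by gene name and then constructs a Transcript object for each, using only `transcript_ids_of_gene_name` and `transcript_by_id` from the list above.

```python
def transcripts_of_gene_name(genome, gene_name):
    """Return a Transcript object for every transcript of the named gene.

    Hypothetical helper; `genome` is an `EnsemblRelease` or any object
    providing `transcript_ids_of_gene_name` and `transcript_by_id`.
    """
    return [
        genome.transcript_by_id(transcript_id)
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    ]
```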

## Exons

<dl>
<dt>exon_ids(contig=None, strand=None)</dt>
<dd>Returns a list of exon IDs in the annotation database, optionally restricted
by the given chromosome and strand.</dd>

<dt>exon_by_id(exon_id)</dt>
<dd>Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")</dd>

<dt>exon_ids_of_gene_id(gene_id)</dt>
<dd>Returns a list of exon IDs associated with a given gene ID.</dd>

<dt>exon_ids_of_gene_name(gene_name)</dt>
<dd>Returns a list of exon IDs associated with a given gene name.</dd>

<dt>exon_ids_of_transcript_id(transcript_id)</dt>
<dd>Returns a list of exon IDs associated with a given transcript ID.</dd>

<dt>exon_ids_of_transcript_name(transcript_name)</dt>
<dd>Returns a list of exon IDs associated with a given transcript name.</dd>
</dl>
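
To see which exons belong to which transcript of a gene, the exon and transcript lookups can be combined. This is a minimal sketch with a hypothetical helper name, `exon_ids_by_transcript`, relying only on `transcript_ids_of_gene_name` and `exon_ids_of_transcript_id` as listed above.

```python
def exon_ids_by_transcript(genome, gene_name):
    """Group the exon IDs of a gene by the transcript they belong to.

    Hypothetical helper; `genome` is an `EnsemblRelease` or any object
    providing the two documented methods used below.
    """
    return {
        transcript_id: genome.exon_ids_of_transcript_id(transcript_id)
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    }
```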