File: PKG-INFO

Metadata-Version: 2.4
Name: ncls
Version: 0.0.70
Summary: A fast interval tree-like implementation in C, wrapped for the Python ecosystem.
Author-email: Endre Bakken Stovner <endbak@pm.me>
License: MIT
Project-URL: Homepage, http://github.com/pyranges/ncls
Keywords: ncls,interval-tree,genomics
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Other Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Topic :: Scientific/Engineering
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: bumpver; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: pip-tools; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# Nested containment list

## Deprecation notice

While I'll continue maintaining this library, I suggest you switch to [ruranges](https://github.com/pyranges/ruranges/), which is a more lightweight and faster library with many more operations than NCLS.

## NCLS

[![Build Status](https://travis-ci.org/biocore-ntnu/ncls.svg?branch=master)](https://travis-ci.org/hunt-genes/ncls) [![PyPI version](https://badge.fury.io/py/ncls.svg)](https://badge.fury.io/py/ncls)

The Nested Containment List is a data structure for interval overlap queries,
like the interval tree. It is usually an order of magnitude faster than the
interval tree, both for building and for query lookups.
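
The idea behind the structure, as described in the original paper, can be illustrated with a minimal pure-Python sketch. This is not the C implementation shipped in this library: the names `Level`, `build` and `query_overlaps` are made up for illustration, and intervals are assumed to be half-open, `[start, end)`.

```python
from bisect import bisect_right


class Level:
    """One list in the hierarchy; no interval here contains a sibling."""

    def __init__(self):
        self.nodes = []  # (start, end, ident, children), children is a Level
        self.ends = []   # end of each node; strictly increasing within a level

    def add(self, start, end, ident):
        children = Level()
        self.nodes.append((start, end, ident, children))
        self.ends.append(end)
        return children


def build(intervals):
    """Nest every interval inside the closest earlier interval that contains it."""
    top = Level()
    stack = []  # (end, children) for the containers that are still open
    # Start ascending, end descending: a container sorts before its contents.
    for start, end, ident in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        while stack and end > stack[-1][0]:
            stack.pop()  # this container ends too early to hold the interval
        parent = stack[-1][1] if stack else top
        stack.append((end, parent.add(start, end, ident)))
    return top


def query_overlaps(level, qstart, qend):
    """Collect every (start, end, ident) that overlaps [qstart, qend)."""
    hits = []
    i = bisect_right(level.ends, qstart)  # first node whose end > qstart
    while i < len(level.nodes) and level.nodes[i][0] < qend:
        start, end, ident, children = level.nodes[i]
        hits.append((start, end, ident))
        hits.extend(query_overlaps(children, qstart, qend))
        i += 1
    return hits


top = build([(0, 10, "a"), (2, 5, "b"), (3, 4, "c"), (6, 8, "d"), (12, 15, "e")])
print(query_overlaps(top, 4, 7))
# [(0, 10, 'a'), (2, 5, 'b'), (6, 8, 'd')]
```

Because no interval in a level contains a sibling, ends are sorted along with starts. A binary search therefore finds the first candidate, and the scan stops at the first interval that starts at or after the query end.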

The implementation here is a revived version of the one used in the now-defunct
PyGr library, which died of bitrot. I have made it less memory-consuming and
created wrapper functions that allow batch-querying the NCLS for further speed
gains.

It was implemented to be the cornerstone of the PyRanges project, but I have made
it available to the Python community as a stand-alone library. Enjoy.

Original Paper: https://academic.oup.com/bioinformatics/article/23/11/1386/199545
Cite: http://dx.doi.org/10.1093/bioinformatics/btz615

## Cite

If you use this library in published research, please cite:

http://dx.doi.org/10.1093/bioinformatics/btz615

## Install

```
pip install ncls
```

## Usage

```python
from ncls import NCLS

import pandas as pd

starts = pd.Series(range(0, 5))
ends = starts + 100
ids = starts

subject_df = pd.DataFrame({"Start": starts, "End": ends}, index=ids)

print(subject_df)
#    Start  End
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104

ncls = NCLS(starts.values, ends.values, ids.values)

# python API, slower
it = ncls.find_overlap(0, 2)
for i in it:
    print(i)
# (0, 100, 0)
# (1, 101, 1)

starts_query = pd.Series([1, 3])
ends_query = pd.Series([52, 14])
indexes_query = pd.Series([10000, 100])

query_df = pd.DataFrame({"Start": starts_query.values, "End": ends_query.values}, index=indexes_query.values)

query_df
#        Start  End
# 10000      1   52
# 100        3   14


# everything done in C/Cython; faster
l_idxs, r_idxs = ncls.all_overlaps_both(starts_query.values, ends_query.values, indexes_query.values)
l_idxs, r_idxs
# (array([10000, 10000, 10000, 10000, 10000,   100,   100,   100,   100,
#          100]), array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))

print(query_df.loc[l_idxs])
#        Start  End
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 100        3   14
# 100        3   14
# 100        3   14
# 100        3   14
# 100        3   14
print(subject_df.loc[r_idxs])
#    Start  End
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104

# return intervals in python (slow/mem-consuming)
intervals = ncls.intervals()
intervals
# [(0, 100, 0), (1, 101, 1), (2, 102, 2), (3, 103, 3), (4, 104, 4)]
```
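
To sanity-check results like the ones above without the library, a brute-force reference can help. The sketch below assumes half-open intervals, which is what the `find_overlap(0, 2)` output suggests; `all_overlaps_brute_force` is a hypothetical helper, not part of the ncls API, and it is quadratic in the number of intervals.

```python
def all_overlaps_brute_force(q_starts, q_ends, q_ids, s_starts, s_ends, s_ids):
    """Return paired lists of (query id, subject id) for every overlapping pair."""
    left, right = [], []
    for qs, qe, qi in zip(q_starts, q_ends, q_ids):
        for ss, se, si in zip(s_starts, s_ends, s_ids):
            # Half-open intervals overlap when each starts before the other ends.
            if qs < se and ss < qe:
                left.append(qi)
                right.append(si)
    return left, right


# Same subject and query intervals as in the usage example.
subject_starts = list(range(5))
subject_ends = [s + 100 for s in subject_starts]
subject_ids = subject_starts

l_idxs, r_idxs = all_overlaps_brute_force(
    [1, 3], [52, 14], [10000, 100],
    subject_starts, subject_ends, subject_ids,
)
print(l_idxs)
# [10000, 10000, 10000, 10000, 10000, 100, 100, 100, 100, 100]
print(r_idxs)
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```

The pairs match the usage example for this input. The order in which the library returns pairs for other inputs is not something this sketch establishes.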

There is also an experimental floating point version of the NCLS called FNCLS.
See the examples folder.

## Benchmark

Test file of 100 million intervals (created by subsetting the GENCODE GTF with replacement):

| Library | Function | Time (s) | Memory (GB) |
| --- | --- | --- | --- |
| bx-python | build | 161.7 | 2.5 |
| ncls | build | 3.15 | 0.5 |
| bx-python | overlap | 148.4 | 4.3 |
| ncls | overlap | 7.2 | 0.5 |

Building is about 50 times faster and overlap queries are about 20 times faster
than with bx-python. Memory usage is one fifth for building and roughly one
ninth for overlap queries.
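
The ratios follow directly from the table; a short sketch of the arithmetic, with the numbers copied from the rows above (the `results` and `ratios` names are only for illustration):

```python
# (time in seconds, memory in GB), copied from the benchmark table
results = {
    "build": {"bx-python": (161.7, 2.5), "ncls": (3.15, 0.5)},
    "overlap": {"bx-python": (148.4, 4.3), "ncls": (7.2, 0.5)},
}

ratios = {}
for step, by_lib in results.items():
    bx_time, bx_mem = by_lib["bx-python"]
    ncls_time, ncls_mem = by_lib["ncls"]
    ratios[step] = (bx_time / ncls_time, bx_mem / ncls_mem)
    speedup, mem_saving = ratios[step]
    print(f"{step}: {speedup:.1f}x faster, {mem_saving:.1f}x less memory")
# build: 51.3x faster, 5.0x less memory
# overlap: 20.6x faster, 8.6x less memory
```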

## Original paper

> Alexander V. Alekseyenko, Christopher J. Lee; Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Bioinformatics, Volume 23, Issue 11, 1 June 2007, Pages 1386–1393, https://doi.org/10.1093/bioinformatics/btl647