API Reference
=============
pyfastx.version
---------------
.. py:function:: pyfastx.version(debug=False)
Get current version of pyfastx
:param bool debug: if true, return versions of pyfastx, zlib, sqlite3 and zran.
:return: version of pyfastx
:rtype: str
.. py:function:: pyfastx.gzip_check(file_name)
New in pyfastx 0.5.4
Check whether a file is gzip compressed
:param str file_name: the path of input file
:return: True if file is gzip compressed else False
:rtype: bool
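A gzip check of this kind is commonly done by inspecting the first two bytes of the file. The sketch below is plain Python, not pyfastx's implementation; the helper name ``looks_like_gzip`` and the temporary file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_like_gzip(file_name):
    """Return True if the file starts with the gzip magic number."""
    with open(file_name, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstrate on two throwaway files: one plain, one gzip compressed.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.fa")
    gz_path = os.path.join(tmp, "packed.fa.gz")
    with open(plain_path, "w") as handle:
        handle.write(">seq1\nACGT\n")
    with gzip.open(gz_path, "wt") as handle:
        handle.write(">seq1\nACGT\n")
    plain_is_gzip = looks_like_gzip(plain_path)
    packed_is_gzip = looks_like_gzip(gz_path)
```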
.. py:function:: pyfastx.reverse_complement(seq)
New in pyfastx 2.0.0
Get the reverse complement of a given DNA sequence
:param str seq: DNA sequence
:return: reverse complement sequence
:rtype: str
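For reference, a reverse complement can be sketched in plain Python as follows. This is an illustration only, not pyfastx's code: the name ``reverse_complement_sketch`` is hypothetical, and it is an assumption of this sketch that characters other than A, C, G and T (such as N) pass through unchanged.

```python
# Complement table for unambiguous DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```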
pyfastx.Fasta
-------------
.. py:class:: pyfastx.Fasta(file_name, index_file=None, uppercase=True, build_index=True, full_index=False, full_name=False, memory_index=False, key_func=None)
Read and parse FASTA files. A Fasta object can be used like a dict or a list: use an index or a sequence name to get a sequence object, e.g. ``fasta[0]``, ``fasta['seq1']``
:param str file_name: the file path of input FASTA file
:param str index_file: the index file of the FASTA file; by default, the index file with the .fxi extension in the same directory as the FASTA file is used. New in 2.0.0
:param bool uppercase: always output uppercase sequence, default: ``True``
:param bool build_index: build index for random access to FASTA sequences, default: ``True``. If build_index is False, iteration returns a tuple (name, seq); if build_index is True, iteration returns a sequence object.
:param bool full_index: calculate character (e.g. A, T, G, C) composition when building the index. This speeds up GC content calculation but makes index building slower, default: ``False``
:param bool full_name: use the full header line instead of the part before the first whitespace as the sequence identifier, even when no index is built. New in 0.6.14, default: ``False``
:param bool memory_index: if True, the FASTA index is kept in memory and no index file is generated, default: ``False``
:param function key_func: new in 0.5.1, a key function (generally a lambda expression) that splits the header to obtain a shortened identifier, default: ``None``
:return: Fasta object
.. py:attribute:: file_name
FASTA file path
.. py:attribute:: size
total length of sequences in FASTA file
.. py:attribute:: type
New in ``pyfastx`` 0.5.4
get fasta type, return DNA, RNA, protein, or unknown
.. py:attribute:: is_gzip
New in pyfastx 0.5.0
return True if fasta is gzip compressed else return False
.. py:attribute:: gc_content
GC content of all sequences in the FASTA file, returned as a float
.. py:attribute:: gc_skew
GC skew of all sequences in the FASTA file, learn more about `GC skew <https://en.wikipedia.org/wiki/GC_skew>`_
New in ``pyfastx`` 0.3.8
.. py:attribute:: composition
nucleotide composition of the FASTA file, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
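To make the relationship between composition and GC skew concrete, the following plain-Python sketch counts bases and derives the skew as (G - C) / (G + C), the definition given on the linked page. The function names are hypothetical and this is not how pyfastx computes these values internally.

```python
from collections import Counter


def composition_sketch(seq):
    """Count each character of the sequence, ignoring case."""
    return dict(Counter(seq.upper()))


def gc_skew_sketch(comp):
    """GC skew = (G - C) / (G + C); 0.0 when there is no G or C at all."""
    g = comp.get("G", 0)
    c = comp.get("C", 0)
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


comp = composition_sketch("GGGCAT")
skew = gc_skew_sketch(comp)
```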
.. py:attribute:: longest
get longest sequence in FASTA file, return a Sequence object
New in ``pyfastx`` 0.3.0
.. py:attribute:: shortest
get shortest sequence in FASTA file, return a Sequence object
New in ``pyfastx`` 0.3.0
.. py:attribute:: mean
get average length of sequences in FASTA file
New in ``pyfastx`` 0.3.0
.. py:attribute:: median
get median length of sequences in FASTA file
New in ``pyfastx`` 0.3.0
.. py:method:: fetch(chrom, intervals, strand='+')
Extract subsequences from a given sequence by a start and end coordinate or by a list of coordinates. This method caches the full sequence in memory and is suited to extracting large numbers of subsequences from one sequence.
:param str chrom: chromosome name or sequence name
:param list/tuple intervals: list of [start, end] coordinates
:param str strand: sequence strand, ``+`` indicates sense strand, ``-`` indicates antisense strand, default: '+'
.. note::
intervals can be a list or tuple with a start and end position, e.g. (10, 20).
intervals can also be a list or tuple with multiple coordinates, e.g. [(10, 20), (50, 70)]
:return: sliced subsequences
:rtype: str
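The interval handling described above can be sketched in plain Python. This sketch assumes 1-based inclusive coordinates, assumes that multiple intervals are concatenated in the given order, and assumes that the antisense strand is the reverse complement of the concatenated result; none of these details are confirmed by this reference. ``fetch_sketch`` is a hypothetical helper that works on an in-memory string, not on a FASTA file.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def fetch_sketch(sequence, intervals, strand="+"):
    """Slice one or more (start, end) intervals out of a sequence string."""
    # Accept a single (start, end) pair as well as a list of pairs.
    if len(intervals) == 2 and all(isinstance(v, int) for v in intervals):
        intervals = [intervals]
    # Assumption: 1-based inclusive coordinates map to sequence[start-1:end].
    sliced = "".join(sequence[start - 1:end] for start, end in intervals)
    if strand == "-":
        # Assumption: antisense means reverse complement of the joined slices.
        sliced = sliced.translate(_COMPLEMENT)[::-1]
    return sliced
```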
.. py:method:: flank(chrom, start, end, flank_length=50, use_cache=False)
Get the flank sequence of given subsequence with start and end. New in 0.7.0
:param str chrom: chromosome name or sequence name
:param int start: 1-based start position of subsequence on chrom
:param int end: 1-based end position of subsequence on chrom
:param int flank_length: length of flank sequence, default 50
:param bool use_cache: cache the whole sequence
.. note::
If you want to extract flank sequences for large numbers of subsequences from the same sequence, using ``use_cache=True`` will greatly improve the speed
:return: left flank and right flank sequence
:rtype: tuple
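As an illustration of what the left and right flanks are, the sketch below works on a plain string with 1-based inclusive ``start`` and ``end``. Clipping the flanks at the sequence ends is an assumption of this sketch, and ``flank_sketch`` is a hypothetical name rather than part of pyfastx.

```python
def flank_sketch(sequence, start, end, flank_length=50):
    """Return (left, right) flanks around the 1-based region start..end."""
    # The region itself is sequence[start-1:end]; flanks sit on either side.
    left_begin = max(start - 1 - flank_length, 0)  # clip at sequence start
    left = sequence[left_begin:start - 1]
    right = sequence[end:end + flank_length]  # slicing clips at sequence end
    return left, right
```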
.. py:method:: build_index()
build index for FASTA file
.. py:method:: keys()
get all names of sequences
:return: a FastaKeys object
.. py:method:: count(n)
get counts of sequences whose length >= n bp
New in ``pyfastx`` 0.3.0
:param int n: number of bases
:return: sequence counts
:rtype: int
.. py:method:: nl(quantile)
calculate assembly N50 and L50, learn more about `N50,L50 <https://www.molecularecologist.com/2017/03/whats-n50/>`_
New in ``pyfastx`` 0.3.0
:param int quantile: a number between 0 and 100, default 50
:return: (N50, L50)
:rtype: tuple
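The usual N50/L50 definition, generalised to an arbitrary quantile, can be sketched as follows: sort lengths in descending order and accumulate until the running total reaches the requested percentage of the total length. This is a plain-Python illustration with a hypothetical name (``nl_sketch``) operating on a list of lengths, not pyfastx's implementation.

```python
def nl_sketch(lengths, quantile=50):
    """Return (Nx, Lx) for the given quantile, e.g. (N50, L50) for 50."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * quantile / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # length is Nx, count is Lx (number of sequences needed)
            return length, count
    raise ValueError("lengths must not be empty")
```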
pyfastx.Sequence
----------------
.. py:class:: pyfastx.Sequence
Read-only sequence object generated by a Fasta object. A Sequence can be treated as a list and supports slicing, e.g. ``seq[10:20]``
.. py:attribute:: id
sequence id or order number in FASTA file
.. py:attribute:: name
sequence name
.. py:attribute:: description
Get sequence description after name in sequence header
New in ``pyfastx`` 0.3.1
.. py:attribute:: start
start position of sequence
.. py:attribute:: end
end position of sequence
.. py:attribute:: gc_content
GC content of current sequence, return a float value
.. py:attribute:: gc_skew
GC skew of current sequence, learn more about `GC skew <https://en.wikipedia.org/wiki/GC_skew>`_
.. py:attribute:: composition
nucleotide composition of the sequence, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
.. py:attribute:: raw
get the raw string (with header line and sequence lines) of sequence as it appeared in file
New in ``pyfastx`` 0.6.3
.. py:attribute:: seq
get the string of sequence in sense strand
.. py:attribute:: reverse
get the string of reversed sequence
.. py:attribute:: complement
get the string of complement sequence
.. py:attribute:: antisense
get the string of sequence in antisense strand, i.e. the reverse complement of the sequence
.. py:method:: search(subseq, strand='+')
Search for subsequence from given sequence and get the start position of the first occurrence
New in ``pyfastx`` 0.3.6
:param str subseq: a subsequence for search
:param str strand: sequence strand + or -, default +
:return: one-based start position if the subsequence is found, otherwise None
:rtype: int or None
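The one-based return convention differs from Python's own ``str.find``, which is zero-based and returns -1 on failure. A minimal sketch of the conversion for the sense strand only (``search_sketch`` is a hypothetical helper on plain strings; antisense handling is not covered here):

```python
def search_sketch(sequence, subseq):
    """Return the 1-based position of the first match, or None."""
    position = sequence.find(subseq)  # 0-based, -1 when absent
    return position + 1 if position != -1 else None
```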
pyfastx.Fastq
-------------
New in ``pyfastx`` 0.4.0
.. py:class:: pyfastx.Fastq(file_name, index_file=None, phred=0, build_index=True, full_index=False)
Read and parse fastq file
:param str file_name: input FASTQ file path
:param str index_file: the index file of the FASTQ file; by default, the index file with the .fxi extension in the same directory as the FASTQ file is used. New in 2.0.0
:param bool build_index: build index for random access to FASTQ reads, default: ``True``. If build_index is False, iteration returns a tuple (name, seq, qual); if build_index is True, iteration returns a read object
:param bool full_index: calculate character (e.g. A, T, G, C) composition when building the index. This speeds up GC content calculation but makes index building slower, default: ``False``
:param int phred: offset used to convert quality ASCII characters to integer quality values, usually 33 or 64, default ``33``
:return: Fastq object
.. py:attribute:: file_name
FASTQ file path
.. py:attribute:: size
total bases in FASTQ file
.. py:attribute:: is_gzip
New in pyfastx 0.5.0
return True if the FASTQ file is gzip compressed else return False
.. py:attribute:: gc_content
GC content of whole FASTQ file
.. py:attribute:: avglen
New in ``pyfastx`` 0.6.10
get average length of reads
.. py:attribute:: maxlen
New in ``pyfastx`` 0.6.10
get maximum length of reads
.. py:attribute:: minlen
New in ``pyfastx`` 0.6.10
get minimum length of reads
.. py:attribute:: maxqual
New in ``pyfastx`` 0.6.10
get maximum quality score of bases
.. py:attribute:: minqual
New in ``pyfastx`` 0.6.10
get minimum quality score of bases
.. py:attribute:: composition
base composition of the FASTQ file, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
.. py:attribute:: phred
get phred value
.. py:attribute:: encoding_type
New in ``pyfastx`` 0.4.1
Guess the quality encoding type used by FASTQ sequence file
.. py:method:: build_index()
Build the index for a FASTQ file that was opened with build_index set to False
.. py:method:: keys()
New in ``pyfastx`` 0.8.0
Get all the names of reads in fastq file
:return: a FastqKeys object
pyfastx.Read
------------
New in ``pyfastx`` 0.4.0
.. py:class:: pyfastx.Read
Readonly read object for obtaining read information, generated by fastq object
.. py:attribute:: id
read id or order number in FASTQ file
.. py:attribute:: name
read name excluding '@'
.. py:attribute:: description
get the full header line of read
.. py:attribute:: raw
get the raw string (with header, sequence, comment and quality lines) of read as it appeared in file
New in ``pyfastx`` 0.6.3
.. py:attribute:: seq
get read sequence string
.. py:attribute:: qual
get read quality ascii string
.. py:attribute:: quali
get read quality scores as a list of integers (ASCII code minus phred offset)
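The "ascii - phred" conversion amounts to subtracting the phred offset from each character's ASCII code. A plain-Python sketch, with the hypothetical name ``quality_to_ints``:

```python
def quality_to_ints(qual, phred=33):
    """Convert a quality string to integer scores by subtracting the offset."""
    return [ord(char) - phred for char in qual]
```

For example, with offset 33 the characters ``!``, ``5`` and ``I`` map to 0, 20 and 40.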
pyfastx.Fastx
-------------
.. py:class:: pyfastx.Fastx(file_name, format="auto", uppercase=False)
New in ``pyfastx`` 0.8.0. A Python binding of kseq.h that provides a simple API for iterating over sequences in FASTA/Q files
:param str file_name: input fasta or fastq file path
:param str format: the input file format, can be "fasta" or "fastq", default: "auto", automatically detect the format of sequence file
:param bool uppercase: always output uppercase sequence, only works for FASTA files, default: False
:return: Fastx object
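To show the kind of index-free iteration this class is meant for, here is a pure-Python analogue for FASTA input. It is not kseq.h and not pyfastx code; ``iter_fasta`` is a hypothetical generator, and yielding ``(name, sequence)`` pairs with the name cut at the first whitespace is an assumption of this sketch.

```python
import io


def iter_fasta(handle, uppercase=False):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:]
            # Assumption: the name is the header up to the first whitespace.
            name = header.split(None, 1)[0] if header.strip() else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper() if uppercase else line)
    if name is not None:
        yield name, "".join(chunks)


# Iterate over an in-memory file with two multi-line records.
records = list(
    iter_fasta(io.StringIO(">seq1 first\nacgt\nAC\n>seq2\nGG\n"), uppercase=True)
)
```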
pyfastx.FastaKeys
------------------
.. py:class:: pyfastx.FastaKeys
FastaKeys is a read-only, list-like object containing the names of all sequences
.. py:method:: sort(by="id", reverse=False)
Sort keys by sequence id, name or length for iteration
New in ``pyfastx`` 0.5.0
:param str by: order by id, name, or length, default is id
:param bool reverse: used to flag descending sorts, default is False
:return: FastaKeys object itself
.. py:method:: filter(*filters)
Filter keys by sequence name and length for iteration
:param list filters: filters generated by comparisons like ids > 500 or ids % 'seq1', where ids is an Identifier object
:return: FastaKeys object itself
.. py:method:: reset()
Clear all filters and sort order
:return: FastaKeys object itself
pyfastx.FastqKeys
------------------
.. py:class:: pyfastx.FastqKeys
New in ``pyfastx`` 0.8.0. FastqKeys is a read-only, list-like object containing the names of all reads