========
Glossary
========
.. glossary::
:sorted:
cigar
Stands for Compact Idiosyncratic Gapped Alignment Report and
represents a compressed (run-length encoded) pairwise alignment
format. It was first defined by the Exonerate aligner, but was later
adapted and adopted as part of the :term:`SAM` standard and many other
aligners. In the Python API, the cigar alignment is presented as a
list of tuples ``(operation,length)``. For example, the list
``[(0,3), (1,5), (0,2)]`` refers to an alignment with 3 matches, 5
insertions and another 2 matches.
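As a minimal sketch of how such a list can be read, the snippet below renders
the tuples as a SAM-style CIGAR string and counts the reference bases the
alignment covers. The operation codes are assumed to follow the BAM numbering
in the SAM specification (0 = M, 1 = I, 2 = D, and so on); the helper names
``cigar_to_string`` and ``reference_span`` are illustrative and not part of
pysam.

```python
# BAM operation codes as listed in the SAM specification:
# index 0 = M, 1 = I, 2 = D, 3 = N, 4 = S, 5 = H, 6 = P, 7 = '=', 8 = X
CIGAR_OPS = "MIDNSHP=X"


def cigar_to_string(cigar):
    """Render a list of (operation, length) tuples as a CIGAR string."""
    return "".join(f"{length}{CIGAR_OPS[op]}" for op, length in cigar)


def reference_span(cigar):
    """Count reference bases covered; M, D, N, = and X consume the reference."""
    return sum(length for op, length in cigar if op in (0, 2, 3, 7, 8))


cigar = [(0, 3), (1, 5), (0, 2)]
print(cigar_to_string(cigar))   # 3 matches, 5 insertions, 2 matches
print(reference_span(cigar))    # insertions do not consume reference bases
```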
region
A genomic region, stated relative to a :term:`reference` sequence. A
region consists of reference name ('chr1'), start (15000), and
end (20000). Start and end can be omitted for regions spanning
a whole chromosome. If ``end`` is missing, the region will span from
``start`` to the end of the chromosome. Within pysam, coordinates
are 0-based half-open intervals, i.e., the first base of the
reference sequence is numbered zero; and the base at position
``start`` is part of the interval, but the base at ``end`` is not.
When a region is written as a single string using
`samtools`_-compatible notation, e.g., 'chr1:15001-20000',
the string's coordinates instead represent a 1-based closed interval,
i.e., both (1-based) positions 15,001 and 20,000 are part of the
interval. (This example denotes the same 5,000-base region as the
example in the previous paragraph.)
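The relationship between the two conventions can be sketched in a few lines
of plain Python. The helpers ``to_region_string`` and ``parse_region_string``
are hypothetical names used for illustration only, and they handle just the
full ``contig:start-end`` form, not regions with an omitted start or end.

```python
def to_region_string(contig, start, end):
    """Convert a 0-based half-open interval to a 1-based closed region string."""
    # Only the start shifts by one; the exclusive 0-based end equals
    # the inclusive 1-based end.
    return f"{contig}:{start + 1}-{end}"


def parse_region_string(region):
    """Convert 'contig:start-end' (1-based, closed) to (contig, start, end), 0-based half-open."""
    # Split on the last colon so contig names containing ':' still work.
    contig, _, span = region.rpartition(":")
    first, _, last = span.partition("-")
    return contig, int(first) - 1, int(last)


print(to_region_string("chr1", 15000, 20000))
print(parse_region_string("chr1:15001-20000"))
```

In both notations the region length is the same: ``end - start`` for the
0-based form and ``last - first + 1`` for the 1-based form.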
genotype
An individual's collection of genes. It can also refer to the two alleles
inherited for a particular gene.
column
The portion of reads aligned to a single base in the
:term:`reference` sequence.
tid
The :term:`target` id. The target id is 0 or a positive integer mapping to
entries within the sequence dictionary in the header section of
a :term:`TAM` file or :term:`BAM` file.
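The mapping can be pictured as an index into the ordered list of sequence
names in the header. The sketch below uses a made-up sequence dictionary and
hypothetical helper names; when working with real files, pysam's
``AlignmentFile.get_reference_name`` and ``AlignmentFile.get_tid`` perform
the corresponding lookups.

```python
# Hypothetical sequence dictionary, in the order the sequences
# appear in the file header.
sequence_names = ["chr1", "chr2", "contig123"]


def name_of(tid):
    """Return the sequence name for a target id (0 or a positive integer)."""
    if not 0 <= tid < len(sequence_names):
        raise ValueError(f"tid {tid} is not in the sequence dictionary")
    return sequence_names[tid]


def tid_of(name):
    """Return the target id for a sequence name."""
    return sequence_names.index(name)


print(name_of(0))
print(tid_of("contig123"))
```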
contig
The sequence that a :term:`tid` refers to. For example ``chr1``, ``contig123``.
reference
Synonym for :term:`contig`.
BED
Browser Extensible Data format. A text file format used to store genomic
:term:`regions<region>` as coordinates and associated notations.
GTF
The Gene Transfer Format is a file format used to hold information
about gene structure.
SAM
A textual format for storing genomic alignment information.
BAM
Binary SAM format. BAM files are binary formatted, indexed and
allow random access.
CRAM
CRAM is a binary format representing the same sequence alignment
information as SAM and BAM, but offering significantly better
lossless compression than BAM.
TAM
Text SAM file. TAM files are human readable files of
tab-separated fields. TAM files do not allow random access.
sam file
A file containing aligned reads. The :term:`sam file` can either
be a :term:`BAM` file or a :term:`TAM` file.
pileup
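The reads aligned over a :term:`region`, arranged position by position
so that each reference base corresponds to one :term:`column`.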
fetching
Retrieving all reads mapped to a :term:`region`.
target
The sequence that a read has been aligned to. Target
sequences have both a numerical identifier (:term:`tid`)
and an alphanumeric name (:term:`reference`).
tabix file
A sorted, compressed and indexed tab-separated file created
by the command line tool :file:`tabix` or the commands
:meth:`tabix_compress` and :meth:`tabix_index`. The file
is indexed by chromosomal coordinates.
tabix row
A row in a :term:`tabix file`. Fields within a row are
tab-separated.
soft clipping
soft clipped
In alignments with soft clipping, part of the query sequence
is not aligned. The unaligned query sequence is still part
of the alignment record. This is in contrast to
:term:`hard clipped` reads.
hard clipping
hard clipped
In hard clipped reads, part of the sequence has been removed
prior to alignment. The fact that only a subsequence is aligned might
be recorded in the :term:`cigar` alignment, but the removed
sequence will not be part of the alignment record, in contrast
to :term:`soft clipped` reads.
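The difference between the two kinds of clipping shows up when counting
bases from a :term:`cigar` list. The sketch below assumes the BAM operation
codes from the SAM specification (4 = soft clip, 5 = hard clip); the function
names and the two example alignments are illustrative only.

```python
SOFT_CLIP, HARD_CLIP = 4, 5  # BAM codes for the S and H operations


def stored_query_length(cigar):
    """Bases kept in the alignment record: M, I, S, = and X consume the query."""
    return sum(length for op, length in cigar if op in (0, 1, SOFT_CLIP, 7, 8))


def original_read_length(cigar):
    """Stored bases plus any bases removed by hard clipping."""
    hard = sum(length for op, length in cigar if op == HARD_CLIP)
    return stored_query_length(cigar) + hard


soft = [(SOFT_CLIP, 10), (0, 90)]  # 10S90M: clipped bases remain in the record
hard = [(HARD_CLIP, 10), (0, 90)]  # 10H90M: clipped bases are absent

print(stored_query_length(soft), original_read_length(soft))
print(stored_query_length(hard), original_read_length(hard))
```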
VCF
Variant Call Format.
BCF
Binary :term:`VCF`.
FASTA
Simple text format containing sequence data, with only the bare
minimum of metadata. Typically used for reference sequence data.
FASTQ
Simple text format containing sequence data and associated base
qualities.
tabix
Utility in the htslib package to index :term:`bgzip` compressed
files.
faidx
Utility in the `samtools`_ package to index :term:`FASTA` formatted
files.
bgzip
Utility in the htslib package to block compress genomic data
files.