1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333
|
.. _chap_methodology:
***********
Methodology
***********
**siderRETRO** uses NGS (*Next Generation Sequencing*) data
to identify **unfixed** - *dimorphic/polymorphic, germinative,
or somatic* - retrocopies absent in the reference genome,
but present in the sequenced genome (by NGS).
Our **methodology** consists of detecting
abnormal (discordant) alignments in `SAM/BAM/CRAM
<https://samtools.github.io/hts-specs/SAMv1.pdf>`_
file and, with an **unsupervised machine learning**
algorithm, clustering these reads and genotype in order
to discover somatic retrocopy insertions. Care is taken
to ensure the quality and consistency of the data,
taking into account the features that characterize a
retrocopy mobilization, such as the absence of
**intronic** and **regulatory** regions.
.. note:: For more detail about the jargon, see `Retrocopy in a nutshell <retrocopy.rst>`_
Abnormal alignment
==================
When a structural variation, such as a retrotransposition,
occurs into an individual and her genome is sequenced with
a next-generation sequencing technology (e.g. `illumina
<https://www.illumina.com/>`_), we may **expect** that the
aligner (e.g. `BWA <http://bio-bwa.sourceforge.net/>`_,
`Bowtie <http://bowtie-bio.sourceforge.net/index.shtml>`_)
will be **confused** as to the origin of certain reads. As the
retrocopies come from a **mature mRNA**, reads from the
retrocopy may be **erroneously** aligned to an **exon** of the
**parental gene**:
.. image:: images/indistinguishable_alignment.png
:scale: 25%
:align: center
These kind of alignment may be called **indistinguishable**,
because they do not give any **clue** about the presence of
the retrocopy. However, for our luck, there are reads
with abnormal (discordant) alignments which could be
helpful according to their characteristics:
- Paired-end reads aligned at **different** exons
- Paired-end reads aligned at **different** chromosomes
- Paired-end reads aligned at **distant** regions
- Splitted reads (Reads with **supplementary** alignment)
We will talk about each one as best as we can in the
next lines.
Alignment at different exons
----------------------------
When paired-end reads are mapped to contiguous exons and they
came from a genomic sequencing - which of course is not
expected.
.. image:: images/abnormal_alignment_exon.png
:scale: 25%
:align: center
This kind of alignment is useful for **assume** a
retrotransposition for the given parental gene, however it
is not possible to annotate the **genomic position** of the event.
Alignment at different chromosomes
------------------------------------
When the retrotransposition does not occur into the **same** parental
gene chromosome, it may happen that one read come from a **near**
intergenic region and its pair from the somatic **retrocopy**. As the
retrocopy **does not exist** in the reference genome, the aligner will
**map** one read to the retrotransposition chromosome and its pair to
the parental gene **exon**.
.. image:: images/abnormal_alignment_chr.png
:scale: 25%
:align: center
This alignment is useful to **estimate** the genomic position of the
event, but not with so much **precision** concerning to the **insertion
point**.
Alignment at distant regions
-----------------------------
If a retrocopy is inserted into the **same** chromosome of its parental
gene, possibly it will occur at a **distant** location. As well as
*"alignment at different chromosomes"*, one read may come from a near
intergenic region and its mate from the somatic retrocopy. So when
the aligner try to map these reads, we will observe that one fall
inside the parental gene exon, while its pair is mapped to a **distant
region**.
.. image:: images/abnormal_alignment_dist.png
:scale: 25%
:align: center
Splitted reads
--------------
The most **important** kind of alignment when detecting structural variations.
The splitted read may occur when **part** of the **same** read come from a near
intergenic region and part from the somatic retrocopy. When the aligner
**try** to map the read, it will need to **create** another one to represent
the splitted part, which is called **supplementary**.
.. image:: images/abnormal_alignment_sr.png
:scale: 25%
:align: center
This alignment is useful to detect the **insertion point** with a
**good precision**.
Taking all together
-------------------
We can resume all abnormal alignments according to their power
to detect the retrotransposition coordinate and its exact insertion
point:
+--------------------------+------------+-----------------+
| Abnormal alignments | Coordinate | Insertion point |
+==========================+============+=================+
| At different exons | NO | NO |
+--------------------------+------------+-----------------+
| At different chromosomes | YES | NO |
+--------------------------+------------+-----------------+
| At distant regions | YES | NO |
+--------------------------+------------+-----------------+
| Splitted read | YES | YES |
+--------------------------+------------+-----------------+
sideRETRO uses **only** the abnormal alignments **capable** to
detect **at least** the coordinate, so those that fall into
*different exons* are dismissed.
Clustering
==========
So far we have been talking about abnormal reads **selection**. As
soon as this step is over, we need to determine if a bunch of
reads aligned to some genomic region may **represent** a putative
retrocopy insertion. Therefore, firstly we restrict the abnormal
reads for those whose **mate is mapped** to a protein coding **exon**,
and then we **cluster** them according to the chromosome they mapped
to.
.. image:: images/abnormal_alignment_clustering.png
:scale: 25%
:align: center
Wherefore, the clustering algorithm plays the role to resolve
if there really is a retrotransposition event. As the **number**
of reads **covering** the group is an important feature to take
into account, one possible choice of algorithm is **DBSCAN**.
DBSCAN
------
*Density Based Spatial Clustering of Applications with Noise* [1]_
is a density based clustering algorithm designed to discover cluster
in a **spatial database**. In our particular case, the database is
spatially of **one dimension** (the chromosome extension) and the
points are represented by the **range** comprising the mapped reads
start and end.
.. image:: images/DBSCAN.png
:scale: 25%
:align: center
The denser (covered) the region the greater the chance of a
retrotransposition event there.
Genotype
========
In order to **increase** the putative insertion coverage, it is common
to **join** analysis of a bunch of individuals. After the discovery
of the retrocopies, it is necessary to identify **who owns** the
variation and with what **zygosity** ((heterozygous, homozygous).
So we have **three** possibilities for biallelic sites [2]_: If *A*
is the **reference** allele and *B* is the **alternate** allele, the
ordering of genotypes for the likelihoods is *AA*, *AB*, *BB*. The
**likelihoods** in turn is calculated based on *Heng Li* paper [3]_
with some assumptions that we are going to discuss.
Suppose at a given retrotransposition insertion point site there are
*k* reads. Let the first *l* reads identical to the reference genome
and the rest be different. The unphred alignment error probability of
the *j*-th read is :math:`\epsilon_{j}`. Assuming error independence,
we can derive that:
.. math::
\delta(g) =
\frac{1}{m^k}
\prod_{j=1}^{l} [(m-g)\epsilon_{j}+g(1-\epsilon_{j})]
\prod_{j=l+1}^{k} [(m-g)(1-\epsilon_{j})+g\epsilon_{j}]
where:
+-------------------+--------------------------------------------+
| :math:`\delta(g)` | Likelihoods for a given genotype |
+-------------------+--------------------------------------------+
| :math:`m` | Ploidy |
+-------------------+--------------------------------------------+
| :math:`g` | Genotype (the number of reference alleles) |
+-------------------+--------------------------------------------+
.. note::
The way we are modeling the likelihoods probability **differs** a little
bit from the **SNP calling** model: We are **treating** the *read* as the
**unit**, not the *base*, therefore the error (:math:`\epsilon`) is the
**mapping** quality (fifth column of SAM file), instead of the
**sequencing** quality.
So we can summarize the formula for homozygous reference (HOR), heterozygous
(HET) and homozygous alternate (HOA):
HOR
.. math::
\delta(HOR) =
\frac{1}{2^k}
\prod_{j=1}^{l} 2(1-\epsilon_{j})
\prod_{j=l+1}^{k} 2\epsilon_{j}
HET
.. math::
\delta(HET) =
\frac{1}{2^k}
HOA
.. math::
\delta(HOA) =
\frac{1}{2^k}
\prod_{j=1}^{l} 2\epsilon_{j}
\prod_{j=l+1}^{k} 2(1-\epsilon_{j})
We determine the insertion point site according to the abnormal alignments
clustering. Those *reads* will be used as the :math:`k - l` rest of the
*reads* which differs from reference genome. In order to verify if there
is evidence of reference allele, we need to come back to the SAM file and
check for the presence of *reads* **crossing** the insertion point. To
**mitigate** alignment error - which would otherwise overestimate the
number of reference allele *reads* - we select the *reads* that cover one
**decile** of *read* length window containing the insertion point. Then we
come to the :math:`l` *reads* **identical** to the reference genome and can
calculate the **genotype likelihoods**.
.. image:: images/genotype.png
:scale: 25%
:align: center
Orientation
===========
Other important information that can be obtained from the data is the
retrocopy **orientation** in relation to its parental gene. The abnormal
alignment *reads* give us the clue to solve this issue. We catch *reads*
when one pair aligns against an exon and its mate aligns to some genomic
region, so we can **sort** the *reads* from the exonic site and analyze
if their mates will be sorted in **ascending** or **descending** order as
result. If we observe that they are **directly** proportional, then we can
assume that the retrocopy is at the **same** parental gene strand, else they
are at **opposite** strands.
.. warning::
This approach disregards the fact that there may have been structural variations,
such as chromosomal inversions, which may invalidate these results.
Therefore summarizing:
* Retrocopy and its parental gene are at the same strand
.. image:: images/orientation_same_strand.png
:scale: 25%
:align: center
* Retrocopy and its parental gene are at opposite strands
.. image:: images/orientation_opposite_strand.png
:scale: 25%
:align: center
Spearman's rank correlation coefficient
---------------------------------------
We use *Spearman's rank correlation coefficient* [4]_ in order to have a
**measure** of relationship between *reads* from exon and their mates from
clustering site. As our data is **nonparametric**, the Spearman's rho can
assess **monotonic** relationship, that is, it can tell us if the genomic
position of *reads* from exon **increases** when **does** the genomic
position of their *mates* from clustering site (positive rho) - or the
opposite (negative rho).
So we come to the following proposition:
+----------------------+---------+---------+
| | Retrocopy strand |
| Parental gene strand +---------+---------+
| | rho > 0 | rho < 0 |
+======================+=========+=========+
| \+ | \+ | \- |
+----------------------+---------+---------+
| \- | \- | \+ |
+----------------------+---------+---------+
References and Further Reading
==============================
.. [1] Ester, Martin. (1996).
A Density-Based Algorithm for Discovering Clustersin Large Spatial Databases with Noise.
KDD. Available at https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf.
.. [2] hts-specs. (2019).
The Variant Call Format (VCF) Version 4.2 Specificatio.
Available at https://samtools.github.io/hts-specs/VCFv4.2.pdf.
.. [3] Li, Heng (2011).
A statistical framework for SNP calling, mutation discovery, association mapping and
population genetical parameter estimation from sequencing data.
Oxford University Press.
.. [4] Fieller, E. C., et al. (1957).
Tests for Rank Correlation Coefficients. I.
Biometrika, 44(3/4), 470–481. JSTOR.
Available at https://www.jstor.org/stable/2332878.
|