File: Introduction_to_MutationalPatterns.Rmd

package info (click to toggle)
r-bioc-mutationalpatterns 3.16.0%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 5,360 kB
  • sloc: sh: 8; makefile: 2
file content (1691 lines) | stat: -rw-r--r-- 68,095 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
---
title: "Introduction to MutationalPatterns"
author:
- name: Freek Manders
  affiliation:
    - Princess Maxima Center, Utrecht, The Netherlands
  email: F.M.Manders@prinsesmaximacentrum.nl
- name: Francis Blokzijl
  affiliation:
    - University Medical Center Utrecht, Utrecht, The Netherlands
- name: Roel Janssen
  affiliation:
    - University Medical Center Utrecht, Utrecht, The Netherlands
- name: Rurika Oka
  affiliation:
    - Princess Maxima Center, Utrecht, The Netherlands
- name: Jurrian de Kanter
  affiliation:
    - Princess Maxima Center, Utrecht, The Netherlands
- name: Mark van Roosmalen
  affiliation:
    - Princess Maxima Center, Utrecht, The Netherlands
  email: vanBoxtelBioinformatics@prinsesmaximacentrum.nl
- name: Ruben van Boxtel
  affiliation:
    - Princess Maxima Center, Utrecht, The Netherlands
- name: Edwin Cuppen
  affiliation:
    - University Medical Center Utrecht, Utrecht, The Netherlands
package: "MutationalPatterns"
output: 
    BiocStyle::html_document
bibliography: references.bib
vignette: >
    %\VignetteIndexEntry{Introduction to MutationalPatterns}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}  
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```



```{r, echo=FALSE}
options(width = 96)
library(ggplot2)
library(BiocStyle)
```

# Introduction

Mutational processes leave characteristic footprints in genomic DNA. This
package provides a comprehensive set of flexible functions that allows
researchers to easily evaluate and visualize a multitude of mutational patterns
in base substitution catalogues of e.g. healthy samples, tumour samples, or
DNA-repair deficient cells. This is the second major version of the package.
Many new functions have been added and functions from the previous version have
been enhanced. The package covers a wide range of patterns including: mutational
signatures, transcriptional and replicative strand bias, lesion segregation,
genomic distribution and association with genomic features, which are
collectively meaningful for studying the activity of mutational processes. The
package works with single nucleotide variants (SNVs), insertions and deletions
(Indels), double base substitutions (DBSs) and larger multi base substitutions
(MBSs). The package provides functionalities for both extracting mutational
signatures *de novo* and determining the contribution of previously identified
mutational signatures on a single sample level. MutationalPatterns integrates
with common R genomic analysis workflows and allows easy association with
(publicly available) annotation data.

Background on the biological relevance of the different mutational patterns, a
practical illustration of the package functionalities, comparison with similar
tools and software packages and an elaborate discussion, are described in the
new MutationalPatterns article, which is in currently being written. The old article can be
found [here](https://doi.org/10.1186/s13073-018-0539-0).

This vignette shows some common ways in which the functions in this package can
be used. It is however not exhaustive and won't show every argument of every
function. You can view the documentation of a function by adding a `?` in front
of it. Like: `?plot_spectrum`. The describes the functions and all their
arguments in more detail. It also contains more examples of how the functions in
this package can be used.

# Installation

First you need to install the package from `Bioconductor`.
```{r install_package, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MutationalPatterns")
```


You also need to load the package.
This needs to be repeated every time you restart `R`.
```{r Load package, message=FALSE}
library(MutationalPatterns)
```

# Data

To perform the mutational pattern analyses, you need to load one or multiple
VCF files with variant calls and the corresponding reference
genome.

## List reference genome

You can list available genomes using `r Biocpkg("BSgenome")`:
```{r, message=FALSE}
library(BSgenome)
head(available.genomes())
```

Make sure to install the reference genome that matches your VCFs.
For the example data this is `BSgenome.Hsapiens.UCSC.hg19`.

Now you can load your reference genome:
```{r}
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)
```

## Load example data SNVs

We provided two example data sets with this package. One consists of a subset of
somatic SNV catalogues of 9 normal human adult stem cells from 3 different
tissues (Blokzijl et al. 2016), and the other contains somatic indels and DBSs
from 3 healthy human hematopoietic stem cells (Osorio et al. 2018). The MBS data
you will find in the latter dataset was manually included by us to demonstrate
some features of this package.

This is how you can locate the VCF files of the example data from the first set.

These will be used for the SNV examples:
```{r locate_vcfs}
vcf_files <- list.files(system.file("extdata", package = "MutationalPatterns"),
  pattern = "sample.vcf", full.names = TRUE
)
```

You also need to define corresponding sample names for the VCF files:
```{r set_sample_names}
sample_names <- c(
  "colon1", "colon2", "colon3",
  "intestine1", "intestine2", "intestine3",
  "liver1", "liver2", "liver3"
)
```

Now you can load the VCF files into a `GRangesList`:
```{r read_vcfs_as_granges, message=FALSE}
grl <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)
```

Here we define relevant metadata on the samples, such as tissue type. 
This will be useful later.
```{r store tissue variable}
tissue <- c(rep("colon", 3), rep("intestine", 3), rep("liver", 3))
```


## Load example data indels, DBSs and MBSs

We will now locate the VCF files of the example data from the second set. 
These will be used for the indels, DBS and MBS examples.
```{r locate blood data}
blood_vcf_fnames <- list.files(
  system.file("extdata", package = "MutationalPatterns"), 
  pattern = "blood.*vcf", full.names = TRUE)
```

Set their sample names.
```{r set blood data sample names}
blood_sample_names <- c("blood1", "blood2", "blood3")
```

Read in the data, without filtering for any mutation type using the `type="all"`
argument.
(By default only SNVs are loaded for backwards compatibility.)
```{r read blood data}
blood_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "all")
```

You can now retrieve different types of mutations from the `GrangesList`.
```{r get mut types}
snv_grl <- get_mut_type(blood_grl, type = "snv")
indel_grl <- get_mut_type(blood_grl, type = "indel")
dbs_grl <- get_mut_type(blood_grl, type = "dbs")
mbs_grl <- get_mut_type(blood_grl, type = "mbs")
```

It's also possible to directly select for a specific mutation type when reading
in the data. This can be a convenient shortcut, when you are only interested in
a single type of mutation.
```{r read indels, eval=FALSE}
indel_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "indel")
```

By default the package assumes that DBS and MBS variants are present in your
vcfs as separate neighbouring SNVs. MutationalPatterns merges these to get DBS
and MBS variants. If DBS and MBS variants have already been defined in your vcf
or if you don't want any variants to be merged, then you can use the
`predefined_dbs_mbs` argument, when using `read_vcfs_as_granges` or
`get_mut_type`.
(In this example the result will be empty, because the DBS variants were not predefined)
```{r read predefined dbs, eval=FALSE}
predefined_dbs_grl <- read_vcfs_as_granges(blood_vcf_fnames, blood_sample_names, 
                                  ref_genome, type = "dbs",
                                  predefined_dbs_mbs = TRUE)
```


# Mutation characteristics

## SNVs

### Base substitution types

You can retrieve base substitution types from the VCF GRanges object as "REF>ALT"
using `mutations_from_vcf`:

```{r mutations_from_vcf}
muts <- mutations_from_vcf(grl[[1]])
head(muts, 12)
```

You can retrieve the base substitutions from the VCF GRanges object and convert
them to the 6 types of base substitution types that are distinguished by
convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, when the reference
allele is G and the alternative allele is T (G>T), `mut_type`
returns the G:C>T:A mutation as a C>A mutation:

```{r mut_type}
types <- mut_type(grl[[1]])
head(types, 12)
```

To retrieve the sequence context (one base upstream and one base downstream) of
the base substitutions in the VCF object from the reference genome, you can use
the `mut_context` function:

```{r mut_context}
context <- mut_context(grl[[1]], ref_genome)
head(context, 12)
```

With`type_context`, you can retrieve the types and contexts
for all positions in the VCF GRanges object. For the base substitutions that are
converted to the conventional base substitution types, the reverse complement of
the sequence context is returned.

```{r type_context}
type_context <- type_context(grl[[1]], ref_genome)
lapply(type_context, head, 12)
```

With `mut_type_occurrences`, you can count mutation type
occurrences for all VCF objects in the `GRangesList`. For
C>T mutations, a distinction is made between C>T at CpG sites and other
sites, as deamination of methylated cytosine at CpG sites is a common mutational
process. For this reason, the reference genome is needed for this functionality.

```{r mut_type_occurrences}
type_occurrences <- mut_type_occurrences(grl, ref_genome)
type_occurrences
```

### Mutation spectrum

A mutation spectrum shows the relative contribution of each mutation type in
the base substitution catalogs. The `plot_spectrum` function plots
the mean relative contribution of each of the 6 base substitution types over
all samples. Error bars indicate the 95% confidence interval over all samples. 
The total number of mutations is indicated.

```{r plot_spectrum}
p1 <- plot_spectrum(type_occurrences)
```

You can also plot the mutation spectrum with distinction
between C>T at CpG sites and other sites:
```{r plot_spectrum_2}
p2 <- plot_spectrum(type_occurrences, CT = TRUE)
```

Other options include plotting the spectrum with the individual samples as 
points and removing the legend to save space:
```{r plot_spectrum_3}
p3 <- plot_spectrum(type_occurrences, CT = TRUE, 
                    indv_points = TRUE, legend = FALSE)
```

Here we use the `r CRANpkg("gridExtra")` package to combine the created plots, 
so you can see them next to each other.
```{r combine_plot_spectrum, eval=TRUE, fig.wide = TRUE, message=FALSE}
library("gridExtra")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(3, 3, 1.75))
```


It's also possible to create a facet per sample group, e.g. plot the spectrum 
for each tissue separately:
```{r plot_spectrum_4}
p4 <- plot_spectrum(type_occurrences, by = tissue, CT = TRUE, legend = TRUE)
```

Or you could use the standard deviation instead of a 95% confidence interval:
```{r plot_spectrum_5}
p5 <- plot_spectrum(type_occurrences, CT = TRUE, 
                    legend = TRUE, error_bars = "stdev")
```


```{r combine_plot_spectrum_2, fig.wide=TRUE, message=FALSE}
grid.arrange(p4, p5, ncol = 2, widths = c(4, 2.3))
```

### 96 mutational profile

First you should make a 96 trinucleotide mutation count matrix.
(In contrast to previous versions this also works for single samples.)
```{r mut_matrix}
mut_mat <- mut_matrix(vcf_list = grl, ref_genome = ref_genome)
head(mut_mat)
```

Next, you can use this matrix to plot the 96 profile of samples. 
In this example we do this for 2 samples:
```{r plot_96_profile, fig.wide=TRUE}
plot_96_profile(mut_mat[, c(1, 7)])
```


### Larger contexts

It's also possible to look at larger mutational contexts. 
However, this is only useful if you have a large number of mutations.
```{r mut_mat_extendend_context}
mut_mat_ext_context <- mut_matrix(grl, ref_genome, extension = 2)
head(mut_mat_ext_context)
```
The `extension` argument also works for the `mut_context` and `type_context` functions.

You can visualize this matrix with a heatmap.
```{r plot_profile_heatmap, fig.wide=TRUE}
plot_profile_heatmap(mut_mat_ext_context, by = tissue)
```

You can also visualize this with a riverplot.
```{r riverplot, fig.wide=TRUE}
plot_river(mut_mat_ext_context[,c(1,4)])
```

## Indels

First you should get the COSMIC indel contexts. This is done with
`get_indel_context`, which adds the columns `muttype` and `muttype_sub` to the
`GRangesList`.
The `muttype` column contains the main type of indel. The `muttype_sub` column
shows the number of repeat units. For microhomology (mh) deletions the mh length
is shown.
```{r get_indel_context}
indel_grl <- get_indel_context(indel_grl, ref_genome)
head(indel_grl[[1]], n = 3)
```

Next count the number of indels per type. This results in a matrix that is 
similar to the `mut_mat` matrix.
```{r count_indel_contexts}
indel_counts <- count_indel_contexts(indel_grl)
head(indel_counts)
```

Now you can plot the indel spectra. The facets at the top show the indel types. 
First the C and T deletions.
Then the C and T insertions. Next are the multi base deletions and insertions. 
Finally the deletions with microhomology are shown.
The x-axis at the bottom shows the number of repeat units. For mh deletions the 
microhomology length is shown.
```{r, fig.wide=TRUE}
plot_indel_contexts(indel_counts, condensed = TRUE)
```

You can also choose to only plot the main contexts, 
without taking the number of repeat units or microhomology length into account.
```{r, fig.wide=TRUE}
plot_main_indel_contexts(indel_counts)
```

## DBSs

First get the COSMIC DBS contexts. This is done by changing the `REF` and `ALT` 
columns of the `GRangesList`.
```{r get_DBS_contexts}
head(dbs_grl[[1]])
dbs_grl <- get_dbs_context(dbs_grl)
head(dbs_grl[[1]])
```

Next count the number of DBSs per type. 
This again results in a matrix that is similar to the `mut_mat` matrix.
```{r count_DBS_contexts}
dbs_counts <- count_dbs_contexts(dbs_grl)
```

Finally we can plot the DBS contexts. 
The facets at the top show the reference bases. 
The x-axis shows the alternative variants.
```{r plot_DBS_contexts, fig.wide=TRUE}
plot_dbs_contexts(dbs_counts, same_y = TRUE)
```

We can also choose to plot based on only the reference bases. 
Now the x-axis contains the reference bases.
```{r plot_main_DBS_contexts, fig.wide=TRUE}
plot_main_dbs_contexts(dbs_counts, same_y = TRUE)
```

## MBSs

No COSMIC MBS contexts existed when this vignette was written. 
Therefore the length of the MBSs is used as its context.
First we can count the MBSs. 
This again results in a matrix that is similar to the `mut_mat` matrix.
```{r count_MBS_contexts}
mbs_counts <- count_mbs_contexts(mbs_grl)
```

Next we can plot the contexts
```{r plot_MBS_contexts}
plot_mbs_contexts(mbs_counts, same_y = TRUE)
```

## Pooling samples

Sometimes you have very few mutations per sample. 
In these cases it might be useful to combine multiple samples.
This can be done with `pool_mut_mat`. 
This works on the matrixes of SNVs, indels, DBSs and MBSs.
```{r pool_mut_mat}
pooled_mut_mat <- pool_mut_mat(mut_mat, grouping = tissue)
head(pooled_mut_mat)
```

# Mutational signatures

Mutational signatures are thought to represent mutational processes, and are
characterized by a specific contribution of mutation types with a
certain sequence context. 
Mutational signatures can be extracted *de novo* from your
mutation count matrix, with non-negative matrix factorization (NMF). 
It's also possible to identify the exposure of your mutation count matrix to 
previously defined mutational signatures. 
This is often referred to as signature refitting. 
NMF is most useful for large amounts of samples, 
while signature refitting can also be used on single samples. 
We will first discuss NMF and then signature refitting. 
Finally we will discuss analyzing the similarity between a mutational profile 
and signatures directly.

## *De novo* mutational signature extraction using NMF

### NMF

A critical parameter in NMF is the factorization rank, which is the number of
mutational signatures you extract. You can determine the optimal factorization
rank using the `r CRANpkg("NMF")` package [@Gaujoux2010]. As described in
their paper:

``...a common
way of deciding on the rank is to try different values, compute some quality
measure of the results, and choose the best value according to this quality
criteria. The most common approach is to choose the smallest rank for which
cophenetic correlation coefficient starts decreasing. Another approach is to
choose the rank for which the plot of the residual sum of squares (RSS) between
the input matrix and its estimate shows an inflection point.''

In general, larger datasets allow you to use a higher rank.
Here we will show NMF for SNVs. Performing NMF on other mutation types works the
same way.

First add a small pseudocount to your mutation count matrix:
```{r psuedo_count}
mut_mat <- mut_mat + 0.0001
```

Use the NMF package to generate an estimate rank plot. 
This can take a long time:
```{r use_nmf}
library("NMF")
estimate <- nmf(mut_mat, rank = 2:5, method = "brunet", 
                nrun = 10, seed = 123456, .opt = "v-p")
```

And plot it:
```{r estimate_rank, fig.wide=TRUE}
plot(estimate)
```

Extract mutational signatures from the mutation count matrix with
`extract_signatures`. In this example 2 signatures are extracted, because a rank
of 2 is used. (For larger datasets it is wise to perform more iterations by
changing the nrun parameter to achieve stability and avoid bad local minima):
```{r extract_signatures}
nmf_res <- extract_signatures(mut_mat, rank = 2, nrun = 10, single_core = TRUE)
```

NMF also works on other mutation types like indels and DBS. You can even combine
matrixes from different mutation types to, for example, extract combined
indel/DBS signatures.
```{r extract_signatures_snv_indel}
combi_mat = rbind(indel_counts, dbs_counts)
nmf_res_combi <- extract_signatures(combi_mat, rank = 2, nrun = 10, single_core = TRUE)
```



### Bayesian NMF

It's also possible to use variational bayes NMF. This could make it easier to
determine, the correct rank. To do this you need to install the 
`r Biocpkg("ccfindR")` package. You can then determine the optimal number of
signatures, which can again take a long time. Warnings will occur when you use
ranks that are too high. (In this example we avoid these warnings by
using `nrun=1`, combined with a set seed. In practice you shouldn't use a rank
that's too high and you should also use a higher number for nrun.) With a larger
dataset you could try higher ranks. The highest value in the plot is the
mathematically optimal number of signatures. (A note of warning: The
mathematically optimal number doesn't necessarily make biological sense.)
```{r use_bayesnmf, message=FALSE}
# BiocManager::install("ccfindR")
library("ccfindR")
sc <- scNMFSet(count = mut_mat)
set.seed(4)
estimate_bayes <- vb_factorize(sc, ranks = 1:3, nrun = 1, 
                               progress.bar = FALSE, verbose = 0)
plot(estimate_bayes)
```

Extracting the signatures is then done by:
```{r extract_signatures_bayes}
nmf_res_bayes <- extract_signatures(mut_mat, rank = 2, nrun = 10, 
                                    nmf_type = "variational_bayes")
```

### Changing the names of the extracted signatures

You can provide the extracted signatures with custom names:
```{r add_column_names}
colnames(nmf_res$signatures) <- c("Signature A", "Signature B")
rownames(nmf_res$contribution) <- c("Signature A", "Signature B")
```

It's possible that some of the signatures extracted by NMF are very similar to
signatures that are already known. Therefore, it might be useful to change the
names of the NMF signatures to these already known signatures. This often makes
it easier to interpret your results.

To do this you first need to read in some already existing signatures. Here we
will use signatures from [COSMIC](https://cancer.sanger.ac.uk/signatures)
(v3.2) [@Alexandrov2020]. (We will discuss how to use other signature matrixes
later.)
```{r read_signatures}
signatures = get_known_signatures()
```

You can now change the names of the signatures extracted by NMF. In this example
the name of a signature is changed if it has a cosine similarity of more than
0.85 with an existing COSMIC signature.
```{r change_names_NMF_sigs}
nmf_res <- rename_nmf_signatures(nmf_res, signatures, cutoff = 0.85)
colnames(nmf_res$signatures)
```

We now see that the signatures we extracted are very similar to COSMIC
signatures SBS1 and SBS5. This helps with the interpretation because the
aetiology of SBS1 is already known. This also tells us we didn't identify any
completely novel processes.
An extracted signature that is not similar to any previously defined signatures,
is not proof of a "novel" signature. The extracted signature might be a
combination of known signatures, that could not be split by NMF. This can happen
when, for example, too few samples were used for the NMF.

### Visualizing NMF results

You can plot the 96-profile of the signatures (When looking at SNVs):
```{r plot_96_profile_signatures, fig.wide=TRUE}
plot_96_profile(nmf_res$signatures, condensed = TRUE)
```

You can visualize the contribution of the signatures in a barplot:
```{r plot_contribution}
plot_contribution(nmf_res$contribution, nmf_res$signature,
  mode = "relative"
)
```


The relative contribution of each signature for each sample can also be plotted
as a heatmap with `plot_contribution_heatmap`, which might be easier to
interpret and compare than stacked barplots. The signatures and samples can be
hierarchically clustered based on their euclidean distance. Clustering here is
based on the similarity between the contributions. (Signatures with a similar
contribution will thus be clustered together. The same applies for samples.)

Plot signature contribution as a heatmap with sample and signature clustering
dendrograms:
```{r plot_contribution_heatmap_clust, fig.wide=TRUE}
plot_contribution_heatmap(nmf_res$contribution, 
                          cluster_samples = TRUE, 
                          cluster_sigs = TRUE)
```

It's also possible to provide your own signature and sample order. This can be a
manual ordering, but in this example we use clustering. We can cluster the
signatures based on their cosine similarity and then retrieve the order:
```{r cluster signatures}
hclust_signatures <- cluster_signatures(nmf_res$signatures, method = "average")
signatures_order <- colnames(nmf_res$signatures)[hclust_signatures$order]
signatures_order
```

We can do the same thing for the samples:
```{r cluster samples}
hclust_samples <- cluster_signatures(mut_mat, method = "average")
samples_order <- colnames(mut_mat)[hclust_samples$order]
samples_order
```

Now we can use the signature and sample order in the contribution heatmap:
```{r plot_contribution_heatmap_order, fig.wide=TRUE}
plot_contribution_heatmap(nmf_res$contribution,
  sig_order = signatures_order, sample_order = samples_order,
  cluster_sigs = FALSE, cluster_samples = FALSE
)
```

A reconstructed mutational profile has been made for each sample by the NMF,
based on the signatures and their contribution. The better the NMF worked the
more similar the reconstructed profile will be to the original profile.

We can compare the reconstructed mutational profile with the original mutational
profile for a single sample like this:
```{r plot_compare_profiles}
plot_compare_profiles(mut_mat[, 1],
  nmf_res$reconstructed[, 1],
  profile_names = c("Original", "Reconstructed"),
  condensed = TRUE
)
```
This is the function for SNVs. For indels you would use `plot_compare_indels`,
for DBSs, `plot_compare_dbs` and for MBSs `plot_compare_mbs`.

We can also plot the cosine similarity between the original and reconstructed
matrix for all the samples. When a reconstructed profile has a cosine similarity
of more than 0.95 with the original, the reconstructed profile is considered
very good.
```{r plot_ori_vs_rec}
plot_original_vs_reconstructed(mut_mat, nmf_res$reconstructed, 
                               y_intercept = 0.95)
```


## Signature refitting

### Find mathematically optimal contribution of COSMIC signatures

Signature refitting quantifies the contribution of any set of signatures to the
mutational profile of a sample. This is specifically useful for mutational
signature analyses of small cohorts or individual samples, but also to relate
own findings to known signatures and published findings. The
`fit_to_signatures` function finds the optimal linear combination of mutational
signatures that most closely reconstructs the mutation matrix by solving a
non-negative least-squares constraints problem. It can work with a SNV, indel,
DBS or other type of count matrix.

Fit mutation matrix to the COSMIC mutational signatures:
```{r fit_to_signatures}
fit_res <- fit_to_signatures(mut_mat, signatures)
```

The `fit_res` object can be visualized similarly to the `nmf_res` object. The
functions `plot_contribution`, `plot_contribution_heatmap`,
`plot_compare_profiles` and `plot_original_vs_reconstructed` will all work. As
an example we show the contribution of signatures as a barplot.
```{r plot_contribution_refit}
plot_contribution(fit_res$contribution,
  coord_flip = FALSE,
  mode = "absolute"
)
```

We also show the cosine similarity with the reconstructed profiles, as this
gives a good idea of how well the signatures could explain the mutational
profiles.
```{r plot_ori_vs_rec_fit}
plot_original_vs_reconstructed(mut_mat, fit_res$reconstructed, 
                               y_intercept = 0.95)
```

### Stricter refitting

In the previous plots, many signatures were used to explain the mutational
profiles of the samples. It seems however unlikely that this many mutational
processes were really active in these samples. This problem, known as
[overfitting](https://en.wikipedia.org/wiki/Overfitting), occurs because
`fit_to_signatures` finds the optimal combination of signatures to reconstruct a
profile. It will use a signature, even if it improves the fit very little.

Another issue with signature refitting is signature misattribution. Mutations
will sometimes be attributed to different signatures in samples with a similar
mutational profile. This can give the impression that samples are very
different, when they actually aren't. This is often the result of "flat"
signatures, which are harder to fit. Signatures that are similar to each other
can also cause this misattribution issue.

One way to deal with overfitting and the misattribution of signatures is by
selecting a limited number of signatures that will be used for the refitting.
When you are analyzing a liver sample you could for example only use signatures
that are known to occur in liver. This method is recommended by @Degasperi2020.
Using prior knowledge like this will reduce overfitting, but can also introduce
bias. You won't be able to identify signatures, if you removed them beforehand.
Another downside of this method is that you need prior knowledge of which
signatures could be present. We recommend using this method when possible.


Another way of dealing with overfitting is by starting with a standard refit
and then removing signatures that have little effect on how well a mutational
profile can be reconstructed. This works in an iterative fashion. In each
iteration the signature with the lowest contribution is removed and refitting is
repeated. Each time the cosine similarity between the original and reconstructed
profile is calculated. You stop removing signatures when the difference between
two iterations becomes bigger than a certain cutoff. This way only the
signatures that are really necessary to explain a mutational profile will be
used. This method is similar to a method used by @Alexandrov2020. In
MutationalPatterns it can be used with `fit_to_signatures_strict`.


A downside of this method is that the cutoff you should use is somewhat
subjective and depends on the data. Here we use a cutoff of 0.004. Decreasing
this number will make the refitting less strict, while increasing it will make
the refitting more strict. Trying out different values can sometimes be useful
to achieve the best results.
```{r fit_to_signatures_strict}
strict_refit <- fit_to_signatures_strict(mut_mat, signatures, max_delta = 0.004)
```

This function returns a list containing a `fit_res` object and a list of
figures, showing in what order signatures were removed during the refitting.

Here we show the figure for one sample. The x-axis shows the signature that was
removed during that iteration. The red bar indicates that the difference in
cosine similarity has become too large. The removal of signatures is stopped and
SBS1 is kept for the final refit.
```{r strict_refit_decay, fig.wide=TRUE}
fig_list <- strict_refit$sim_decay_fig
fig_list[[1]]
```

The fit_res can be visualized in the same way as other `fit_res` objects.

```{r plot_contribution_refit_strict}
fit_res_strict <- strict_refit$fit_res
plot_contribution(fit_res_strict$contribution,
  coord_flip = FALSE,
  mode = "absolute"
)
```

By default `fit_to_signatures_strict` uses the "backwards" selection approach
described above. However, it is also possible to use a "best subset" approach.
The benefit of this method is that it can be more accurate than the "backwards"
approach. However, it becomes computationally infeasible when using many
signatures. Therefore it should only be used on small signature sets (max 10-15
signatures), like tissue specific signatures. 
The "best subset" approach works similarly to the "backwards" approach. This
approach again starts with a standard refit. The refitting is then repeated for
each combination of n-1 signatures, where n is the total number of signatures.
In other words, if you started with 10 signatures, the refitting is repeated 10
times, with a different signature being removed each time. The combination of
signatures that has the best cosine similarity between the original and
reconstructed profile is chosen. This is done in an iterative fashion for n-2,
n-3, ect. You stop removing signatures when the difference between two
iterations becomes bigger than a certain cutoff, just like with the backwards
method.

We randomly selected a few signatures for this example, to keep the runtime low.
In practice, signatures should be selected based on prior knowledge.
```{r fit_to_signatures_strict_best_subset}
best_subset_refit <- fit_to_signatures_strict(mut_mat, signatures[,1:5], max_delta = 0.002, method = "best_subset")
```

A third method that can reduce overfitting and the misattribution of signatures
is to merge similar signatures. This works by merging signatures whose cosine
similarity is higher than a certain cutoff value. These merged signatures can
then be used for refitting. The benefit of this method is that you don't need
prior knowledge. For most common use-cases, we don't recommend this method,
because it is less conventional and can be harder to interpret. However, we
provide it here to give you the possibility to use it if you need it. You can
merge signatures like this:

```{r Merge_signatures}
merged_signatures <- merge_signatures(signatures, cos_sim_cutoff = 0.8)
```

The best refitting method will depend on your data and research question. A
single method can be used, but it's also possible to combine several methods.


### Bootstrapped refitting.

The stability of signature refitting can be suboptimal, because of the
previously mentioned signature misattribution. Bootstrapping can be used to
verify how stable the refitting is [@Huang2018]. A more stable refit provides
more confidence in the results. It works by making small changes to the
mutational profile of a sample. These changes are made by resampling mutations
with replacement using the samples own mutational profile as weights. The number
of sampled mutations is the same as the number of mutations that was originally
in the profile. This process is by default repeated 1000 times. A signature
refit is performed for each iteration, resulting in an estimate of the refitting
stability. In MutationalPatterns bootstrapping can be done with
`fit_to_signatures_bootstrapped`.

This function can be used with the standard and strict refitting methods
described previously. Here we will use the "strict" method on two samples.
(We only use 50 bootstraps here to reduce the run time and figure size.)
```{r fit_to_signatures_bootstrapped, message=FALSE}
contri_boots <- fit_to_signatures_bootstrapped(mut_mat[, c(3, 7)],
  signatures,
  n_boots = 50,
  method = "strict"
)
```

You can visualize the bootstrapped refitting like this. Each dot is one
bootstrap iteration.
```{r plot_bootstrapped_refit}
plot_bootstrapped_contribution(contri_boots)
```

You can also visualize this using the relative contribution and a dotplot.
Here, the color of the dot shows the percentage of iterations in which the signature is found (contribution > 0),
and the size of the dot represents the average contribution of that signature (in the iterations in which the contribution was higher than 0).
```{r plot_bootstrapped_refit_rel_boxtplot}
plot_bootstrapped_contribution(contri_boots, 
                               mode = "relative", 
                               plot_type = "dotplot")
```

We can see that SBS1 is relatively stable in the first sample. However, SBS5 is
very unstable in the second sample. This instability is likely the result of
SBS5 being very flat.

You can also plot the correlation between signatures. A negative correlation
between two signatures means that their contributions were high in different
bootstrap iterations. Here we will visualize this correlation for one sample.
```{r plot_correlation_signatures}
fig_list <- plot_correlation_bootstrap(contri_boots)
fig_list[[2]]
```

Here we can see that SBS5 and SBS40 have a negative correlation. This makes
sense because they are both flat signatures that are very similar to each other.
As a result the refitting process has difficulty distinguishing them.

## Similarity between mutational profiles and signatures

Instead of performing NMF or fitting signatures to a profile, you can also look
at their similarity. This circumvents the issues that exist with NMF and
signature refitting. However, looking at similarities doesn't allow us to
separate the different signatures that have contributed to a mutational profile.
When multiple signatures have contributed to a profile, the similarities between
this profile and the individual signatures can also become diluted.

You can calculate the similarity between two mutational profiles / signatures
like this:
```{r Calculate_cossim_single}
cos_sim(mut_mat[, 1], signatures[, 1])
```

You can also calculate the similarity between multiple mutational profiles /
signatures:
```{r Calculate_cossim}
cos_sim_samples_signatures <- cos_sim_matrix(mut_mat, signatures)
cos_sim_samples_signatures[1:3, 1:3]
```

You can visualize this with a heatmap using `plot_cosine_heatmap`. This function
has the same clustering options as `plot_contribution_heatmap`, which we
discussed earlier.
```{r Plot_cosine_heatmap}
plot_cosine_heatmap(cos_sim_samples_signatures, 
                    cluster_rows = TRUE, cluster_cols = TRUE)
```

It's also possible to look at the cosine similarities between samples.
```{r plot_cosine_heatmap_samples}
cos_sim_samples <- cos_sim_matrix(mut_mat, mut_mat)
plot_cosine_heatmap(cos_sim_samples, cluster_rows = TRUE, cluster_cols = TRUE)
```

## Signature potential damage analysis

Some signatures are more likely than others to have functional effects, by
causing "stop gain" or "mismatch" mutations. With MutationalPatterns it's
possible to analyze how likely it is for a signature to either cause "stop
gain", "mismatch", "synonymous" or "splice site" mutations for a set of genes of
interest. Please take into account that this is a relatively basic analysis,
that only looks at mutational contexts. Other features like open/closed
chromatin are not taken into account. This analysis is meant to give an
indication, not a definitive answer, of how damaging a signature might be.

First you need to load a transcription annotation database and make sure some
dependencies are installed.
```{r get_transcription_annotation_ojbect}
# For example get known genes table from UCSC for hg19 using
# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
# BiocManager::install("AnnotationDbi")
# BiocManager::install("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
```

Next, you need to choose a set of genes and create a vector of Entrez gene ids.
In this example we used a small set to keep the runtime low, but in practice you
can use a larger list of genes, that you are interested in. (The genes used in
this example are: P53, KRAS, NRAS, BRAF, BRCA2, CDKN2A, ARID1A, PTEN and TERT.)
A useful list of cancer genes can be found here: https://cancer.sanger.ac.uk/cosmic/census.
```{r Choose_gene_ids}
gene_ids <- c(7157, 3845, 4893, 673, 675, 1029, 8289, 5728, 7015)
```

Now the ratio of "stop gain", "mismatch", "synonymous" and "splice site" mutations can be
determined per genomic context. The total number of possible mutations per
context is also given. Finally, a blosum62 score is given for the mismatches. A
lower score means that the amino acids in the mismatches are more dissimilar.
More dissimilar amino acids are more likely to have a detrimental effect.
```{r context_potential_damage_analysis}
contexts <- rownames(mut_mat)
context_mismatches <- context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids)
head(context_mismatches)
```

The ratios per context can then be used to get the ratios per signature.
Normalized ratios are also given. These were calculated by dividing the ratios
in each signature, by the ratios of a completely "flat" signature. A normalized
ratio of 2 for "stop gain" mutations, means that a signature is twice as likely
to cause "stop gain" mutations, compared to a completely random "flat"
signature. The total number of possible mutations per context is multiplied with
the signature contribution per context and summed over all contexts. It thus
gives a measure of the amount of mutations that a signature could cause.
```{r}
sig_damage <- signature_potential_damage_analysis(signatures, contexts, context_mismatches)
head(sig_damage)
```


## Using other signature matrixes

So far we have used the SNV signatures from COSMIC. For your convenience we have
also included indel, DBS and transcription strand bias signatures in this
package. Additionally, we included signatures from SIGNAL [@Kucab2019, @Degasperi2020].
These signature matrixes can all be loaded using the
`get_known_signature` function. If you use any of these signature matrixes,
please cite the associated paper. (The papers are listed in the functions
documentation.) A complete list of signature matrixes is shown in the
documentation.


You can choose the mutation type like this:
```{r load_signatures_indels}
signatures_indel = get_known_signatures(muttype = "indel")
signatures_indel[1:5, 1:5]
```

It's also possible to include signatures, that might be artifacts. Including
these signatures can lead to more overfitting. Therefore we recommend against
using them for most analyses. However, these signatures can be useful to see if
your data contains many sequencing artifacts, if you doubt the quality of your
data.
```{r load_signatures_artifacts}
signatures_artifacts = get_known_signatures(incl_poss_artifacts = TRUE)
dim(signatures_artifacts)
```

For the COSMIC signatures it is possible to use a version that is normalized to GRCh38 instead of GRCh37.
```{r load_signatures_grch38}
signatures_GRCh38 = get_known_signatures(genome = "GRCh38")
dim(signatures_GRCh38)
```


You can load the SIGNAL reference signatures like this:
```{r load_signatures_signal}
signatures_signal = get_known_signatures(source = "SIGNAL")
signatures_signal[1:5, 1:5]
```

SIGNAL also contains signatures based on drug exposures:
```{r load_signatures_exposure}
signatures_exposure = get_known_signatures(source = "SIGNAL", sig_type = "exposure")
signatures_exposure[1:5, 1:5]
```

Finally, SIGNAL contains tissue specific signatures:
```{r load_signatures_tissue}
signatures_stomach = get_known_signatures(source = "SIGNAL", sig_type = "tissue", tissue_type = "Stomach")
signatures_stomach[1:5, 1:5]
```

Using an incorrect `tissue_type` will result in an error. This is useful,
because it shows all possible tissue types. (Not run here, to prevent the
error.):
```{r load_signatures_tissue_error, eval=FALSE}
get_known_signatures(source = "SIGNAL", sig_type = "tissue", tissue_type = "?")
```


The contribution of tissue specific signatures can be converted back to SIGNAL
reference signatures. First fit the mutation matrix to tissue specific
signatures:
```{r convert_signatures_to_ref_1}
fit_res_tissue <- fit_to_signatures(mut_mat, signatures_stomach)
fit_res_tissue$contribution[1:5, 1:5]
```

Then convert the contributions to reference signatures:
```{r convert_signatures_to_ref_2}
fit_res_tissue <- convert_sigs_to_ref(fit_res_tissue)
fit_res_tissue$contribution[1:5, 1:5]
```


Instead of using a signature matrix included in this package, you can also download
your own signature matrixes. If you do this you have to make sure that the order
of the mutation types is the same as the MutationalPatterns standard. (You can
use the `match` function for this.)


# Strand bias analyses

## Transcriptional strand bias analysis

For the mutations within genes it can be determined whether the mutation is
on the transcribed or non-transcribed strand, which can be used to evaluate
the involvement of transcription-coupled repair. To this end, it is determined
whether the "C" or "T" base (since by convention we regard base substitutions
as C>X or T>X) are on the same strand as the gene definition. Base substitutions
on the same strand as the gene definitions are considered "untranscribed", and
on the opposite strand of gene bodies as "transcribed", since the gene
definitions report the coding or sense strand, which is untranscribed. No
strand information is reported for base substitution that overlap with more
than one gene body on different strands.

### Gene definitions

Start by getting gene definitions for your reference genome:
```{r get_genes}
genes_hg19 <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes_hg19
```

### Strand bias profile

You can get transcriptional strand information for all positions in the first
VCF object with `mut_strand`. This function returns "-" for positions outside
gene bodies, and positions that overlap with more than one gene on different
strands.

```{r mut_strand}
strand <- mut_strand(grl[[1]], genes_hg19)
head(strand, 10)
```

You can make a mutation count matrix with transcriptional strand information (96
trinucleotides * 2 strands = 192 features). NB: only those mutations that are
located within gene bodies are counted.

```{r mut_matrix_stranded}
mut_mat_s <- mut_matrix_stranded(grl, ref_genome, genes_hg19)
mut_mat_s[1:5, 1:5]
```

You can visualize samples from this matrix like this:
```{r plot_192_spectrum}
plot_192_profile(mut_mat_s[, 1:2])
```

### Strand bias test

You can count the number of mutations on each strand, per tissue, per
mutation type:
```{r strand_occurrences}
strand_counts <- strand_occurrences(mut_mat_s, by = tissue)
head(strand_counts)
```

Next, you can use these counts to perform a Poisson test for strand asymmetry. 
Multiple testing correction is also performed.
```{r strand_bias_test}
strand_bias <- strand_bias_test(strand_counts)
head(strand_bias)
```

Plot the mutation spectrum with strand distinction:
```{r plot_strand}
ps1 <- plot_strand(strand_counts, mode = "relative")
```

Plot the effect size (log2(untranscribed/transcribed) of the strand bias.
Asteriks indicate significant strand bias. Here we use p-values to plot
asterisks. By default fdr is used.
```{r plot_strand_bias}
ps2 <- plot_strand_bias(strand_bias, sig_type = "p")
```

Finally, combine the plots into one figure:
```{r plot_strand_bias_combi, fig.wide=TRUE}
grid.arrange(ps1, ps2)
```

You can change the significance cutoffs for the fdr and p-values. You can use up
to three cutoff levels for each, which changes the number of asteriks in the
`significant` and `significant_fdr` columns. These asteriks will be used in the
plot.
```{r strand_bias_test_notstrict}
strand_bias_notstrict <- strand_bias_test(strand_counts,
  p_cutoffs = c(0.5, 0.1, 0.05),
  fdr_cutoffs = 0.5
)
plot_strand_bias(strand_bias_notstrict, sig_type = "p")
```

## Replicative strand bias analysis

The involvement of replication-associated mechanisms can be evaluated by
testing for a mutational bias between the leading and lagging strand. The
replication strand is dependent on the locations of replication origins from
which DNA replication is fired. However, replication timing is dynamic and
cell-type specific, which makes replication strand determination less
straightforward than transcriptional strand bias analysis. Replication timing
profiles can be generated with Repli-Seq experiments. Once the replication
direction is defined, a strand asymmetry analysis can be performed similarly as
the transcription strand bias analysis. The only difference is that you need 
to use the `replication` mode for the `mut_strand` and `mut_strand_matrix` 
functions.

### Define replication direction

Here we read in an example bed file provided with the package containing 
replication direction annotation:
```{r repli_file}
repli_file <- system.file("extdata/ReplicationDirectionRegions.bed",
  package = "MutationalPatterns"
)
repli_strand <- read.table(repli_file, header = TRUE)
# Store in GRanges object
repli_strand_granges <- GRanges(
  seqnames = repli_strand$Chr,
  ranges = IRanges(
    start = repli_strand$Start + 1,
    end = repli_strand$Stop
  ),
  strand_info = repli_strand$Class
)
# UCSC seqlevelsstyle
seqlevelsStyle(repli_strand_granges) <- "UCSC"
repli_strand_granges
```

This `GRanges` object should have a `strand_info` metadata column, which
contains only two different annotations, e.g. "left" and "right", or
"leading" and "lagging". The genomic ranges cannot overlap, to allow only
one annotation per location. 

The levels of the `strand_info` metadata in the GRanges object determines the
order in which the strands are reported in the mutation matrix that is returned
by `mut_matrix_stranded`, so if you want to count right before left,
you can specify this, before you run `mut_matrix_stranded`:

```{r specify_levels}
repli_strand_granges$strand_info <- factor(repli_strand_granges$strand_info,
  levels = c("right", "left")
)
```

### Replication bias analysis

Now that we defined the replication direction, the rest of the analysis is 
similar to the transcription bias analysis:

You can calculate the strand matrix, counts and bias like this:
```{r mut_matrix_stranded_rep}
mut_mat_s_rep <- mut_matrix_stranded(grl, ref_genome, repli_strand_granges,
  mode = "replication"
)
strand_counts_rep <- strand_occurrences(mut_mat_s_rep, by = tissue)
strand_bias_rep <- strand_bias_test(strand_counts_rep)
```

And then visualize them:
```{r plot_strand_rep, fig.wide=TRUE}
ps1 <- plot_strand(strand_counts_rep, mode = "relative")
ps2 <- plot_strand_bias(strand_bias_rep)
grid.arrange(ps1, ps2)
```


## Signatures with strand bias

Strand bias can be included in signature analyses.
You can for example perform NMF on a mutation count matrix with strand features:
```{r extract_signatures_bias}
nmf_res_strand <- extract_signatures(mut_mat_s, rank = 2, single_core = TRUE)
colnames(nmf_res_strand$signatures) <- c("Signature A", "Signature B")
```


# Genomic distribution

Mutations are not randomly distributed throughout the genome. With
`MutationalPatterns` you can visualize how mutations are distributed throughout
the genome. You can also look at specific genomic regions, such as promoters,
CTCF binding sites and transcription factor binding sites. Within these regions
you can look for enrichment/depletion of mutations and you can look for
differences in the mutational spectra between them.

## Rainfall plot

A rainfall plot visualizes mutation types and intermutation distance. Rainfall
plots can be used to visualize the distribution of mutations along the genome or
a subset of chromosomes. The y-axis corresponds to the distance of a mutation
with the previous mutation and is log10 transformed. Drop-downs from the plots
indicate clusters or "hotspots" of mutations. Rainfall plots can be made for
SNVs, indels, DBSs and MBSs.

In this example we make a rainfall plot over the autosomal chromosomes for 1 
sample:
```{r plot_rainfall, fig.wide=TRUE}
# Define autosomal chromosomes
chromosomes <- seqnames(get(ref_genome))[1:22]

# Make a rainfall plot
plot_rainfall(grl[[1]],
  title = names(grl[1]),
  chromosomes = chromosomes, cex = 1.5, ylim = 1e+09
)
```


## Define genomic regions

To look at specific types of genomic regions you first need to define them in a
named `GRangesList`. You can use your own genomic region definitions (based on
e.g. ChipSeq experiments) or you can use publicly available genomic annotation
data, like in the example below.

The following example displays how to download promoter, CTCF binding sites and
transcription factor binding sites regions for
genome build hg19 from Ensembl using Biocpkg("biomaRt"). For other datasets,
see the `r Biocpkg("biomaRt")` documentation [@Durinck2005]. 
(Remember to install this package before trying to use it.)


Load the Biocpkg("biomaRt") package.
```{r load_biomart}
library(biomaRt)
```

Download genomic regions. NB: Here we take some shortcuts by loading the results
from our example data. The corresponding code for downloading this data can be
found above the command we run:

```{r download_using_biomaRt}
# regulatory <- useEnsembl(biomart="regulation",
#                          dataset="hsapiens_regulatory_feature",
#                          GRCh = 37)

## Download the regulatory CTCF binding sites and convert them to
## a GRanges object.
# CTCF <- getBM(attributes = c('chromosome_name',
#                             'chromosome_start',
#                             'chromosome_end',
#                             'feature_type_name'),
#              filters = "regulatory_feature_type_name",
#              values = "CTCF Binding Site",
#              mart = regulatory)
#
# CTCF_g <- reduce(GRanges(CTCF$chromosome_name,
#                 IRanges(CTCF$chromosome_start,
#                 CTCF$chromosome_end)))

CTCF_g <- readRDS(system.file("states/CTCF_g_data.rds",
  package = "MutationalPatterns"
))

## Download the promoter regions and convert them to a GRanges object.

# promoter = getBM(attributes = c('chromosome_name', 'chromosome_start',
#                                 'chromosome_end', 'feature_type_name'),
#                  filters = "regulatory_feature_type_name",
#                  values = "Promoter",
#                  mart = regulatory)
# promoter_g = reduce(GRanges(promoter$chromosome_name,
#                     IRanges(promoter$chromosome_start,
#                             promoter$chromosome_end)))

promoter_g <- readRDS(system.file("states/promoter_g_data.rds",
  package = "MutationalPatterns"
))

## Download the promoter flanking regions and convert them to a GRanges object.

# flanking = getBM(attributes = c('chromosome_name',
#                                 'chromosome_start',
#                                 'chromosome_end',
#                                 'feature_type_name'),
#                  filters = "regulatory_feature_type_name",
#                  values = "Promoter Flanking Region",
#                  mart = regulatory)
# flanking_g = reduce(GRanges(
#                        flanking$chromosome_name,
#                        IRanges(flanking$chromosome_start,
#                        flanking$chromosome_end)))

flanking_g <- readRDS(system.file("states/promoter_flanking_g_data.rds",
  package = "MutationalPatterns"
))
```

Combine all genomic regions (`GRanges` objects) in a named `GrangesList`:
```{r combine_genomic_regions}
regions <- GRangesList(promoter_g, flanking_g, CTCF_g)

names(regions) <- c("Promoter", "Promoter flanking", "CTCF")
```

Make sure that these regions use the same chromosome naming convention as the
mutation data:
```{r combine_genomic_regions_2}
seqlevelsStyle(regions) <- "UCSC"
```

## Enrichment or depletion of mutations in genomic regions

It is necessary to include a list with `GRanges` of regions that were surveyed
in your analysis for each sample, that is: positions in the genome at which
you have enough high quality reads to call a mutation. This can
be determined using e.g. CallableLoci by GATK. If you would not include the
surveyed area in your analysis, you might for example see a depletion of
mutations in a certain genomic region that is solely a result from a low
coverage in that region, and therefore does not represent an actual depletion
of mutations.

We provided an example surveyed region data file with the package. For
simplicity, here we use the same surveyed file for each sample. For a proper
analysis, determine the surveyed area per sample and use these in your analysis.

Load the example surveyed region data:
```{r download_bed_data}
## Get the filename with surveyed/callable regions
surveyed_file <- system.file("extdata/callableloci-sample.bed",
  package = "MutationalPatterns"
)

## Import the file using rtracklayer and use the UCSC naming standard
library(rtracklayer)
surveyed <- import(surveyed_file)
seqlevelsStyle(surveyed) <- "UCSC"

## For this example we use the same surveyed file for each sample.
surveyed_list <- rep(list(surveyed), 9)
```

First you need to calculate the number of observed and the number of expected 
mutations in each genomic region for each sample.
```{r genomic_distribution}
distr <- genomic_distribution(grl, surveyed_list, regions)
```

Next, you can test for enrichment or depletion of mutations in the defined
genomic regions using a two-sided binomial test. For this test, the chance of
observing a mutation is calculated as the total number of mutations, divided by
the total number of surveyed bases. Multiple testing correction is also
performed. The significance cutoffs for the fdr and p-values can be changed in
the same way as for `strand_bias_test`.
In this example we perform the enrichment/depletion test by tissue type.
```{r enrichment_depletion_test}
distr_test <- enrichment_depletion_test(distr, by = tissue)
head(distr_test)
```

Finally, you can plot the results. Asteriks indicate significant
enrichment/depletion. Here we use p-values to plot asterisks. By default fdr is
used.
```{r plot_enrichment_depletion, fig.wide=TRUE}
plot_enrichment_depletion(distr_test, sig_type = "p")
```


## Mutational patterns of genomic regions

### Split mutations based on genomic regions

You can also look at the mutational patterns of genomic regions. However, keep
in mind that regions with very few mutations will lead to less reliable results.

First you can split the `GRangesList` containing the mutations based on the
defined genomic regions.
```{r split_muts_region}
grl_region <- split_muts_region(grl, regions)
names(grl_region)
```

You could now treat these sample/region combinations as completely separate
samples. You could for example perform NMF on these, to try to identify
signatures that are specific to certain genomic regions.

```{r extract_signatures_regions, fig.wide=TRUE}
mut_mat_region <- mut_matrix(grl_region, ref_genome)
nmf_res_region <- extract_signatures(mut_mat_region, rank = 2, nrun = 10, single_core = TRUE)
nmf_res_region <- rename_nmf_signatures(nmf_res_region, 
                                        signatures, 
                                        cutoff = 0.85)
plot_contribution_heatmap(nmf_res_region$contribution, 
                          cluster_samples = TRUE, 
                          cluster_sigs = TRUE)
```

In this case there don't seem to be any region specific signatures.

### Mutation Spectrum

Instead of treating the sample/region combinations as separate samples, you can
also plot the spectra per genomic region using the `plot_spectrum_region`
function. The arguments of `plot_spectrum` can also be used with this function.
By default the y-axis shows the number of variants divided by the total number
of variants in that sample and genomic region. This way the spectra of regions
with very few mutations can be more easily compared to regions with many
mutations.
```{r plot_spectrum_region, fig.wide=TRUE}
type_occurrences_region <- mut_type_occurrences(grl_region, ref_genome)
plot_spectrum_region(type_occurrences_region)
```

You can also plot the number of variants divided by the total number of variants
in that sample on the y-axis. In this case you don't normalize for the number of
variants per genomic region. As you can see below the vast majority of mutations
in this example occurred in the "other" region.
```{r plot_spectrum_region_yaxis, fig.wide=TRUE}
plot_spectrum_region(type_occurrences_region, mode = "relative_sample")
```

### Mutational profiles

In addition to plotting the spectra you can also plot a mutational profile. To
do this you first need to make a "long" mutation matrix. In this matrix the
different genomic regions are considered as different mutational types, instead
of as different samples like before.
```{r make_long_profile}
mut_mat_region <- mut_matrix(grl_region, ref_genome)
mut_mat_long <- lengthen_mut_matrix(mut_mat_region)
mut_mat_long[1:5, 1:5]
```

You can now plot this using `plot_profile_region`. The arguments of
`plot_96_profile` can also be used with this function. The options for the
y-axis are the same as for `plot_spectrum_region`. However, by default no
normalization is performed for the number of variants per genomic region,
because of the often limited number of mutations per mutation type.
```{r plot_profile_region, fig.wide=TRUE}
plot_profile_region(mut_mat_long[, c(1, 4, 7)])
```

NB: Since the "mut_mat_long" is a mutation matrix, you could perform NMF on it.
This would result in signatures, which will contain different mutation types in
different genomic regions.

### Mutation density

In the examples above we used known features like promoters for the regions.
It's also possible to define regions based on mutation density. You can divide
the genome into 3 bins with different mutation density like this:
```{r bin_mutation_density}
regions_density <- bin_mutation_density(grl, ref_genome, nrbins = 3)
names(regions_density) <- c("Low", "Medium", "High")
```

These regions can then be used in the same way as the previous regions. This can
be useful to, for example, compare the spectrum of regions with kataegis with
that of the rest of the genome.
```{r split_muts_region_density}
grl_region_dens <- split_muts_region(grl, regions_density, include_other = FALSE)
```


## Unsupervised local mutational patterns
Regional mutational patterns can also be investigated using an unsupervised
approach with the `determine_regional_similarity` function. This function uses a
sliding window approach to calculate the cosine similarity between the global
mutation profile and the mutation profile of smaller genomic windows, allowing
for the unbiased identification of regions with a mutation profile, that differs
from the rest of the genome. Because of the unbiased approach of this function,
it works best on a large dataset containing at least 100,000 substitutions.

First we combine all our samples together. Normally, you would only do this for
samples from the same cancer type/tissue, but here we combine everything because of
the limited number of substitutions in our example data.
```{r combine substitutions}
gr = unlist(grl)
```


Next, regions with a mutational pattern that is different from the rest of the
genome are identified. Here we use a small window size, because of the small
size of the example data. In practice a window size of 100 or more works better.
```{r determine_regional_similarity}
regional_sims <- determine_regional_similarity(gr,
                              ref_genome, 
                              chromosomes = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6"),
                              window_size = 40,
                              stepsize = 10,
                              max_window_size_gen = 40000000
)
```

The results of `determine_regional_similarity` can be visualized. Each dot shows
the cosine similarity between the mutation profiles of a single window and the
rest of the genome. A region with a different mutation profile will have a lower
cosine similarity. The dots are colored based on the sizes in mega bases of the
windows. This size is the distance between the first and last mutations in a
window.
```{r plot_regional_similarity}
plot_regional_similarity(regional_sims)
```

# Lesion segregation

Large Watson versus Crick strand asymmetries can sometimes be observed in
mutation spectra [@Aitken2020]. This can be the result of many DNA lesions
occurring during a single cell cycle. For example, many C>T lesions could occur.
If these lesions aren't properly repaired before the next genome duplication,
then the resulting sister chromatids will contain the incorrect "T" nucleotides
only on their parental strand. Incorrect "A" nucleotides will be incorporated on
the newly synthesized strands. These sister chromatids will segregate into
different daughter cells, which will have the C>T variants on different strands.
The majority of mutations will be either on the Watson or the Crick strand. This
process is known as lesion segregation [@Aitken2020]. 

Healthy human cells are 2n. Therefore, a daughter cell could inherit one copy of
a specific chromosome with mutations on the Watson strand and one copy with
mutations on the Crick strand. These will cancel each other out and no strand
bias will be visible. Because the chromosomes segregate independently from each
other, lesion segregation is expected to follow a mendelian inheritance pattern
of 1:2:1. 25% of the chromosomes will have mutations on the Watson strand, 25%
will have mutations on the Crick strand and 50% will show no Watson versus Crick
bias.

## Visualizing lesion segregation

You can visualize possible lesion segregation for a single or multiple samples.
If lesion segregation is present, then it will generally be quite clear that the
mutations are not randomly distributed over the strands. In this example no
lesion segregation is present. ("+" and "-" are used instead of "Watson" and
"Crick" to save space.)
```{r plot_lesion_segregation}
plot_lesion_segregation(grl[1:2])
```

## Calculating lesion segregation

You can also calculate whether lesion segregation is present instead of 
visualizing it. 

You can calculate whether lesion segregation is present using
`calculate_lesion_segregation`. This has the benefit that we can quantify the
amount of lesion segregation and generate p-values. However, the generated
p-values aren't always 100% reliable as will be discussed below. Therefore, we
recommend you to always confirm any suspected lesion segregation by visualizing
it.

`calculate_lesion_segregation` has three different modes. The first mode is
based on how often two subsequent mutations occur on the same strand. The
function assumes that when no lesion segregation is present, there is a 50%
chance of two subsequent mutations occurring on the same strand. A two-sided
binomial test is used to calculate whether the strand between subsequent
mutations is switched more often than that. Multiple testing correction is also
performed.
```{r calculate_lesion_segregation}
lesion_segretation <- calculate_lesion_segregation(grl, sample_names)
head(lesion_segretation)
```

This statistical test can be influenced by events such as kataegis and local
strand asymetries like replication-associated strand bias. As a result the
p-value can incorrectly suggest that lesion segregation is present. Therefore,
it can be useful to also look at the fraction of strand switches. In samples
with lesion segregation this is generally below 0.4.

By default this mode calculates the lesion segregation for all mutations
together. However, a mutational process might cause multiple types of base
substitutions, which aren't necessarily considered to be on the same strand.
Therefore, it might be useful to calculate the number of strand switches per
mutation type and then sum up the results. In this case the reference genome
also needs to be set. We recommend using this when you have a sample with
suspected lesion segregation and multiple common types of base substitutions.
```{r calculate_lesion_segregation_split}
lesion_seg_type <- calculate_lesion_segregation(grl,
                                                sample_names,
                                                split_by_type = TRUE,
                                                ref_genome = ref_genome)
```
 
 
The second mode of `calculate_lesion_segregation` uses the Wald-Wolfowitz test,
which was used by @Aitken2020 This test checks whether the Watson and Crick
strands are randomly distributed. It's results should generally be similar to
the first mode.
```{r calculate_lesion_segregation_wald}
lesion_segretation_wald <- calculate_lesion_segregation(grl, sample_names, 
                                                        test = "wald-wolfowitz")
head(lesion_segretation_wald)
```

This statistical test can also be influenced by events such as kataegis and 
local strand asymetries like replication-associated strand bias. 

The third mode of `calculate_lesion_segregation` can calculate the rl20 value
and the associated genomic span, which together are somewhat less sensitive to
events like kataegis. 

A rl20 value of 6 means that at least 20% of mutations are
in a strand specific run of 6 or more consecutive mutations. The genomic span is
the part of the genome covered by these runs. If the rl20 is high and a decent
part of the genome is covered by the strand specific runs, then this provides
strong evidence of lesion segregation. A high rl20, combined with a low genomic
span (<5%) is indicative of local clustering events like kataegis
[@Aitken2020]. A downside of this method is that it doesn't generate a
p-value. In general, we recommend you to use the first or second mode
of `calculate_lesion_segregation` to get a p-value and the third mode to check
if you are looking at a genome wide process or a local process like kataegis.
```{r calculate_lesion_segregation_rl20}
lesion_segretation_rl20 <- calculate_lesion_segregation(grl,
  sample_names,
  test = "rl20",
  ref_genome = ref_genome,
  chromosomes = chromosomes
)
head(lesion_segretation_rl20)
```


# A note on the graphics

The plots made with this package are all made using `r CRANpkg("ggplot2")`
[@Wickham2016]. This means that all the plots (except for the plots with
dendograms) are highly customizable. You can for example change the size and 
text
orientation of the y-axis.

```{r plot_spectrum_xaxis}
p <- plot_spectrum(type_occurrences, legend = FALSE)
p_axis <- p +
  theme(axis.text.y = element_text(size = 14, angle = 90))
```

You can also change the entire theme of the plot.
```{r plot_spectrum_theme}
p_theme <- p +
  theme_classic()
```

```{r combine_plot_spectrum2, eval=TRUE, fig.wide = TRUE, message=FALSE}
grid.arrange(p, p_axis, p_theme, ncol = 3, widths = c(3, 3, 3))
```

More information on `r CRANpkg("ggplot2")` is available
[here](https://ggplot2.tidyverse.org/). A list of themes is available
[here](https://ggplot2.tidyverse.org/reference/ggtheme.html).

# Session Info
```{r sessioninfo}
sessionInfo()
```

# References