1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
|
---
title: 'AIRR Data Representation Reference Library Usage'
author: "The AIRR Community"
date: '`r Sys.Date()`'
output:
html_document:
standalone: true
fig_height: 4
fig_width: 7.5
highlight: pygments
theme: readable
toc: yes
md_document:
fig_height: 4
fig_width: 7.5
preserve_yaml: no
toc: yes
pdf_document:
dev: pdf
fig_height: 4
fig_width: 7.5
highlight: pygments
toc: yes
geometry: margin=1in
fontsize: 11pt
vignette: >
%\VignetteIndexEntry{Usage Vignette}
%\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
---
## Introduction
Since the use of High-throughput sequencing (HTS) was first introduced to analyze
immunoglobulin (B-cell receptor, antibody) and T-cell receptor repertoires
(Freeman et al, 2009; Robins et al, 2009; Weinstein et al, 2009),
the increasing number of studies making use of this technique has produced enormous
amounts of data and there exists a pressing need to develop and adopt common standards,
protocols, and policies for generating and sharing data sets. The [Adaptive Immune
Receptor Repertoire (AIRR) Community](http://airr-community.org) formed in
2015 to address this challenge (Breden et al, 2017) and has stablished the set of
minimal metadata elements (MiAIRR) required for describing published AIRR
datasets (Rubelt et al, 2017) as well as file formats to represent this data in a
machine-readable form. The `airr` R package provide read, write and validation of data
following the AIRR Data Representation schemas. This vignette provides a set of
simple use examples.
### AIRR Data Standards
The AIRR Community's recommendations for a minimal set of metadata that should be
used to describe an AIRR-seq data set when published or deposited in a
AIRR-compliant public repository are described in Rubelt et al, 2017. The primary
aim of this effort is to make published AIRR datasets FAIR (findable, accessible,
interoperable, reusable); with sufficient detail such that a person skilled in the
art of AIRR sequencing and data analysis will be able to reproduce the
experiment and data analyses that were performed.
Following this principles, V(D)J reference alignment annotations are
saved in standard tab-delimited files (TSV) with associated metadata
provided in accompanying YAML formatted files. The column names
and field names in these files have been defined by the
AIRR Data Representation Working Group using a controlled vocabulary of
standardized terms and types to refer to each piece of information.
## Reading AIRR formatted files
The `airr` package contains the function `read_rearrangement` to read and validate
files containing AIRR Rearrangement records, where a Rearrangement record describes
the collection of optimal annotations on a single sequence that has undergone V(D)J reference alignment.
The usage is straightforward, as the file format is a typical tabulated file.
The argument that needs attention is `base`, with possible values `"0"` and `"1"`.
`base` denotes the starting index for positional fields in the input file.
Positional fields are those that contain alignment coordinates and names
ending in "_start" and "_end". If the input file is using 1-based closed
intervals (R style), as defined by the standard, then positional fields will
not be modified under the default setting of `base="1"`. If the input file is using
0-based coordinates with half-open intervals (python style), then
positional fields may be converted to 1-based closed
intervals using the argument `base="0"`.
### Reading Rearrangements
```{r}
# Imports
library(airr)
library(tibble)
# Read Rearrangement example file
f1 <- system.file("extdata", "rearrangement-example.tsv.gz", package="airr")
rearrangement <- read_rearrangement(f1)
glimpse(rearrangement)
```
### Reading AIRR Data Models
AIRR Data Model records, such as Repertoire and GermlineSet, can be read from either a YAML
or JSON formatted file into a nested list.
```{r}
# Read Repertoire example file
f2 <- system.file("extdata", "repertoire-example.yaml", package="airr")
repertoire <- read_airr(f2)
glimpse(repertoire)
# Read GermlineSet example file
f3 <- system.file("extdata", "germline-example.json", package="airr")
germline <- read_airr(f3)
glimpse(germline)
```
## Writing AIRR formatted files
The `airr` package contains the function `write_rearrangement` to write
Rearrangement records to the AIRR TSV format.
### Writing Rearrangements
```{r}
x1 <- file.path(tempdir(), "airr_out.tsv")
write_rearrangement(rearrangement, x1)
```
### Writing AIRR Data Models
AIRR Data Model records can be written to either YAML or JSON using the `write_airr` function.
```{r}
x2 <- file.path(tempdir(), "airr_repertoire_out.yaml")
write_airr(repertoire, x2, format="yaml")
x3 <- file.path(tempdir(), "airr_germline_out.json")
write_airr(germline, x3, format="json")
```
## Validating AIRR data structures
The `airr` package contains the function `validate_rearrangement` to validate
tabular (`data.frame`) Rearrangement records and AIRR Data Model objects, respectively.
```{r}
# Validate Rearrangement data.frame
validate_rearrangement(rearrangement)
# Validate an AIRR Data Model
validate_airr(repertoire)
# Validate AIRR Data Model records individual
validate_airr(germline, each=TRUE)
```
## References
1. Breden, F., E. T. Luning Prak, B. Peters, F. Rubelt, C. A. Schramm, C. E. Busse,
J. A. Vander Heiden, et al. 2017. Reproducibility and Reuse of Adaptive Immune
Receptor Repertoire Data. *Front Immunol* 8: 1418.
2. Freeman, J. D., R. L. Warren, J. R. Webb, B. H. Nelson, and R. A. Holt. 2009.
Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing.
*Genome Res* 19 (10): 1817-24.
3. Robins, H. S., P. V. Campregher, S. K. Srivastava, A. Wacher, C. J. Turtle,
O. Kahsai, S. R. Riddell, E. H. Warren, and C. S. Carlson. 2009.
Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells.
*Blood* 114 (19): 4099-4107.
4. Rubelt, F., C. E. Busse, S. A. C. Bukhari, J. P. Burckert, E. Mariotti-Ferrandiz,
L. G. Cowell, C. T. Watson, et al. 2017. Adaptive Immune Receptor Repertoire Community
recommendations for sharing immune-repertoire sequencing data.
*Nat Immunol* 18 (12): 1274-8.
5. Weinstein, J. A., N. Jiang, R. A. White, D. S. Fisher, and S. R. Quake. 2009.
High-throughput sequencing of the zebrafish antibody repertoire.
*Science* 324 (5928): 807-10.
|