File: sav_spec.md

# Sparse Allele Vectors Specification

## Variable Length Integer (VLI) Encoding
All quantities are encoded in LEB128 format (https://en.wikipedia.org/wiki/LEB128). An encoded integer can start in the middle of a byte, which leaves 1 to 7 bits of that byte free to carry data that prefixes the integer.
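
As a point of reference, plain byte-aligned LEB128 can be sketched as below. This is a minimal illustration that assumes the unsigned variant, which the text above does not state explicitly; the function names `encode_uleb128` and `decode_uleb128` are hypothetical and not part of any library.

```python
from typing import Tuple


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (assumed variant)."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the 7 least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_uleb128(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one unsigned LEB128 integer; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

For instance, `encode_uleb128(624485)` yields the three bytes `E5 8E 26`, the example used on the linked LEB128 page.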

## Variable Length String (VLS) Encoding
```
+~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
|   SIZE   |         STRING_DATA         |
+~~~~~~~~~~+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~+
* SIZE: Size of string encoded as VLI.
* STRING_DATA: String payload stored in SIZE bytes.
```
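
The VLS layout above can be read and written with a short helper. The sketch below assumes SIZE is a byte-aligned unsigned LEB128 integer with no prefix bits; `write_vls` and `read_vls` are hypothetical names chosen for illustration.

```python
from typing import Tuple


def write_vls(payload: bytes) -> bytes:
    """Serialize a string as SIZE (unsigned LEB128, assumed) + STRING_DATA."""
    size = len(payload)
    out = bytearray()
    while True:
        low7 = size & 0x7F
        size >>= 7
        if size:
            out.append(low7 | 0x80)  # more SIZE bytes follow
        else:
            out.append(low7)
            break
    return bytes(out) + payload


def read_vls(data: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Read one VLS starting at pos; return (payload, next position)."""
    size = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    end = pos + size
    if end > len(data):
        raise ValueError("truncated VLS: fewer than SIZE bytes remain")
    return data[pos:end], end
```

With these helpers, `write_vls(b"chr20")` produces `b"\x05chr20"`, and `read_vls` recovers the payload together with the position of the next field.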


## Header Format
```
+----------------------------------------------------------------+
|   "sav" string in binary + 4 bytes for version (Major.minor)   |
+----------------------------------------------------------------+
| 01110011 01100001 01110110 MMMMMMMM MMMMMMMM mmmmmmmm mmmmmmmm |
+----------------------------------------------------------------+

+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 16 byte UUID (used to verify index matches correct version of file)                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU UUUUUUUU |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

+~~~~~~~~~~~~~~~+VVVVVVVVVVVVVVVVVVV+
| HEADERS_COUNT | HEADERS_ARRAY ... |
+~~~~~~~~~~~~~~~+VVVVVVVVVVVVVVVVVVV+
* HEADERS_COUNT: Number of header key-value pairs.
* HEADERS_ARRAY: Array of HEADERS_COUNT header key-value pairs.
```
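
The fixed-size part of the header (magic string, version, and UUID) occupies 23 bytes and can be split as sketched below. The layout above does not state the byte order of the version fields, so this sketch returns them as raw 2-byte slices rather than guessing; passing the 16 UUID bytes to `uuid.UUID(bytes=...)` as-is is likewise an assumption. The name `parse_fixed_header` is hypothetical.

```python
import uuid
from typing import Tuple

MAGIC = b"sav"          # 01110011 01100001 01110110
FIXED_HEADER_SIZE = 23  # 3 magic + 4 version + 16 UUID bytes


def parse_fixed_header(data: bytes) -> Tuple[bytes, bytes, uuid.UUID, int]:
    """Split the fixed header; return (major_raw, minor_raw, uuid, next position)."""
    if len(data) < FIXED_HEADER_SIZE:
        raise ValueError("header is shorter than 23 bytes")
    if data[:3] != MAGIC:
        raise ValueError("missing 'sav' magic string")
    major_raw = data[3:5]   # MMMMMMMM MMMMMMMM (byte order not specified)
    minor_raw = data[5:7]   # mmmmmmmm mmmmmmmm (byte order not specified)
    file_uuid = uuid.UUID(bytes=data[7:23])
    # HEADERS_COUNT and HEADERS_ARRAY begin at the returned position.
    return major_raw, minor_raw, file_uuid, FIXED_HEADER_SIZE
```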

## Allele Pairs
Allele pairs are encoded in 1 or more bytes. The first BIT_WIDTH bits of the first byte encode the non-zero value of the allele. The remaining bits and any additional bytes in the pair make up a VLI that represents a zero-based offset from the previous non-zero allele.

Allele values can be decoded with the following formula: (VALUE + 1) / 2 ^ BIT_WIDTH. With 1-bit values, 0.5 is considered a missing binary genotype. Values of 2 bits or more represent binned posterior genotype probabilities.
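
The decoding formula can be checked with a few lines of code. The sketch below reads the formula as (VALUE + 1) divided by 2 raised to BIT_WIDTH, which is the reading consistent with 0.5 being the missing value for 1-bit data; `allele_value` is a hypothetical helper name.

```python
def allele_value(value: int, bit_width: int) -> float:
    """Map a stored BIT_WIDTH-bit VALUE to its allele value."""
    if not 1 <= bit_width <= 7:
        raise ValueError("BIT_WIDTH must be between 1 and 7")
    if not 0 <= value < (1 << bit_width):
        raise ValueError("VALUE does not fit in BIT_WIDTH bits")
    return (value + 1) / (2 ** bit_width)


# 1-bit data: stored 0 is missing (0.5), stored 1 is the alternate allele (1.0).
one_bit = [allele_value(v, 1) for v in range(2)]

# 2-bit data: four bins of width 0.25.
two_bit = [allele_value(v, 2) for v in range(4)]
```

Note that no stored value decodes to 0, which matches the text above: only non-zero alleles are stored.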

### Example
```
A 1-bit missing allele with an offset of 25.
+-+--------+
|0|001 1001|
+-+--------+

A 1-bit alternate allele with an offset of 8000.
+-+--------+ +---------+
|1|100 0000| |0111 1101|
+-+--------+ +---------+
```
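
The two examples above imply a bit layout that the text does not spell out: the BIT_WIDTH value bits are followed by one continuation flag, then the low bits of the offset, and any further bytes are ordinary LEB128 bytes whose payload is shifted past the bits already read. The sketch below decodes a pair under that inferred assumption; `decode_allele_pair` is a hypothetical name.

```python
from typing import Tuple


def decode_allele_pair(data: bytes, pos: int = 0,
                       bit_width: int = 1) -> Tuple[int, int, int]:
    """Decode one allele pair; return (VALUE, offset, next position)."""
    if not 1 <= bit_width <= 7:
        raise ValueError("BIT_WIDTH must be between 1 and 7")
    first = data[pos]
    pos += 1
    # Bits left in the first byte after the prefix and the continuation flag.
    offset_bits = 8 - bit_width - 1
    value = first >> (8 - bit_width)          # top BIT_WIDTH bits
    more = (first >> offset_bits) & 1         # continuation flag (inferred)
    offset = first & ((1 << offset_bits) - 1)
    shift = offset_bits
    while more:
        byte = data[pos]
        pos += 1
        offset |= (byte & 0x7F) << shift      # 7 payload bits per extra byte
        more = byte & 0x80
        shift += 7
    return value, offset, pos


missing = decode_allele_pair(bytes([0b00011001]))
alternate = decode_allele_pair(bytes([0b11000000, 0b01111101]))
```

Under this reading, `missing` is `(0, 25, 1)` and `alternate` is `(1, 8000, 2)`, which agrees with both examples.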

## Record Format
```
+vvvvvvvv+~~~~~~~~~+vvvvvvvvv+vvvvvvvvv+VVVVVVVVVVVVVVVVVVVVVV+~~~~~~~~~~~~~~+~~~~~~~~~+VVVVVVVVVVVVVVVVVVVVVVV+
| CHROM  |   POS   |   REF   |   ALT   | META_VALUE_ARRAY ... | PLOIDY_LEVEL | APA_SZ  | ALLELE_PAIR_ARRAY ... |
+vvvvvvvv+~~~~~~~~~+vvvvvvvvv+vvvvvvvvv+VVVVVVVVVVVVVVVVVVVVVV+~~~~~~~~~~~~~~+~~~~~~~~~+VVVVVVVVVVVVVVVVVVVVVVV+

* CHROM: Chromosome string stored as VLS.
* POS: Chromosome position stored as VLI.
* MARKID: Marker ID string stored as VLS.
* REF: Reference haplotype stored as VLS.
* ALT: Alternate haplotype stored as VLS.
* META_VALUE_ARRAY: Array of size META_FIELDS_CNT that stores metadata for each marker. Values correspond to META_FIELDS_ARRAY in header.
* PLOIDY_LEVEL: Ploidy level stored as VLI.
* APA_SZ: Size of allele pair array stored as VLI with one bit prefix.
* ALLELE_PAIR_ARRAY: Array of size APA_SZ that stores alternate alleles using Allele Pair encoding.

```