vg uses Protocol Buffers (ProtoBuf) for internal data representation and serialization. This page describes the ProtoBuf schema we use and the concepts behind the API, without going deep into API specifics.

### A quick ProtoBuf primer
ProtoBuf is a schema language, much like Apache Avro, FlatBuffers, Cap'n Proto, and many others. It is a human-readable language developed by Google for describing objects ('messages' in ProtoBuf-speak). You write your data model (or 'schema') in this language and then compile it with a dedicated compiler, `protoc`, which generates source code in a variety of common programming languages (C++, Python, Java, etc.). The generated source code contains getters and setters for the fields in the schema, along with code to serialize and parse messages. In short, ProtoBuf makes it easy to add or remove fields in a data model and to use the same model from other languages.
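To make the "getters and setters" idea concrete, here is a minimal hand-written sketch of the kind of C++ class `protoc` produces. The `Read` message, its fields, and the `make_example_read` helper are hypothetical and are not taken from `vg.proto`. Real generated classes follow the same `field()` / `set_field()` / `clear_field()` naming for string fields but also carry serialization and parsing code, which is omitted here.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical toy schema (NOT taken from vg.proto):
//
//   message Read {
//     string name = 1;
//     string sequence = 2;
//   }
//
// Hand-written stand-in for the class protoc would generate for it.
class Read {
public:
    // Getters return a const reference to the stored value.
    const std::string& name() const { return name_; }
    const std::string& sequence() const { return sequence_; }

    // Setters overwrite the stored value.
    void set_name(std::string value) { name_ = std::move(value); }
    void set_sequence(std::string value) { sequence_ = std::move(value); }

    // Clearing resets a field to its default (the empty string).
    void clear_name() { name_.clear(); }
    void clear_sequence() { sequence_.clear(); }

private:
    std::string name_;
    std::string sequence_;
};

// Hypothetical helper: build a small example message through the setters.
inline Read make_example_read() {
    Read read;
    read.set_name("read1");
    read.set_sequence("GATTACA");
    assert(read.sequence().size() == 7);
    return read;
}
```

Adding a field to the schema and recompiling would add a matching set of accessors, so calling code never has to touch the on-disk representation directly.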

### The vg ProtoBuf schema
The vg ProtoBuf schema is defined in [`vg.proto`](https://github.com/vgteam/vg/blob/master/src/vg.proto). If you browse through it you'll see our messages and their nested fields. For example, the 'Alignment' message represents a read aligned to the graph, analogous to a SAM record. It has nested fields such as 'sequence' and 'quality'. The following sections walk through these messages, as well as some additional C++ structures that support fast queries on them.
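As a rough illustration of the SAM analogy, the sketch below models only the two Alignment fields named above and renders them as the SEQ and QUAL columns of a SAM-like line. `AlignmentSketch` and `to_sam_like` are hypothetical names, not part of vg. The sketch also assumes that 'quality' holds one raw Phred score per base; check `vg.proto` for the actual encoding. The real generated class exposes accessors such as `sequence()` rather than public members.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the generated Alignment message; only the two
// fields named in the text are modeled.
struct AlignmentSketch {
    std::string sequence;  // read bases
    std::string quality;   // ASSUMPTION: one raw Phred score per base
};

// Render the SEQ and QUAL columns of a SAM-like line.
// SAM stores qualities as ASCII characters offset by 33 (Phred+33).
inline std::string to_sam_like(const AlignmentSketch& aln) {
    // Qualities are either absent or present for every base.
    assert(aln.quality.empty() || aln.quality.size() == aln.sequence.size());
    std::string qual;
    for (unsigned char q : aln.quality) {
        qual.push_back(static_cast<char>(q + 33));
    }
    if (qual.empty()) {
        qual = "*";  // SAM uses "*" when qualities are not stored
    }
    return aln.sequence + "\t" + qual;
}
```

For example, a four-base read with Phred scores 30, 30, 40, 40 would render its quality column as `??II`.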

### Graphs, Nodes, and Edges

### Paths

### Alignments, Mappings, and Edits

### Pileups (NodePileup, EdgePileup, and plain old Pileup)

### Translations

### Genotype, Locus, and Support