File: Data-model.md


vg uses Protocol Buffers (ProtoBuf) for its internal data representation and serialization. This page describes the ProtoBuf schema we use and some of the conceptual ideas behind the API, without going deep into API specifics.

### A quick ProtoBuf primer
ProtoBuf is a schema language, much like Apache Avro, FlatBuffers, or Cap'n Proto. It is a human-readable language, developed by Google, for describing structured objects ('messages' in protobuf-speak). You write your data model (or 'schema') in this language and then compile it with a dedicated compiler, `protoc`, which generates source code in a variety of common programming languages (C++, Python, Java, etc.). The generated code contains getters and setters for the fields in the schema, along with routines to serialize and parse messages. In short, protobuf makes it easy to add or remove fields from a data model and to use the same model from other languages.
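To make "getters and setters" concrete, here is a minimal hand-written C++ sketch of the accessor shape that `protoc` produces for a message. It is not generated code and not part of vg: the `Example` message, its `name` and `id` fields, and the `example_usage` helper are hypothetical, chosen only to show the naming pattern (`field()`, `set_field()`, `mutable_field()`, `clear_field()`).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hand-written stand-in for a protoc-generated C++ class (illustration only).
// Hypothetical schema it mimics:
//   message Example { string name = 1; int64 id = 2; }
class Example {
public:
    // String field: const getter, setter, mutable pointer, clear.
    const std::string& name() const { return name_; }
    void set_name(const std::string& value) { name_ = value; }
    std::string* mutable_name() { return &name_; }
    void clear_name() { name_.clear(); }

    // Scalar field: getter by value, setter, clear (resets to zero).
    int64_t id() const { return id_; }
    void set_id(int64_t value) { id_ = value; }
    void clear_id() { id_ = 0; }

private:
    std::string name_;
    int64_t id_ = 0;
};

// Hypothetical usage helper: fill the fields, edit one in place, read back.
inline std::string example_usage() {
    Example e;
    e.set_name("node");
    e.set_id(42);
    e.mutable_name()->append("-1");  // modify the string without copying it
    assert(e.id() == 42);
    return e.name();
}
```

Real generated classes also carry serialization and parsing methods, which this sketch leaves out. The practical point is that you never write this boilerplate by hand: adding a field to the schema and re-running `protoc` regenerates the accessors.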

### The vg protobuf schema
The vg protobuf schema is defined in [vg.proto](https://github.com/vgteam/vg/blob/master/src/vg.proto). Browsing through it, you'll see our messages and their fields. For example, there is a message called 'Alignment' that represents a read aligned to the graph; it is analogous to a SAM record and has fields such as 'sequence' and 'quality'. The following sections walk through these messages, as well as some additional C++ structures that support fast queries on them.
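As a rough illustration of the SAM analogy, the sketch below copies the SEQ and QUAL columns of a SAM line into a struct holding only the two fields named above. `AlignmentSketch`, `split_tabs`, and `from_sam_line` are hypothetical names, not vg code, and the struct is far smaller than the real 'Alignment' message. The quality string is copied verbatim from the SAM text; how vg actually encodes quality values is defined in `vg.proto`, and this sketch makes no assumption about it.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in holding only the two fields mentioned in the text.
struct AlignmentSketch {
    std::string sequence;  // read bases
    std::string quality;   // base qualities, kept here exactly as in the SAM text
};

// Split a tab-delimited line into its columns.
inline std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> cols;
    std::string col;
    std::istringstream in(line);
    while (std::getline(in, col, '\t')) {
        cols.push_back(col);
    }
    return cols;
}

// Copy SEQ (column 10) and QUAL (column 11) out of one SAM record.
inline AlignmentSketch from_sam_line(const std::string& line) {
    std::vector<std::string> cols = split_tabs(line);
    assert(cols.size() >= 11);  // a SAM record has 11 mandatory columns
    AlignmentSketch aln;
    aln.sequence = cols[9];
    aln.quality = cols[10];
    return aln;
}
```

The key difference from SAM is where the read is placed: a SAM record positions a read on a linear reference, whereas an 'Alignment' describes a read aligned to the graph.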

### Graphs, Nodes, and Edges

### Paths

### Alignments, Mappings, and Edits

### Pileups (NodePileup, EdgePileup, and plain old Pileup)

### Translations

### Genotype, Locus, and Support