File: VG-GBWT-Subcommand.md

package info (click to toggle)
vg 1.30.0%2Bds-1
links: PTS, VCS
area: main
in suites: bullseye, sid
size: 267,848 kB
sloc: cpp: 446,974; ansic: 116,148; python: 22,805; cs: 17,888; javascript: 11,031; sh: 5,866; makefile: 4,039; java: 1,415; perl: 1,303; xml: 442; lisp: 242
file content (299 lines) | stat: -rw-r--r-- 18,386 bytes
# Overview

The `vg gbwt` subcommand is a pipeline for building and manipulating GBWT indexes and GBWTGraphs.

## File types

* [[GBWT|https://github.com/jltsiren/gbwt]] is an FM-index that can store a large number of similar threads (paths over node identifiers) space-efficiently. The default extensions is `.gbwt`.
* R-index is an optional structure that augments the GBWT for faster `locate()` queries. It is not currently used in the VG toolkit. The default extension is `.ri`.
* [[GBWTGraph|https://github.com/jltsiren/gbwtgraph]] is a `HandleGraph` implementation based of the graph induced by the threads stored in a GBWT index. It provides fast access to node sequences and an option to restrict graph traversals to paths supported by the indexed threads. The default extension is `.gg`.

## Pipeline

There are several steps in the pipeline. Most steps require either one input GBWT or multiple input GBWTs. The input GBWTs are either loaded from the files specified in the input args or built in earlier steps. Option `-p` / `--progress` enables printing progress information to stderr.

1. Build GBWT indexes from phased VCF files, paths embedded in the graph, or alignments in GAF/GAM files. Requires **VG version 1.29.0** or later.
2. Merge multiple GBWT indexes. Parallel algorithm requires **VG version 1.29.0** or later.
3. Remove a sample from the GBWT index.
4. Build, augment, or replace the GBWT index with a greedy path cover algorithm.
5. Build a GBWTGraph.
6. Build an r-index.
7. Print GBWT metadata.
8. Extract threads from a GBWT index.

If the pipeline produces an output GBWT, it will be written after step 4.

## Temporary files

The following steps create temporary files:

* Step 1: Construction from phased VCF files. VCF parse files usually require similar space to the compressed VCF files. If the construction produces multiple GBWT indexes, these temporary indexes will usually take similar space to the final GBWT index.
* Step 2: Parallel merging algorithm. The algorithm uses the first input GBWT as the base index. For each subsequent input GBWT, the temporary files take roughly 2-3 bytes per node (e.g. ~50 GB for 1000 paths and their reverse complements with 10 million nodes each). The files are deleted when the merging proceeds to the next input.

Temporary files are deleted when VG exits or fails in a controlled way. If VG crashes, the temporary files may remain.

Temporary directory can be specified with option `-d` / `--temp-dir`. If the option is not specified, VG tries to determine it using environment variables `TMPDIR`, `TMP`, `TEMP`, `TEMPDIR`, and `USERPROFILE` (in that order). If none of the variables is defined, VG defaults to `/tmp`.

## GBWT construction parameters

New GBWT indexes are built in steps 1 and 4. The construction uses the following computational parameters:

* `--buffer-size N`: Use a buffer of `N` million nodes for each construction job (default 100). The buffers typically require 8`N` MiB for each construction job. For best performance, buffer size should be at least an order of magnitude larger than the length of the threads.
* `--id-interval N`: Sample thread is at one out of `N` nodes (default 1024). This is a time-space trade-off. The samples use space proportional to 1/`N`, while `locate()` takes time proportional to `N`. The default is intended for 1000GP indexes, where the samples take similar space to the GBWT itself.

## Search parameters

Parallel GBWT merging (`-b` / `--parallel`) and r-index construction (`-r` / `--r-index`) extract all paths from a GBWT index using multiple parallel search threads.

* `--num-threads N`: Use `N` parallel search threads. The default is the number of logical CPU cores.

# GBWT construction (step 1)

Construction requires an input graph (`-x` / `--xg-name`) and an output GBWT (`-o` / `--output`).

## Construction from phased VCF files

**Example:** 1000GP GBWT construction.
```
# Combine multiple input graphs built with `vg construct -a`.
vg index -x graph-with-alts.xg -L $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)

# Build the GBWT using 14 parallel jobs. This requires ~200 GiB memory. Construction
# time is bounded by the time required for indexing chr2.
vg gbwt -x graph-with-alts.xg -o 1000gp.gbwt --preset 1000gp --num-jobs 14 -v \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vcf.gz; done)
```

Option `-v` / `--vcf-input` enables construction from phased VCF files. Input args are now interpreted as VCF files instead of GBWT files and each contig must be present in at most one VCF file. The VCF files must the ones used for building the graph or a subset of them. The input graph must contain variant information (often called alt paths) created by `vg construct` option `-a`. An XG index will store the alt paths if it is built with `vg index` option `-L`. (Note that this XG index can be several times larger than an index without alt paths.)

If the construction creates multiple jobs (see below), the resulting temporary GBWTs will automatically be merged in step 2 using the fast algorithm.

### VCF parsing options

* `--ignore-missing`: Do not warn when variants encountered in the VCF file are missing from the input graph.
* `--actual-phasing`: By default, GBWT construction treats unphased homozygous genotypes as phased. With this option, they will remain unphased and create phase breaks in the threads.
* `--force-phasing`: Randomly phase unphased genotypes.
* `--discard-overlaps`: If a haplotype contains overlapping alternate alleles and the overlap cannot be resolved, replace the second alternate allele with the reference allele instead of creating a phase break.
* `--batch-size N`: Generate threads in batches of `N` samples (default 200). Each construction job will buffer the threads generated from the current batch until the haplotype is finished or a phase break occurs. If thread length is `M` nodes, buffering `N` diploid samples typically takes around 8`MN` bytes for that job.

With options `--force-phasing` and `--discard-overlaps`, the generated haplotypes will generally be chromosome-length.

The following options are rarely necessary:

* `--sample-range X-Y`: Index only samples in range `X` to `Y` (inclusive, 0-based).
* `--rename V=P`: VCF contig `V` is the same as reference path `P` in the input graph (may repeat). By default, contig names are assumed to match path names.
* `--vcf-variants`: The names of alt paths in the input graph are based on VCF contig names instead of reference path names.
* `--vcf-region C:X-Y`: Restrict the VCF contig with name `C` to region `X-Y` (inclusive, 1-based; may repeat). This only makes sense if the graph was built with a similar restriction.
* `--exclude-sample X`: Do not index the sample with name `X` (may repeat). This is generally faster than removing the threads in step 3.

### Presets

Presets replace the default GBWT construction parameters and VCF parsing options. Preset `X` can be chosen with option `--preset X` and it can be further modified with subsequent command line arguments.

The following presets are currently available:

* `1000gp`: Generate chromosome-length haplotypes from 1000GP VCF files. Corresponds to `--buffer-size 200 --batch-size 100 --force-phasing --discard-overlaps`.

### Construction jobs

The construction will automatically create one or more jobs for each VCF file, each of them corresponding to one or more contigs (non-alt paths in the input graph). The size of a job is the total length of the paths in it (in nodes), and the maximum size of a job is the length of the longest path in the graph. By default, contigs are assigned to jobs using a first-fit heuristic in decreasing order by length. This can be overridden with option `--inputs-as-jobs`, which creates one job for each VCF file.

Each construction job uses two parallel threads, but one of the threads is usually idle. By default, `vg gbwt` runs `N`/2 construction jobs in parallel, where `N` is the number of logical CPU cores. This can be changed with option `--num-jobs N` (e.g. if there are memory constraints).

## Construction from embedded paths

**Example:**
```
# Build a GBWT of the embedded (reference) paths with each path as a contig in
# the GBWT metadata.
vg gbwt -x graph.xg -o ref-paths.gbwt -E
```

The embedded paths in a graph can be indexed as GBWT threads with option `-E` / `--index-paths`. This construction option does not use input args. By default, the paths are assumed to be reference paths. GBWT metadata will store path names as contig names and use an artificial sample name `ref` for all threads. This can be reversed with option `--paths-as-samples`: path names become sample names and all threads will be over an artificial contig with name `0`.

## Construction from GAF/GAM files

**Example:**

```
# Build a GBWT of the alignments in the GAF files.
vg gbwt -x graph.xg -o alignments.gbwt -A aln1.gaf aln2.gaf
```

The paths corresponding to alignments can be converted to threads and stored in a GBWT index with option `-A`/ `--alignment-input`. Multiple alignment files can be given as input args, but they will be processed sequentially. The default input format is GAF. Alignment files in the GAM format used by older versions of VG can be indexed with option `--gam-format`.

# Merging GBWT indexes (step 2)

There are multiple algorithms for merging GBWT indexes. The insertion and parallel algorithms work with any input GBWTs, while the fast algorithm requires that node identifiers do not overlap between the inputs.

Merging requires an output GBWT (`-o` / `--output`).

## Insertion algorithm

**Example:**
```
# Merge three batches of haplotypes into a single index.
vg gbwt -o merged.gbwt -m batch1.gbwt batch2.gbwt batch3.gbwt
```

Multiple input GBWTs can be merged into a single index with option `-m` / `--merge`. The default (insertion) algorithm uses the first input as the base index. It then loads the subsequent inputs one by one, extracts the threads from them, and inserts them into the base index using the construction algorithm. This is somewhat slower than building the indexes in the first place. Hence the largest index should be used as the base index.

## Fast merging of non-overlapping indexes

**Example:**
```
# Merge GBWTs of individual chromosomes into a single index.
vg gbwt -o all.gbwt -f $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done)
```

If each input GBWT uses separate node identifiers, they can be merged quickly with option `-f`/ `--fast`. This is usually the case when the input GBWTs are over separate chromosomes. The fast algorithm is automatically used when the construction from input VCFs creates multiple jobs.

## Parallel merging of overlapping indexes

**Example:**
```
# Merge three batches of haplotypes into a single index.
vg gbwt -o merged.gbwt -b batch1.gbwt batch2.gbwt batch3.gbwt
```

Parallel merging algorithm can be used with option `-b` / `--parallel`. It is primarily intended for merging large GBWTs that already contain thousands of human haplotypes. The algorithm uses the first input as the base index. Then, for each subsequent input, it first extracts all paths from the index, searches simultaneously for them in the base index, and writes the rank array into a number of temporary files. Then it launches a number of merge jobs that stream the rank array from the files and use the information for interleaving the two GBWTs.

### Parallel merging options

* `--chunk-size N`: Search in chunks of `N` sequences (default 1). Each thread is represented using two sequences. If the threads are short (e.g. alignments), larger chunk size may increase speed.
* `--pos-buffer N`: Use `N` MiB position buffers for each search thread (default 64).
* `--thread-buffer N`: Use `N` MiB thread buffers for each search thread (default 256). The compressed thread buffers should be several times larger than the uncompressed position buffers.
* `--merge-buffers N`: Merge 2^`N` thread buffers into one file per merge job (default 6). Larger numbers increase memory usage and make the temporary files larger. Smaller numbers create more temporary files and may make the merge jobs slower.
* `--merge-jobs N`: Run `N` parallel merge jobs. The default `M`/2 jobs on a system with `M` logical CPU cores, but no more than 4 jobs.

# Path cover GBWTs (step 4)

This step builds a GBWT based on artificially generated paths (threads). The path generation algorithm is similar to the one used in the Pan-genome Seed Index (PSI). We generate `N` threads for each component in the graph, starting each from an arbitrary node with the highest priority. We extend the thread greedily, always choosing the extension with the highest priority. See the comments in the [header](https://github.com/jltsiren/gbwtgraph/blob/master/include/gbwtgraph/path_cover.h) for further details.

Note that this is a greedy algorithm for a maximum coverage problem. We try to cover as many nodes as possible with the `N` threads while taking the priorities into account, but some nodes may not be covered.

## Path cover options

Path cover generation requires an input graph (`-x` / `--xg-name`), an output GBWT (`-o` / `--output`), and one algorithm.

* `-a` / `--augment-gbwt`: Augment the input GBWT with a greedy path cover of components with no threads.
* `-l`/ `--local-haplotypes`: Create a path cover based on sampling local haplotypes in the same proportions as in the input GBWT.
* `-P` / `--path-cover`: Create a greedy path cover of all components.
* `-n N` / `--num-paths N`: Generate `N` threads for each component (default 16 for most options).
* `-k K` / `--context-length K`: Use subpaths of length `K` for determining the priority of extensions.

## Greedy path cover

**Example:**
```
# Generate a greedy cover with 16 threads per component.
vg gbwt -x graph.xg -o greedy-cover.gbwt -P
```

The basic greedy path cover algorithm (option `-P` / `--path-cover`) assumes that each path in the graph is equally likely. High-priority nodes and subpaths are those with the lowest coverage in the generated threads so far.

Greedy path cover does not use input GBWTs.

## Sampling local haplotypes

**Example:**
```
# Generate 64 threads per component by sampling the local haplotypes proportionally.
vg gbwt -x graph.xg -o sampled.gbwt -l haplotypes.gbwt
```

The sampling algorithm (option `-l` / `--local-haplotypes`) samples the local haplotypes proportionally. If the input GBWT contains threads for the current component, the highest priority is given to the nodes and subpaths with the highest ratio
```
  true_frequency / (sampled_frequency + 1)
```
so far. If there are no threads, the algorithm generates a greedy path cover for the component.

Sampling requires one input GBWT. The default number of generated threads is `N` = 64.

## Path cover of missing components

**Example:** Simple augmentation.
```
# Augment the GBWT with a greedy path cover of missing components.
vg gbwt -x graph.xg -o augmented.gbwt -a haplotypes.gbwt
```

**Example:** Build and augment.
```
# Combine multiple input graphs built with `vg construct -a` and a decoy graph.
vg index -x graph-with-alts.xg -L $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done) decoy.vg

# Build the GBWT using 14 parallel jobs. Then augment it with a path cover of the decoy graph.
vg gbwt -x graph-with-alts.xg -o 1000gp.gbwt --preset 1000gp --num-jobs 14 -v -a \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vcf.gz; done)
```

The augmentation algorithm (option `-a` / `--augment-gbwt`) augments the input GBWT with a greedy path cover of graph components without threads.

Augmentation requires one input GBWT.

# Other steps

## Removing samples (step 3)

**Example:**
```
# Remove sample NA12878 from the GBWT.
vg gbwt -o removed.gbwt -R NA12878 1000gp.gbwt
```

Option `-R` / `--remove-sample` removes all threads corresponding to the sample with name `X` from the input GBWT (may repeat).

Removing samples requires one input GBWT and an output GBWT (`-o` / `--output`).

## GBWTGraph construction (step 5)

**Example:** Simple GBWTGraph construction.
```
# Build a GBWTGraph based on the input GBWT.
vg gbwt -x graph.xg -g graph.gg haplotypes.gbwt
```

**Example:** Haplotype sampling and GBWTGraph construction.
```
# Sample the local haplotypes and build a GBWTGraph based on the sampled GBWT.
vg gbwt -x graph.xg -o sampled.gbwt -g sampled.gg -l haplotypes.gbwt
```

We can build a GBWTGraph corresponding to the final GBWT with option `-g X` / `--graph-name X` and write it to file `X`. The final GBWT is either the output GBWT if other options require it or the only input GBWT otherwise. The graph requires the final GBWT. It does not work without it or with other GBWT indexes.

GBWTGraph construction requires one input GBWT and an input graph (`-x` / `--xg-name`).

## R-index construction (step 6)

**Example:**
```
# Build an r-index for the input GBWT.
vg gbwt -r haplotypes.ri haplotypes.gbwt
```

We can build an r-index corresponding to the final GBWT with option `-r X` / `--r-index X` and write it to file `X`. The final GBWT is either the output GBWT if other options require it or the only input GBWT otherwise. The r-index requires the final GBWT. It does not work without it or with other GBWT indexes.

R-index construction requires one input GBWT.

## Metadata output (step 7)

There are several options for printing metadata from the input GBWT to stdout.

* `M` / `--metadata`: Print basic metadata information.
* `C` / `--contigs`: Print the total number of contigs.
* `H` / `--haplotypes`: Print the total number of haplotypes.
* `S` / `--samples`: Print the total number of samples.
* `L` / `--list-names`: When used with `-C` or `-S`, prints the names of the contigs or the samples instead of the total.
* `T` / `--thread-names`: Print the names of all threads.

Metadata output requires one input GBWT.

## Thread extraction (step 8)

These options extract threads from the input GBWT. They work even when the GBWT index has been built without metadata.

* `-c` / `--count-threads`: Print the total number of threads to stdout.
* `-e X` / `--extract X`: Extract the threads as `sdsl::int_vector_buffer<0>` to file `X`. This format can be used by the `build_gbwt` tool included in the GBWT repository.

Thread extraction requires one input GBWT.