File: rev-store.md

The Revision Store
==================

The revision store is where Git data relevant to Dune package management is
cached.

The Concepts
------------

The revision store uses Git in the spirit of its original slogan, as a
content-addressable file system. A lot of data (code) and metadata (opam files)
is stored in Git repositories that are often forks of each other, so Dune keeps
a Git object cache to save space.

Git stores revisions and makes them addressable. These revisions do not need to
share a common ancestor, and because Git addresses objects by their SHA-1
hashes, multiple repositories can be joined in one single Git repository
without clashing. A fairly common use case outside of Dune is the `gh-pages`
branch used to serve documentation on GitHub, which does not share a common
parent with the main branch of the repository.

The revision store exploits this feature by putting all revisions of all
repositories into one single large repository to take advantage of caching
effects.
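
As a minimal sketch of why sharing one object store saves space, the snippet
below computes blob IDs the way Git does (SHA-1 over a `blob <size>\0` header
followed by the content) and stores the files of two forks in one toy store.
The `ObjectStore` class and the two example "repositories" are hypothetical
illustrations, not part of Dune.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store (hypothetical): same content, same key."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        object_id = git_blob_id(content)
        # Identical content maps to the same ID, so it is kept only once.
        self.objects.setdefault(object_id, content)
        return object_id


store = ObjectStore()
# Two made-up forks that share one file and differ in another.
upstream = {"dune-project": b"(lang dune 3.0)\n", "README.md": b"upstream\n"}
fork = {"dune-project": b"(lang dune 3.0)\n", "README.md": b"fork\n"}
for repo in (upstream, fork):
    for content in repo.values():
        store.add(content)

# Four files were added, but the shared file is stored only once.
print(len(store.objects))
```

Real Git also stores trees and commits as objects and packs them on disk, so
this only illustrates the addressing idea, not the storage format.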

The Advantages
--------------

With this organization, forks of a Git repository share the Git objects they
have in common, so those objects are neither downloaded nor stored a second
time. Updating repositories can be done incrementally, as Git knows which
revisions are available locally and which ones need to be fetched.

It is also easy to refer to previous states: the commits are part of a Git
repository, so checking out an older version is as simple as checking out the
current version of the files.

Considerations and Compromises
------------------------------

An important consideration was that the management of the revision store should
be entirely transparent to the user: they should not need to take any steps to
create or maintain it. It should be created automatically if needed, and all
the steps necessary to keep it updated should happen in the background. The
store should always work like a cache that can be discarded safely without
causing data loss.

The revision store should always give out the most recent version of data,
unless explicitly instructed otherwise. This means that:

  * If only a Git source is specified, then the revision store will
    automatically get the newest revision
  * If the source specifies a tag or branch, then the revision store will
    automatically update to the newest revision
  * If a revision is specified, the revision store will only update if the
    revision is not yet cached, otherwise the cached version can be used
The last point means that offline usage is possible if every repository is
specified with its revision hash.

Because the revision store is a Git repository, the data sources added to it
also have to be available via Git. Repositories that use other version control
systems are not supported at the moment, nor are plain HTTP sources.

Support for other VCSes is a possible extension: the same concepts could be
replicated with other version control systems, provided they offer flexibility
similar to Git's way of storing revisions. However, at the moment most users
have settled on Git, so this version should accommodate the needs of most
users.

Another compromise is that old repositories with long histories and large sizes
have to be cloned before use, thus increasing the size of the initial download
compared to the same metadata downloaded as a compressed tarball. Despite Git
compressing objects, the history of the repositories to be added does increase
the overhead.

A solution to this could be shallow clones, which only contain the latest
revisions. However, these have [proven to be
problematic](https://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/),
so for the time being we fetch the complete histories.

Implementation
--------------

This section describes the current way the revision store is implemented.

As the revision store is not project-specific, it is stored in the user's cache
directory (the directory given by `XDG_CACHE_HOME` in the
[freedesktop.org](https://www.freedesktop.org/wiki/) specifications), with all
Dune instances sharing one single revision store.
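
A minimal sketch of how such a location could be resolved, following the XDG
Base Directory rules (use `XDG_CACHE_HOME` if it is set to an absolute path,
otherwise fall back to `$HOME/.cache`). The function name `rev_store_dir` and
the `dune/rev-store` subdirectory are assumptions made for this example; the
path Dune actually uses may differ.

```python
import os
from pathlib import Path
from typing import Mapping


def rev_store_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return a cache location for the store (subdirectory is hypothetical)."""
    cache_home = env.get("XDG_CACHE_HOME", "")
    # The XDG spec asks implementations to ignore unset, empty,
    # or relative values and to use $HOME/.cache instead.
    if cache_home and os.path.isabs(cache_home):
        base = Path(cache_home)
    else:
        home = env.get("HOME") or str(Path.home())
        base = Path(home) / ".cache"
    return base / "dune" / "rev-store"


print(rev_store_dir({"HOME": "/home/alice"}))
print(rev_store_dir({"HOME": "/home/alice", "XDG_CACHE_HOME": "/var/cache/alice"}))
```

Passing the environment as a parameter keeps the function easy to test
without touching the real process environment.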

The revision store itself is a `bare` Git repository without a worktree. This
is because all repositories in the revision store are equal, and checking out
one particular revision would be a waste of disk space, as the Git tooling can
construct any revision out of the bare repository anyway.
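
To illustrate how content can be read from a bare repository without a
worktree, the sketch below builds two standard Git command lines: `git show
<rev>:<path>` prints a single file at a revision, and `git archive` emits the
whole tree of a revision as a tar stream. The helper names and example
arguments are hypothetical, the commands are only constructed and not run,
and nothing here claims to be the exact set of commands Dune invokes.

```python
from typing import List


def show_file_command(store_dir: str, revision: str, path: str) -> List[str]:
    """Command that prints one file as it was at the given revision."""
    return ["git", f"--git-dir={store_dir}", "show", f"{revision}:{path}"]


def archive_command(store_dir: str, revision: str) -> List[str]:
    """Command that writes the full tree of a revision as a tar stream."""
    return ["git", f"--git-dir={store_dir}", "archive", "--format=tar", revision]


# Placeholder store location and revision, for illustration only.
store = "/home/alice/.cache/dune/rev-store"
print(" ".join(show_file_command(store, "HEAD", "dune-project")))
print(" ".join(archive_command(store, "HEAD")))
```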

Thus every source is added to the revision store as a remote that tracks the
default branch (or, if a branch is specified explicitly, that branch) and is
then fetched, which stores the required revisions in the revision store.
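
The flow described above could look roughly like the following sequence of Git
invocations. This is a hedged sketch: it only builds the command lines, and
the helper names as well as the scheme of naming remotes after a hash of
their URL are assumptions for this example. Note also that without an
explicit branch this sketch relies on Git's default refspec, which fetches
all branches of the remote; how Dune narrows this to the default branch is not
shown here.

```python
import hashlib
import shlex
from typing import List, Optional


def remote_name(url: str) -> str:
    """Derive a stable remote name from the URL (hypothetical scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def add_source_commands(
    store_dir: str, url: str, branch: Optional[str] = None
) -> List[List[str]]:
    """Build the Git commands to register a source and fetch it."""
    git = ["git", f"--git-dir={store_dir}"]
    name = remote_name(url)
    add = git + ["remote", "add"]
    if branch is not None:
        # `git remote add -t <branch>` restricts tracking to that branch.
        add += ["-t", branch]
    add += [name, url]
    return [
        ["git", "init", "--bare", store_dir],  # create the bare store
        add,                                   # register the source as a remote
        git + ["fetch", name],                 # download the missing objects
    ]


for cmd in add_source_commands(
    "/home/alice/.cache/dune/rev-store",       # placeholder store location
    "https://example.com/opam-repository.git",  # placeholder URL
    branch="main",
):
    print(shlex.join(cmd))
```

Because all remotes live in the same bare repository, a later fetch of a fork
only needs to transfer the objects that are not already present.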

The implementation of these features is a mix of calling the `git` binary and
implementing parts in OCaml. This means that the `git` binary is required on
the system. A possible future improvement would be to use
[ocaml-git](https://github.com/mirage/ocaml-git) to avoid this dependency.