# `gitaly-backup`
`gitaly-backup` is used to create backups of the Git repository data from
Gitaly and Gitaly Cluster.
## Directly back up repository data
1. For each project to back up, find the Gitaly storage name and relative or disk path using either:
- The [Admin area](https://docs.gitlab.com/ee/administration/repository_storage_types.html#from-project-name-to-hashed-path).
- The [repository storage API](https://docs.gitlab.com/ee/api/projects.html#get-the-path-to-repository-storage).
1. Generate the `GITALY_SERVERS` environment variable. This variable specifies
the address and authentication details of each storage to back up. The
variable takes a base64-encoded JSON object.
For example:
```shell
export GITALY_SERVERS=`echo '{"default":{"address":"unix:/var/opt/gitlab/gitaly.socket","token":""}}' | base64 --wrap=0`
```
1. Generate the backup job file. The job file consists of a series of JSON objects separated by a newline (`\n`).
| Attribute | Type | Required | Description |
|:--------------------|:---------|:---------|:------------|
| `address` | string | no | Address of the Gitaly or Gitaly Cluster server. Overrides the address specified in `GITALY_SERVERS`. |
| `token` | string | no | Authentication token for the Gitaly server. Overrides the token specified in `GITALY_SERVERS`. |
| `storage_name` | string | yes | Name of the storage where the repository is stored. |
| `relative_path` | string | yes | Relative path of the repository. |
| `gl_project_path` | string | no | Name of the project. Used for logging. |
For example, `backup_job.json`:
```json
{
"storage_name":"default",
"relative_path":"@hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git",
"gl_project_path":"diaspora/diaspora-client"
}
{
"storage_name":"default",
"relative_path":"@hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git",
"gl_project_path":"brightbox/puppet"
}
```
1. Pipe the backup job file to `gitaly-backup create`.
```shell
/opt/gitlab/embedded/bin/gitaly-backup create -path $BACKUP_DESTINATION_PATH < backup_job.json
```
| Argument | Type | Required | Description |
|:----------------------|:----------|:---------|:------------|
| `-path` | string | yes | Directory where the backup files will be created. |
| `-parallel` | integer | no | Maximum number of parallel backups. |
| `-parallel-storage` | integer | no | Maximum number of parallel backups per storage. |
| `-id` | string | no | Used to determine a unique path for the backup when a full backup is created. |
| `-layout` | string | no | How backup files are located. Either `pointer` (default) or `legacy`. |
| `-incremental` | bool | no | Indicates whether to create an incremental backup. |
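For more than a handful of projects, the job file can be generated rather than written by hand. The following is a minimal sketch, not part of `gitaly-backup` itself: the `repos` variable and its space-separated input format are hypothetical, and the values are the ones from the example job file above. It writes one compact JSON object per line, which is one way to satisfy the newline-separated format.

```shell
# Hypothetical input: one "storage_name relative_path gl_project_path" triple per line.
repos='default @hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git diaspora/diaspora-client
default @hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git brightbox/puppet'

# Emit one JSON object per line.
# Assumes the values contain no characters that need JSON escaping
# (such as double quotes or backslashes) and no spaces.
printf '%s\n' "$repos" | while read -r storage relative_path project; do
  printf '{"storage_name":"%s","relative_path":"%s","gl_project_path":"%s"}\n' \
    "$storage" "$relative_path" "$project"
done > backup_job.json
```

The resulting `backup_job.json` can then be piped to `gitaly-backup create` as shown in the last step.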
## Directly restore repository data
1. For each project to restore, find the Gitaly storage name and relative or disk path using either:
- The [Admin area](https://docs.gitlab.com/ee/administration/repository_storage_types.html#from-project-name-to-hashed-path).
- The [repository storage API](https://docs.gitlab.com/ee/api/projects.html#get-the-path-to-repository-storage).
1. Generate the `GITALY_SERVERS` environment variable. This variable specifies
the address and authentication details of each storage to restore to. The
variable takes a base64-encoded JSON object.
For example:
```shell
export GITALY_SERVERS=`echo '{"default":{"address":"unix:/var/opt/gitlab/gitaly.socket","token":""}}' | base64 --wrap=0`
```
1. Generate the restore job file. The job file consists of a series of JSON objects separated by a newline (`\n`).
| Attribute | Type | Required | Description |
|:--------------------|:---------|:---------|:------------|
| `address` | string | no | Address of the Gitaly or Gitaly Cluster server. Overrides the address specified in `GITALY_SERVERS`. |
| `token` | string | no | Authentication token for the Gitaly server. Overrides the token specified in `GITALY_SERVERS`. |
| `storage_name` | string | yes | Name of the storage where the repository is stored. |
| `relative_path` | string | yes | Relative path of the repository. |
| `gl_project_path` | string | no | Name of the project. Used for logging. |
| `always_create` | boolean | no | Create the repository even if no bundle for it exists (for compatibility with existing backups). Defaults to `false`. |
For example, `restore_job.json`:
```json
{
"storage_name":"default",
"relative_path":"@hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git",
"gl_project_path":"diaspora/diaspora-client",
"always_create": true
}
{
"storage_name":"default",
"relative_path":"@hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git",
"gl_project_path":"brightbox/puppet"
}
```
1. Pipe the restore job file to `gitaly-backup restore`.
```shell
/opt/gitlab/embedded/bin/gitaly-backup restore -path $BACKUP_SOURCE_PATH < restore_job.json
```
| Argument | Type | Required | Description |
|:----------------------------|:-----------------------|:---------|:------------|
| `-path` | string | yes | Directory where the backup files are stored. |
| `-parallel` | integer | no | Maximum number of parallel restores. |
| `-parallel-storage` | integer | no | Maximum number of parallel restores per storage. |
| `-layout` | string | no | How backup files are located. Either `pointer` (default) or `legacy`. |
| `-remove-all-repositories` | comma-separated list | no | List of storage names to have all repositories removed from before restoring. You must specify `GITALY_SERVERS` for the listed storage names. |
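Because `GITALY_SERVERS` is opaque once encoded, it can be worth decoding it again before running a backup or restore. This is a small sanity-check sketch that reuses the example value from the steps above. It assumes GNU coreutils `base64`, the same implementation that provides `--wrap=0`.

```shell
# Same example value as in the steps above.
export GITALY_SERVERS=$(echo '{"default":{"address":"unix:/var/opt/gitlab/gitaly.socket","token":""}}' | base64 --wrap=0)

# Decode the variable again to confirm the JSON survived encoding intact.
decoded=$(printf '%s' "$GITALY_SERVERS" | base64 --decode)
echo "$decoded"
# prints {"default":{"address":"unix:/var/opt/gitlab/gitaly.socket","token":""}}
```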
## Path
The path determines where backup files are created or restored from, either on
the local filesystem or in object storage. The path is set using the `-path` flag.
### Local filesystem
If `-path` specifies a local filesystem, it is the root of where all backup
files are created.
### Object storage
`gitaly-backup` supports streaming backup files directly to object storage
using the [`gocloud.dev/blob`](https://pkg.go.dev/gocloud.dev/blob) library.
`-path` can be used with:
- [Amazon S3](https://pkg.go.dev/gocloud.dev/blob/s3blob). For example, `-path=s3://my-bucket?region=us-west-1`.
- [Azure Blob Storage](https://pkg.go.dev/gocloud.dev/blob/azureblob). For example, `-path=azblob://my-container`.
- [Google Cloud Storage](https://pkg.go.dev/gocloud.dev/blob/gcsblob). For example, `-path=gs://my-bucket`.
## Layouts
The way backup files are arranged on the filesystem or on object storage is
determined by the layout. The layout is set using the `-layout` flag.
### Legacy layout
This layout is designed to be identical to historic `backup.rake` repository
backups. Repository data is stored in bundle files in a pre-determined
directory structure based on each repository's relative path. This directory
structure is then archived into a tar file by `backup.rake`. Each time a backup
is created, this entire directory structure is recreated.
For example, a repository with the relative path of
`@hashed/4e/c9/4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a.git`
creates the following structure:
```text
$BACKUP_DESTINATION_PATH/
@hashed/
4e/
c9/
4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a.bundle
```
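The mapping from relative path to bundle location shown above can be sketched with shell parameter expansion. This only illustrates the naming scheme from the example tree. The destination directory is a hypothetical value, and the snippet does not call `gitaly-backup`.

```shell
BACKUP_DESTINATION_PATH=/tmp/backups   # hypothetical destination
relative_path='@hashed/4e/c9/4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a.git'

# Strip the trailing ".git" from the relative path and append ".bundle".
bundle_path="$BACKUP_DESTINATION_PATH/${relative_path%.git}.bundle"
echo "$bundle_path"
# prints /tmp/backups/@hashed/4e/c9/4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a.bundle
```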
#### Generating full backups
A bundle with all references is created via the RPC `CreateBundle`. It
effectively executes the following:
```shell
git bundle create repo.bundle --all
```
#### Generating incremental backups
This layout does not support incremental backups.
### Pointer layout
This layout is designed to support incremental backups. Each repository backup
cannot overwrite a previous backup because this would leave dangling incremental
backups. To prevent dangling incremental backups, every new full backup is put into a new directory.
The two files called `LATEST` point to:
- The latest full backup.
- The latest increment of that full backup.
These pointer files enable looking up
backups from object storage without needing directory traversal (directory
traversal typically requires additional permissions). In addition to the bundle
files, each backup writes a full list of refs and their target object IDs.
When the pointer files are not found, the pointer layout will fall back to
using the legacy layout.
For example, a repository with the relative path of
`@hashed/4e/c9/4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a.git`
and a backup ID of `20210930065413` will create the following structure:
```text
$BACKUP_DESTINATION_PATH/
@hashed/
4e/
c9/
4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a/
LATEST
20210930065413/
001.bundle
001.refs
LATEST
```
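To illustrate how the two `LATEST` files avoid directory traversal, the following sketch builds a mock of the tree above in a temporary directory and resolves the most recent bundle by reading only the pointer files. The file contents are an assumption made for illustration: the outer `LATEST` is taken to hold the backup ID and the inner one the increment name. The exact on-disk format is not specified here.

```shell
# Build a mock of the example tree in a temporary directory.
# Assumption: each pointer file contains only the name it points to.
root=$(mktemp -d)
repo="$root/@hashed/4e/c9/4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a"
mkdir -p "$repo/20210930065413"
echo 20210930065413 > "$repo/LATEST"
echo 001 > "$repo/20210930065413/LATEST"
touch "$repo/20210930065413/001.bundle" "$repo/20210930065413/001.refs"

# Follow the two pointers; no directory listing is needed.
backup_id=$(cat "$repo/LATEST")
increment=$(cat "$repo/$backup_id/LATEST")
latest_bundle="$repo/$backup_id/$increment.bundle"

# Print the resolved path relative to the mock backup root.
echo "${latest_bundle#"$root/"}"
```

Every path is derived from a known key, which is why the same lookup works against object storage where listing keys may require extra permissions.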
#### Generating full backups
1. A full list of references is retrieved via the RPC `ListRefs`. This list is written to `001.refs` in the same format as [`git-show-ref`](https://git-scm.com/docs/git-show-ref#_output).
1. A bundle is generated from the retrieved reference names, effectively by running:
```shell
awk '{print $2}' 001.refs | git bundle create repo.bundle --stdin
```
1. The backup and increment pointers are written.
#### Generating incremental backups
1. The next increment is calculated by finding the increment `LATEST` file and
adding 1. For example, `001` + `1` = `002`.
1. A full list of references is retrieved using the `ListRefs` RPC. This list is
written to the calculated next increment (for example, `002.refs`) in the same
format as [`git-show-ref`](https://git-scm.com/docs/git-show-ref#_output).
1. The full list of the previous increment's references is retrieved by reading
its refs file. For example, `001.refs`.
1. A bundle is generated using the negated list of reference targets of the
previous increment and the new list of retrieved reference names
by effectively running:
```shell
{ awk '{print "^" $1}' 001.refs; awk '{print $2}' 002.refs; } | git bundle create repo.bundle --stdin
```
Negating the object IDs from the previous increment ensures that we stop
traversing commits when we reach the HEAD of the branch at the time of the
last incremental backup.
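The steps above can be walked through with made-up data. In this sketch the object IDs and ref names are invented, the increment `LATEST` file is assumed to contain the bare value `001`, and the final `git bundle create` call is left out so that the revision list it would receive on standard input can be inspected.

```shell
# Step 1: compute the next increment name.
# expr reads its operands as decimal, so the leading zeros are harmless.
latest=001
next=$(printf '%03d' "$(expr "$latest" + 1)")

# Steps 2 and 3: made-up refs files in git-show-ref format
# ("<object ID> <ref name>").
work=$(mktemp -d)
printf '%s\n' \
  '1111111111111111111111111111111111111111 refs/heads/main' \
  > "$work/$latest.refs"
printf '%s\n' \
  '2222222222222222222222222222222222222222 refs/heads/main' \
  '3333333333333333333333333333333333333333 refs/heads/feature' \
  > "$work/$next.refs"

# Step 4: negated object IDs from the previous increment, followed by the
# current ref names. This is the text that would be piped to
# `git bundle create repo.bundle --stdin`.
revisions=$(
  awk '{print "^" $1}' "$work/$latest.refs"
  awk '{print $2}' "$work/$next.refs"
)
echo "$revisions"
```

With this input the revision list consists of `^1111111111111111111111111111111111111111`, then `refs/heads/main`, then `refs/heads/feature`. The new branch `refs/heads/feature` is included in full apart from anything already reachable from the excluded object.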