File: compute_special_remote_interface.mdwn

The [[special_remotes/compute]] special remote uses this interface to run
compute programs.

When a compute special remote is initremoted, a program is specified:

    git-annex initremote myremote type=compute program=git-annex-compute-foo

The user adds an annexed file that is computed by the program by running
a command like one of these:

    git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
    git-annex addcomputed --to=myremote -- compress in out --level=9
    git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz

## security

Security is very important here, because a user who enables a compute
special remote and runs `git pull` followed by `git-annex get` is running
the compute program with inputs under the control of anyone who has
commit access to the repository.

The contents of input files should be assumed to be untrusted, and so
should the filenames of input and output files, as well as everything
else passed to the program in `ARGV` and the environment.

The program should make sure that whatever user input is passed
to it can result in only safe and expected behavior. The program should
avoid exposing user input to the shell unprotected, or otherwise executing
it. (Except when the program is explicitly running user input in some form
of sandbox.)
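
As a minimal sketch of that advice, a program can check that a parameter has the expected shape before passing it on, and reject anything else. The helper name `is_unsigned_integer` is hypothetical and not part of git-annex; `passes` is the parameter from the `convert` example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): succeeds only when $1 is a
# non-empty string made up entirely of decimal digits.
is_unsigned_integer() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Treat the value as untrusted: validate it before it reaches any command.
passes=${ANNEX_COMPUTE_passes:-1}
if ! is_unsigned_integer "$passes"; then
    echo "passes must be a non-negative integer" >&2
    exit 1
fi
```

Validating against a narrow pattern, and always quoting the value when it is used, keeps input such as `passes=1; rm -rf ~` from ever being interpreted by the shell.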

## program parameters and environment

Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to 
`git-annex initremote`.

To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.
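
Note that a name like `ANNEX_COMPUTE_--level` is not a valid shell variable name, so a shell script cannot expand it as `$ANNEX_COMPUTE_--level`. One way around that, sketched here, is to scan `ARGV` for the `name=value` form instead. The helper `param_value` is hypothetical; returning the first match is a design choice of this sketch, made because values from `git-annex addcomputed` come before values from `git-annex initremote`.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex).
# Usage: param_value NAME DEFAULT "$@"
# Prints the value of the first NAME=value argument, or DEFAULT if none.
param_value() {
    name=$1
    default=$2
    shift 2
    for arg in "$@"; do
        case "$arg" in
            "$name="*)
                value=${arg#"$name="}
                printf '%s\n' "$value"
                return 0
                ;;
        esac
    done
    printf '%s\n' "$default"
}

# Arguments as they might arrive for:
#   git-annex addcomputed --to=myremote -- compress in out --level=9
set -- compress in out --level=9
level=$(param_value --level 6 "$@")
echo "compression level: $level" >&2
```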

The program is run in a temporary directory, which will be cleaned up after
it exits. It may be run in a subdirectory of the temporary directory. This
is done when `git-annex addcomputed` was run in a subdirectory of the git
repository.

Anything that the program outputs to stderr will be displayed to the user.
Stderr should be used for error messages, and possibly computation
output, but not for progress displays.

If the program exits nonzero, nothing it computed will be stored in the 
git-annex repository.

## input files

Before doing any computation, the program needs to communicate with
git-annex about what input files it needs, and what output files it will
generate.

The content of any file in the repository can be an input to the
computation. The program requests an input by writing a line to stdout:

    INPUT file.raw

Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).

If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.

If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails, 
the program should exit.
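
The request-then-read ordering for multiple inputs can be sketched as follows. The helper `request_two_inputs` and the variables `first_input` and `second_input` are hypothetical names; the here-document only stands in for the replies that git-annex would send.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): request both inputs before
# reading any reply, so that retrieval can potentially run in parallel.
# Sets first_input and second_input; fails when stdin is closed early.
request_two_inputs() {
    printf 'INPUT %s\n' "$1"
    printf 'INPUT %s\n' "$2"
    IFS= read -r first_input || return 1
    IFS= read -r second_input || return 1
}

# Demonstration with canned replies in place of git-annex.
request_two_inputs foo bar <<'EOF'
.git/annex/objects/example-foo
.git/annex/objects/example-bar
EOF
echo "got: $first_input and $second_input" >&2
```

A real program would call `request_two_inputs "$2" "$3" || exit 1`, so that a closed stdin ends the program as described above.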

When `git-annex addcomputed --fast` is being used to add a computation to
the git-annex repository without actually performing it, the response to
each `INPUT` will be an empty line rather than the path to an input file.
This can also happen when an input file is not available for whatever
reason. In this case, the program should proceed with the rest of its
output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not perform
any computation.

## output files

For each output file that it will compute, the program should write a
line to stdout, indicating the name of the file that will be added to the
git-annex repository by `git-annex addcomputed`:

    OUTPUT file.jpeg

Then it should read a line from stdin, which is the path, in the program's
temporary directory, where it should write the output file. Often this will
be the same filename, but it also may be a sanitized version. It's
important to use that sanitized version to avoid path traversal attacks, as
well as problems like filenames that look like dashed options. 
If there is a path traversal attack, the program's stdin will be closed
without a path being written to it.

The program must write a regular file to the output file. Symlinks
or other special files will not be accepted as output files.

If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content to the file it is computing directly, rather than writing
to somewhere else and renaming it at the end. But, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.
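
A minimal sketch of the output handshake, followed by a write that appends to the output file so that it grows as work proceeds. The helpers `request_output` and `write_streaming` are hypothetical, the three chunks stand in for a real computation, and the temporary directory and here-document only imitate what git-annex would provide.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): announce an output file and
# read back the path to write it to. Sets output_path; fails when stdin is
# closed without a path being written to it.
request_output() {
    printf 'OUTPUT %s\n' "$1"
    IFS= read -r output_path || return 1
}

# Hypothetical computation: append to the output file in order, rather than
# writing somewhere else and renaming at the end.
write_streaming() {
    : > "$1"
    for chunk in one two three; do
        printf '%s\n' "$chunk" >> "$1"
    done
}

# Demonstration with a canned reply in place of git-annex.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
request_output file.jpeg >/dev/null <<EOF
$workdir/file.jpeg
EOF
write_streaming "$output_path"
```

The program writes to `$output_path`, the path it was given, and never to the name it requested.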

## other messages

As well as `INPUT` and `OUTPUT` described above, there are some other
messages that the program can output. All of these are optional.

* `PROGRESS 50%`
  
  To indicate its current progress while performing the computation,
  the program can output lines like this. This is not needed if the program
  streams output to an output file.

* `REPRODUCIBLE`
  
  This indicates that the results of the computation are expected to be
  bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
  the `--reproducible` option is set.

* `SANDBOX`

  After outputting this line, the program can read a line from stdin
  that will be the path to the directory it should sandbox to (which
  corresponds to the top of the git repository, so may be above its working
  directory). Any `INPUT` lines that come after `SANDBOX` will have input
  files be provided via paths that are inside the sandbox directory. Usually
  that is done by making hard links, but it will fall back to copying annexed
  files if the filesystem does not support hard links.

* `INPUT-REQUIRED`

  This works the same as `INPUT`, except when `git-annex addcomputed --fast`
  is being used to add a computation to the git-annex repository without
  actually performing it, the input file will be provided as a response
  to this, rather than the empty line provided as a response to `INPUT`.
  
  If the input file is not available for some reason, an empty line will
  still be provided as a response to this.
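
For a program that cannot stream to its output file, `PROGRESS` lines could be emitted along these lines. The helper `report_progress` and the four-step loop are hypothetical, and the sketch assumes whole-number percentages as in the `PROGRESS 50%` example.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): print a PROGRESS line for
# having completed step $1 out of $2 steps, as a whole-number percentage.
report_progress() {
    printf 'PROGRESS %d%%\n' $(( $1 * 100 / $2 ))
}

total=4
step=0
while [ "$step" -lt "$total" ]; do
    step=$((step + 1))
    # ... one unit of the computation would run here ...
    report_progress "$step" "$total"
done
```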

## example

An example `git-annex-compute-foo` shell script follows:

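    #!/bin/sh
    # Exit on any failure, including a read from a closed stdin.
    set -e
    if [ "$1" != "convert" ]; then
        echo "Usage: convert input output [passes=n]" >&2
        exit 1
    fi
    # Default to a single pass when passes=n was not given.
    if [ -z "$ANNEX_COMPUTE_passes" ]; then
        ANNEX_COMPUTE_passes=1
    fi
    # Request the input file; the reply is its path, or an empty line.
    echo "INPUT $2"
    IFS= read -r input
    # Announce the output file; the reply is the path to write it to.
    echo "OUTPUT $3"
    IFS= read -r output
    echo REPRODUCIBLE

    # Only compute when an input path was provided.
    if [ -n "$input" ]; then
        frobnicate --passes="$ANNEX_COMPUTE_passes" -i "$input" -o "$output" >&2
    fi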
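
A program like this can be exercised by hand, without git-annex, by feeding it the replies it expects on stdin. The harness below is only a sketch: the script name `git-annex-compute-demo` is hypothetical, `cp` stands in for the real computation, and the canned replies merely imitate git-annex. It answers `INPUT` with an empty line, as happens with `git-annex addcomputed --fast`, so no computation runs.

```shell
#!/bin/sh
# Hypothetical stand-alone harness; it only imitates git-annex's replies.
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# A small compute program with the same message flow as the example above.
cat > "$tmp/git-annex-compute-demo" <<'EOF'
#!/bin/sh
set -e
echo "INPUT $2"
IFS= read -r input
echo "OUTPUT $3"
IFS= read -r output
echo REPRODUCIBLE
if [ -n "$input" ]; then
    cp -- "$input" "$output"
fi
EOF

# Reply to INPUT with an empty line and to OUTPUT with a path to write to.
messages=$(printf '\n%s\n' "$tmp/file.jpeg" |
    sh "$tmp/git-annex-compute-demo" convert file.raw file.jpeg)
printf '%s\n' "$messages"
```

The messages printed are the ones the program would send to git-annex, and no output file is created because the reply to `INPUT` was empty.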