File: TODO

                                                  -*- indented-text -*-
$Id: TODO,v 1.19 2000/08/25 08:41:20 mbp Exp $

 * Tool to generate standalone signatures.  But don't conflict with
   xdelta here.

 * Meta-programming

   * Plot lengths of each function

   * Some kind of statistics on delta each day

 * Encoding format

   * Include a version in the signature and difference fields

   * Remember to update them if we ever ship a buggy version (nah!) so
     that other parties can know not to trust the encoded data.

 * Abstract encoding

   In fact, we can vary several different variables:

     * which signature format we use

     * which command protocol we use

     * which search algorithm we use

     * which implementation version we are

   Some are more likely to change than others.  We need a chart
   showing which source files depend on which variable.

 * Error handling

   * What happens if the user terminates the request?

 * Do HTTP CONNECT

   * This might be a nice place to use select!

 * Non-blocking IO

   * Rewrite decoder to use mapptr for input.

   * Somehow do nonblocking output.  Presumably we want a separate
     callback to say when we're ready to produce output, and also a
     way to refrain from reading input when there's no buffer space
     left.

   * Also, this should interoperate smoothly with zlib.

   * Then we could use select for custom timeouts!

 * Delete rubbish modules from CVS

 * Encoding implementation

   * Join up copy commands through the copyq, if this is not already
     done.

   * Join up signature commands
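
     The copy-joining step could look something like this minimal
     sketch, where the copyq holds back the most recent COPY so that
     a contiguous successor can be merged into it.  The class and
     callback names are invented for illustration, not libhsync's
     actual API.

```python
# Sketch of joining adjacent COPY commands through a one-deep queue.
# Copy/CopyQueue are illustrative names, not libhsync's API.

class Copy:
    def __init__(self, offset, length):
        self.offset = offset
        self.length = length

class CopyQueue:
    """Hold back the last COPY so an adjacent one can be merged in."""
    def __init__(self, emit):
        self.pending = None
        self.emit = emit          # callback that writes a command out

    def push(self, offset, length):
        p = self.pending
        if p is not None and offset == p.offset + p.length:
            p.length += length    # contiguous: extend, don't emit
        else:
            self.flush()
            self.pending = Copy(offset, length)

    def flush(self):
        if self.pending is not None:
            self.emit(self.pending)
            self.pending = None
```

     Two contiguous copies collapse into one command; a discontiguous
     one flushes the pending command first.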

 * Encoding algorithm

   * Self-referential copy commands

     Suppose we have a file with repeating blocks.  The gdiff format
     allows COPY commands to refer back into the *output* file, so
     repeated blocks can be expressed directly.  By doing this, we get
     compression as well as differencing.

     It'd be pretty simple to implement this, I think: as we produce
     output, we'd also generate checksums (using the search block
     size), and add them to the sum set.  Then matches will fall out
     automatically, although we might have to specially allow for
     short blocks.

     However, I don't see many files which have repeated 1kB chunks,
     so I don't know if it would be worthwhile.
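
     The sketch below illustrates the idea, with Python standing in
     for the C implementation: every byte written to the output
     stream is also checksummed at the search block size, and the
     block sums are added to the match table so that later input can
     be matched against earlier output.  weak_sum and the table
     layout are placeholders, not the real checksum or API.

```python
# Sketch: checksum the encoded output at the search block size and
# add those sums to the match table, so later repeats in the new file
# can be emitted as COPYs into the *output* (self-referential copies).

BLOCK = 4  # search block size; tiny here for illustration

def weak_sum(block):
    return sum(block) & 0xFFFF  # placeholder, not libhsync's checksum

class OutputSums:
    def __init__(self):
        self.table = {}      # weak sum -> offset in the output stream
        self.buf = bytearray()
        self.offset = 0      # offset of buf[0] within the output

    def feed(self, data):
        """Called with every byte written to the output stream."""
        self.buf += data
        while len(self.buf) >= BLOCK:
            block = bytes(self.buf[:BLOCK])
            # keep the first occurrence of each sum
            self.table.setdefault(weak_sum(block), self.offset)
            del self.buf[:BLOCK]
            self.offset += BLOCK
```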

   * Extended files

     Suppose the new file just has data added to the end.  At the
     moment we'll match everything but the last block of the old
     file: it won't match, because the search block size is only
     reduced at the end of the *new* file.  This is a little
     inefficient, because ideally we'd know to look for the last
     block using its shortened length.

     This is a little hard to implement, though perhaps not
     impossible.  The current rolling search algorithm can only look
     for one block size at any time.  Can we do better?  Can we look
     for all block lengths that could match anything?

     Remember also that at the moment we don't send the block length
     in the signature; it's implied by the length of the new block
     that it matches.  This is kind of cute, and importantly helps
     reduce the length of the signature.
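
     For concreteness, the current single-block-length rolling search
     depends on a weak checksum that can slide one byte at a time,
     along these lines (a simplified rsync-style sum; the real
     checksum's constants and layout may differ):

```python
# Simplified rsync-style weak checksum to illustrate rolling search.
# s1 is the plain byte sum; s2 weights each byte by its distance from
# the end of the window, so both can be slid one byte in O(1).

M = 1 << 16

def weak(block):
    s1 = sum(block) % M
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) % M
    return (s2 << 16) | s1

def roll(sig, out_byte, in_byte, blocklen):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    s1 = sig & 0xFFFF
    s2 = sig >> 16
    s1 = (s1 - out_byte + in_byte) % M
    s2 = (s2 - blocklen * out_byte + s1) % M
    return (s2 << 16) | s1
```

     The O(1) roll only works for one fixed blocklen at a time, which
     is exactly why a shortened final block is hard to look for.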

   * State-machine searching

     Building a state machine from a regular expression is a brilliant
     idea.  (I think `The Practice of Programming' walks through the
     construction of this at a fairly simple level.)

     In particular, we can search for any of a large number of
     alternatives in a very efficient way, with much less effort than
     it would take to search for each one separately.  Remember also
     the classic string-searching algorithms and how much time naive
     search can take.

     I wonder if we can use similar principles here rather than the
     current simple rolling-sum mechanism?  Could it let us match
     variable-length signatures?

   * Cross-file matches

     If the downstream server had many similar URLs, it might be nice
     if it could draw on all of them as a basis.  At the moment
     there's no way to express this, and I think the work of sending
     up signatures for all of them may be too hard.

     Better just to make sure we choose the best basis when the
     requested file itself is not present.  Perhaps this needs to
     weigh several factors.

     One factor might be that larger files are better because they're
     more likely to contain a match.  I'm not sure that argument is
     very strong, because their larger signatures will bloat the
     request.  Another is that more recent files might be more
     useful.

 * Support gzip compression of the difference stream.  Does this
   belong here, or should it be in the client and libhsync just have
   an interface that lets it cleanly plug in?

 * Licensing

   * Will the GNU Lesser GPL work?  Specifically, will it be a problem
     in distributing this with Mozilla or Apache?

 * Checksums

   * Do we really need to require that signatures arrive after the
     data they describe?  Does it make sense in HTTP to resume an
     interrupted transfer?

     I hope we can do this.  If we can't, however, then we should
     relax this constraint and allow signatures to arrive before the
     data they describe.  (Really?  Do we care?)

   * Allow variable-length checksums in the signature; the signature
     will have to describe the length of the sums and we must compare
     them taking this into account.
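
     A sketch of how that comparison might work: the signature header
     records how many bytes of each strong sum were kept, and every
     comparison truncates the locally computed sum to that length.
     md5 stands in here for whatever strong hash is actually used,
     and the field names are invented.

```python
# Sketch of variable-length strong sums: the signature carries its
# own sum length, and lookups truncate to match it.

import hashlib

def strong_sum(block, length):
    # md5 is a stand-in for the signature's real strong hash
    return hashlib.md5(block).digest()[:length]

class Signature:
    def __init__(self, sum_length):
        self.sum_length = sum_length   # read from the signature header
        self.sums = {}                 # truncated sum -> block index

    def add(self, index, block):
        self.sums[strong_sum(block, self.sum_length)] = index

    def lookup(self, block):
        return self.sums.get(strong_sum(block, self.sum_length))
```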

 * Testing

   * test broken pipes

   * Test files >2GB, >4GB.  Presumably these must be done in streams
     so that the disk requirements to run the test suite are not too
     ridiculous.  I wonder if it will take too long to run these
     tests?  Probably, but perhaps we can afford to run just one
     carefully-chosen test.
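
     One way to keep the disk requirements sane is for both ends of
     the test to regenerate the same deterministic pseudo-random
     stream from a seed, so even a >4GB test never touches disk.  A
     sketch, with an arbitrary LCG chosen purely for illustration:

```python
# Disk-free large-file testing: regenerate a reproducible byte stream
# from a seed instead of storing multi-gigabyte fixtures.

def test_stream(seed, total, chunk=1 << 16):
    """Yield `total` reproducible pseudo-random bytes in chunks."""
    state = seed & 0xFFFFFFFF
    produced = 0
    while produced < total:
        n = min(chunk, total - produced)
        out = bytearray(n)
        for i in range(n):
            # simple 32-bit LCG; any reproducible generator would do
            state = (state * 1103515245 + 12345) & 0xFFFFFFFF
            out[i] = state >> 24
        produced += n
        yield bytes(out)
```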

 * Web site in Latte?

 * Use slprintf rather than snprintf, etc.

 * Long files

   * How do we handle the large signatures required to support large
     files?  In particular, how do we choose an appropriate block size
     when the length is unknown?  Perhaps we should allow a way for
     the signature to scale up as it grows.

   * What do we need to do to compile in support for this?

     * On GNU, defining _LARGEFILE_SOURCE as we now do should be
       sufficient.

     * SCO and similar things on 32-bit platforms may be more
       difficult.

     * On larger Unix platforms we hope that large file support will
       be the default.
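
   When the length is known up front, one possible answer to the
   block-size question is to scale the block size with the square
   root of the file length, so the signature grows as sqrt(n) rather
   than linearly.  The constants below are invented for illustration,
   not what libhsync actually uses.

```python
# One possible block-size heuristic: block size ~ sqrt(file length),
# clamped and rounded, so signature size grows as sqrt(n).

import math

MIN_BLOCK = 256
MAX_BLOCK = 1 << 16

def choose_block_size(file_len):
    if file_len <= 0:
        return MIN_BLOCK
    b = int(math.sqrt(file_len))
    b -= b % 16                      # keep it a multiple of 16
    return max(MIN_BLOCK, min(MAX_BLOCK, b))
```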

 * Perhaps make extracted signatures still be wrapped in commands.
   What would this lead to?

   * We'd know how much signature data we expect to read, rather than
     requiring it to be terminated by the caller.
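
   A sketch of such a command wrapper: a small header carries the
   payload length, so the reader knows exactly how much signature
   data to expect.  The command byte and header layout here are
   invented, not part of any real encoding.

```python
# Sketch of wrapping extracted signatures in a length-prefixed
# command, so the reader need not be terminated by the caller.

import struct

SIG_CMD = 0x53  # 'S'; illustrative command byte

def write_sig_command(payload):
    # 1-byte command, 4-byte big-endian payload length, then payload
    return struct.pack(">BI", SIG_CMD, len(payload)) + payload

def read_sig_command(data):
    cmd, length = struct.unpack(">BI", data[:5])
    assert cmd == SIG_CMD
    return data[5:5 + length]
```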

 * Selective trace of particular areas of the library.

 * Try to make libhsync more generally useful

   * Tease apart the different algorithms inside the library so that
     they can be independently re-used.

   * Use more specific names than `encode' and `decode'.  I like
     `apply' as a description of what the client does.

   * Don't imply any particular encoding format.  I don't think we
     need callbacks; it's enough that all the users can write their
     own implementations of the loops.

   * This ought to make it simpler to get push-structured interfaces.

 * More thorough testing

   * mdfour

     * keep some example files and check that they give the expected
       results.

     * try using different input chunk sizes
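
       The chunk-size test boils down to checking that an incremental
       hash is invariant to how the input is split.  A sketch, with
       hashlib's md5 standing in for the library's mdfour:

```python
# Chunking-invariance test pattern for an incremental hash: the same
# data fed in different chunk sizes must give identical digests.
# hashlib.md5 stands in here for the library's mdfour.

import hashlib

def digest_in_chunks(data, chunk_size):
    h = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()
```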

   * abstract i/o

     * rewrite cat to use our routines, and make sure it passes the
       file through correctly however we call it.  Different
       strategies are possible: using the loop functions or not, using
       mapptrs, etc.

   * test within the proxy

     * We need dynamically generated reproducible content that varies
       in an interesting way.  cvsweb sounds ideal: we'll fetch its
       output both through the proxy and directly, and compare the
       results.

   * run regression suite from CVS every night

 * Portability

   * Tux only knows what portability assumptions we've made.

   * In particular, running this on a RISC box would be interesting.
     Endianness?  We now have us4.samba.org, which is a big SGI MIPS
     machine, so it will do pretty well as a test.