File: design-notes.txt

                         How cvs2svn Works
                         =================

                       Theory and requirements
                       ------ --- ------------

There are two main problems in converting a CVS repository to SVN:

- CVS does not record enough information to determine what actually
  happened to a repository.  For example, CVS does not record:

  - Which file modifications were part of the same commit

  - The timestamp of tag and branch creations

  - Exactly which revision was the base of a branch (there is
    ambiguity between x.y, x.y.2.0, x.y.4.0, etc.)

  - When the default branch was changed (for example, from a vendor
    branch back to trunk).

- The timestamps in a CVS archive are not reliable.  It can easily
  happen that timestamps are not even monotonic, and large errors (for
  example due to a failing server clock battery) are not unusual.

The absolutely crucial, sine qua non requirement of a conversion is
that the dependency relationships within a file be honored, mainly:

- A revision depends on its predecessor

- A branch creation depends on the revision from which it branched,
  and commits on the branch depend on the branch creation

- A tag creation depends on the revision being tagged

These dependencies are reliably defined in the CVS repository, and
they trump all others, so they are the scaffolding of the conversion.

Moreover, it is highly desirable that the timestamps of the SVN
commits be monotonically increasing.

Within these constraints we also want the results of the conversion to
resemble the history of the CVS repository as closely as possible.
For example, the set of file changes grouped together in an SVN commit
should be the same as the files changed within the corresponding CVS
commit, insofar as that can be achieved in a manner that is consistent
with the dependency requirements.  And the SVN commit timestamps
should recreate the time of the CVS commit as far as possible without
violating the monotonicity requirement.

The basic idea of the conversion is this: create the largest
conceivable changesets, then split up changesets as necessary to break
any cycles in the graph of changeset dependencies.  When all cycles
have been removed, then do a topological sort of the changesets (with
ambiguities resolved using CVS timestamps) to determine a
self-consistent changeset commit order.
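The final ordering step can be sketched as follows.  This is only an
illustration of a topological sort whose ties are broken by timestamp,
not the code that cvs2svn uses; the function name and the dictionary
layout of its arguments are assumptions made for the example.

```python
import heapq


def timestamp_topological_sort(timestamps, deps):
    """Order changesets so dependencies come first; break ties by timestamp.

    timestamps: {changeset_id: timestamp}
    deps: {changeset_id: set of changeset_ids that it depends on}
    Raises ValueError if the graph still contains a cycle.
    """
    pending = {cid: set(deps.get(cid, ())) for cid in timestamps}
    successors = {cid: [] for cid in timestamps}
    for cid, preds in pending.items():
        for pred in preds:
            successors[pred].append(cid)

    # Changesets with no unmet dependencies, earliest timestamp first.
    ready = [(timestamps[cid], cid) for cid, preds in pending.items()
             if not preds]
    heapq.heapify(ready)

    order = []
    while ready:
        _, cid = heapq.heappop(ready)
        order.append(cid)
        for succ in successors[cid]:
            pending[succ].discard(cid)
            if not pending[succ]:
                heapq.heappush(ready, (timestamps[succ], succ))

    if len(order) != len(timestamps):
        raise ValueError("dependency cycle: some changeset must be split")
    return order
```

If the sort cannot consume every changeset, a cycle remains and one of
the changesets involved has to be split before trying again.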

The quality of the conversion (not in terms of correctness, but in
terms of minimizing the number of svn commits) is mostly determined by
the cleverness of the heuristics used to split up cycles.  And all of
this has to be affordable, especially in terms of conversion time and
RAM usage, for even the largest CVS repositories.


                            Implementation
                            --------------

A cvs2svn run consists of a number of passes.  Each pass saves the
data it produces to files on disk, so that a) we don't hold huge
amounts of state in memory, and b) the conversion process is
resumable.

The intermediate files are referred to here by the symbolic constants
holding their filenames in config.py.


CollectRevsPass (formerly called pass1)
===============

The goal of this pass is to collect from the CVS files all of the data
that will be required for the conversion.  If the --use-internal-co
option was used, this pass also collects the file delta data; for
--use-rcs or --use-cvs, the actual file contents are read again in
OutputPass.

To collect this data, we walk over the repository, collecting data
about the RCS files into an instance of CollectData.  Each RCS file is
processed with rcsparse.parse(), which invokes callbacks from an
instance of cvs2svn's _FileDataCollector class (which is a subclass of
rcsparse.Sink).

While a file is being processed, all of the data for the file (except
for contents and log messages) is held in memory.  When the file has
been read completely, its data is converted into an instance of
CVSFileItems, and this instance is manipulated a bit then pickled and
stored to CVS_ITEMS_STORE.

For each RCS file, the first thing the parser encounters is the
administrative header, including the head revision, the principal
branch, symbolic names, RCS comments, etc.  The main thing that
happens here is that _FileDataCollector.define_tag() is invoked on
each symbolic name and its attached revision, so all the tags and
branches of this file get collected.

Next, the parser hits the revision summary section.  That's the part
of the RCS file that looks like this:

   1.6
   date 2002.06.12.04.54.12;    author captnmark;       state Exp;
   branches
        1.6.2.1;
   next 1.5;

   1.5
   date 2002.05.28.18.02.11;    author captnmark;       state Exp;
   branches;
   next 1.4;

   [...]

For each revision summary, _FileDataCollector.define_revision() is
invoked, recording that revision's metadata in various variables of
the _FileDataCollector class instance.

Next, the parser encounters the *real* revision data, which has the
log messages and file contents.  For each revision, it invokes
_FileDataCollector.set_revision_info(), which sets some more fields in
_RevisionData.

When the parser is done with the file, _ProjectDataCollector takes the
resulting CVSFileItems object and manipulates it to handle some CVS
features:

   - If the file had a vendor branch, make some adjustments to the
     file dependency graph to reflect implicit dependencies related to
     the vendor branch.  Also delete the 1.1 revision in the usual
     case that it doesn't contain any useful information.

   - If the file was added on a branch rather than on trunk, then
     delete the "dead" 1.1 revision on trunk in the usual case that it
     doesn't contain any useful information.

   - If the file was added on a branch after it already existed on
     trunk, then recent versions of CVS add an extra "dead" revision
     on the branch.  Remove this revision in the usual case that it
     doesn't contain any useful information, and sever the branch from
     trunk (since the branch version is independent of the trunk
     version).

   - If the conversion was started with the --trunk-only option, then

     1. graft any non-trunk default branch revisions onto trunk
        (because they affect the history of the default branch), and

     2. delete all branches and tags and all remaining branch
        revisions.

Finally, the CVSFileItems instance is stored to a database and
statistics about how symbols were used in the file are recorded.

That's it -- the RCS file is done.

When every CVS file is done, CollectRevsPass is complete, and:

   - The basic information about each project is stored to PROJECTS.

   - The basic information about each file and directory (filename,
     path, etc) is written as a pickled CVSPath instance to
     CVS_PATHS_DB.

   - Information about each symbol seen, along with statistics like
     how often it was used as a branch or tag, is written as a pickled
     symbol_statistics._Stat object to SYMBOL_STATISTICS.  This
     includes the following information:

         ID -- a unique positive identifying integer

         NAME -- the symbol name

         TAG_CREATE_COUNT -- the number of times the symbol was used
             as a tag

         BRANCH_CREATE_COUNT -- the number of times the symbol was
             used as a branch

         BRANCH_COMMIT_COUNT -- the number of files in which there was
             a commit on a branch with this name.

         BRANCH_BLOCKERS -- the set of other symbols that ever
             sprouted from a branch with this name.  (A symbol cannot
             be excluded from the conversion unless all of its
             blockers are also excluded.)

         POSSIBLE_PARENTS -- a count of in how many files each other
             branch could have served as the symbol's source.

     These data are used to look for inconsistencies in the use of
     symbols under CVS and to decide which symbols can be excluded or
     forced to be branches and/or tags.  The POSSIBLE_PARENTS data is
     used to pick the "optimum" parent from which the symbol should
     sprout in as many files as possible.

     For a multiproject conversion, distinct symbol records (and IDs)
     are created for symbols in separate projects, even if they have
     the same name.  This is to prevent symbols in separate projects
     from being filled at the same time.

   - Information about each CVS event is converted into a CVSItem
     instance and stored to CVS_ITEMS_STORE.  There are several types
     of CVSItems:

         CVSRevision -- A specific revision of a specific CVS file.

         CVSBranch -- The creation of a branch tag in a specific CVS
             file.

         CVSTag -- The creation of a non-branch tag in a specific CVS
             file.

     The CVSItems are grouped into CVSFileItems instances, one per
     CVSFile.  But a multi-file commit will still be scattered all
     over the place.

   - Selected metadata for each CVS revision, including the author and
     log message, is written to METADATA_INDEX_TABLE and
     METADATA_STORE.  The purpose is twofold: first, to save space by
     not having to save this information multiple times, and second
     because CVSRevisions that have the same metadata are candidates
     to be combined into an SVN changeset.

     First, an SHA digest is created for each set of metadata.  The
     digest is constructed so that CVSRevisions that can be combined
     are all mapped to the same digest.  CVSRevisions that were part
     of a single CVS commit always have a common author and log
     message, therefore these fields are always included in the
     digest.  Moreover:

     - if ctx.cross_project_commits is False, we avoid combining CVS
       revisions from separate projects by including the project.id in
       the digest.

     - if ctx.cross_branch_commits is False, we avoid combining CVS
       revisions from different branches by including the branch name
       in the digest.

     During the database creation phase, the database keeps track of a
     map

       digest (20-byte string) -> metadata_id (int)

     to allow the record for a set of metadata to be located
     efficiently.  As data are collected, it stores a map

       metadata_id (int) -> (author, log_msg,) (tuple)

     into the database for use in future passes.  CVSRevision records
     include the metadata_id.
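The two maps described above can be sketched in memory as follows.
This is an illustration only.  SHA-1 is assumed because it yields the
20-byte digest mentioned above; the class name, the field encoding,
and the way IDs are allocated are all assumptions of the example.  A
caller would pass project_id or branch_name only when the
corresponding ctx option is False.

```python
import hashlib


def metadata_digest(author, log_msg, project_id=None, branch_name=None):
    """Return a 20-byte SHA-1 digest identifying combinable metadata."""
    sha = hashlib.sha1()
    for field in (author, log_msg, project_id, branch_name):
        if field is None:
            # Field not part of the key (hypothetical marker).
            sha.update(b"N;")
        else:
            # Length-prefix each field so adjacent fields cannot run together.
            data = str(field).encode("utf-8")
            sha.update(b"%d:" % len(data) + data)
    return sha.digest()


class MetadataTable:
    """Keeps digest -> metadata_id and metadata_id -> (author, log_msg)."""

    def __init__(self):
        self._digest_to_id = {}
        self.records = {}

    def get_id(self, author, log_msg, project_id=None, branch_name=None):
        digest = metadata_digest(author, log_msg, project_id, branch_name)
        metadata_id = self._digest_to_id.get(digest)
        if metadata_id is None:
            # First time this metadata is seen: allocate a new ID.
            metadata_id = len(self._digest_to_id) + 1
            self._digest_to_id[digest] = metadata_id
            self.records[metadata_id] = (author, log_msg)
        return metadata_id
```

Two revisions that receive the same metadata_id are candidates for the
same changeset; revisions whose digests differ are never combined.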

During this run, each CVSFile, Symbol, CVSItem, and metadata record is
assigned an arbitrary unique ID that is used throughout the conversion
to refer to it.


CleanMetadataPass
=================

Encode the cvs revision metadata as UTF-8, ensuring that all entries
can be decoded using the chosen encodings.  Output the results to
METADATA_CLEAN_INDEX_TABLE and METADATA_CLEAN_STORE.


CollateSymbolsPass
==================

Use the symbol statistics collected in CollectRevsPass and any runtime
options to determine which symbols should be treated as branches,
which as tags, and which should be excluded from the conversion
altogether.

Create SYMBOL_DB, which contains a pickle of a list of TypedSymbol
(Branch, Tag, or ExcludedSymbol) instances indicating how each symbol
should be processed in the conversion.  The ID used for a TypedSymbol
is the same as the ID allocated to the corresponding symbol in
CollectRevsPass, so references in CVSItems do not have to be updated.


FilterSymbolsPass
=================

This pass works through the CVSFileItems instances stored in
CVS_ITEMS_STORE, processing all of the items from each file as a
group.  (This is the last pass in which all of the CVSItems for a file
are in memory at once.)  It does the following things:

   - Exclude any symbols that CollateSymbolsPass determined should be
     excluded, and any revisions on such branches.  Also delete
     references from other CVSItems to those that are being deleted.

   - Transform any branches to tags or vice versa, also depending on
     the results of CollateSymbolsPass, and fix up the references from
     other CVSItems.

   - Decide what line of development to use as the parent for each
     symbol in the file, and adjust the file's dependency tree
     accordingly.

   - For each CVSRevision, record the list of symbols that the
     revision opens and closes.

   - Write each surviving CVSRevision to CVS_REVS_DATAFILE.  Each line
     of the file has the format

         METADATA_ID TIMESTAMP CVS_REVISION

     where TIMESTAMP is a fixed-width timestamp, and CVS_REVISION is
     the pickled CVSRevision in a format that does not contain any
     newlines.  These summaries will be sorted in SortRevisionsPass
     then used by InitializeChangesetsPass to create preliminary
     RevisionChangesets.

   - Write the CVSSymbols to CVS_SYMBOLS_DATAFILE.  Each line of the
     file has the format

         SYMBOL_ID CVS_SYMBOL

     where CVS_SYMBOL is the pickled CVSSymbol in a format that does
     not contain any newlines.  This information will be sorted by
     SYMBOL_ID in SortSymbolsPass then used to create preliminary
     SymbolChangesets.

   - Invoke callback methods of the registered RevisionCollector.
     The purpose of RevisionCollectors and RevisionReaders is
     documented in the file revision-reader.txt.


SortRevisionsPass
=================

Sort CVS_REVS_DATAFILE (written by FilterSymbolsPass), creating
CVS_REVS_SORTED_DATAFILE.  The sort groups items that might be added
to the same changeset together and, within a group, sorts revisions by
timestamp.  This step makes it easy for InitializeChangesetsPass to
read the initial draft of RevisionChangesets straight from the file.


SortSymbolsPass
===============

Sort CVS_SYMBOLS_DATAFILE (written by FilterSymbolsPass), creating
CVS_SYMBOLS_SORTED_DATAFILE.  The sort groups together symbol items
that might be added to the same changeset (though not in anything
resembling chronological order).  The output of this pass is used by
InitializeChangesetsPass.


InitializeChangesetsPass
========================

This pass creates first-draft changesets, splitting them using
COMMIT_THRESHOLD and breaking up any revision changesets that have
internal dependencies.

The raw material for creating revision changesets is
CVS_REVS_SORTED_DATAFILE, which already has CVSRevisions sorted in
such a way that potential changesets are grouped together and sorted
by date.  The contents of this file are read line by line, and the
corresponding CVSRevisions are accumulated into a changeset.  Whenever
the metadata_id changes, or whenever there is a time gap of more than
COMMIT_THRESHOLD (currently set to 5 minutes) between CVSRevisions,
then a new changeset is started.
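The grouping rule can be sketched as follows.  The record layout
(metadata_id, timestamp, revision) and the function name are
assumptions of the example; the threshold is the 5 minutes given
above, expressed in seconds.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds


def group_into_changesets(sorted_revs, threshold=COMMIT_THRESHOLD):
    """Group (metadata_id, timestamp, rev) records into draft changesets.

    sorted_revs must already be sorted by metadata_id, then timestamp.
    """
    changesets = []
    current = []
    last_id = None
    last_time = None
    for metadata_id, timestamp, rev in sorted_revs:
        # Start a new changeset when the metadata changes or when the gap
        # since the previous revision exceeds the threshold.
        if current and (metadata_id != last_id
                        or timestamp - last_time > threshold):
            changesets.append(current)
            current = []
        current.append(rev)
        last_id, last_time = metadata_id, timestamp
    if current:
        changesets.append(current)
    return changesets
```

Note that the gap is measured between consecutive revisions, so a
changeset can span much longer than the threshold as long as no single
gap exceeds it.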

At this point a revision changeset can have internal dependencies if
two commits were made to the same file with the same log message
within COMMIT_THRESHOLD of each other.  The next job of this pass is
to split up changesets in such a way as to break such internal
dependencies.  This is done by sorting the CVSRevisions within a
changeset by timestamp, then choosing the split point that breaks the
most internal dependencies.  This procedure is continued recursively
until there are no more dependencies internal to a single changeset.
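A minimal sketch of this recursive splitting, assuming that each
revision is a (timestamp, rev_id) pair and that dependencies are given
as a dictionary (both are assumptions of the example, as is the
function name):

```python
def split_changeset(revs, depends_on):
    """Recursively split a changeset until it has no internal dependencies.

    revs: list of (timestamp, rev_id)
    depends_on: {rev_id: set of rev_ids that it depends on}
    Returns a list of changesets, each a list of rev_ids.
    """
    revs = sorted(revs)
    position = {rev_id: i for i, (_, rev_id) in enumerate(revs)}

    # Collect internal dependencies as (earlier_index, later_index) pairs.
    internal = []
    for _, rev_id in revs:
        for pred in depends_on.get(rev_id, ()):
            if pred in position and pred != rev_id:
                lo, hi = sorted((position[pred], position[rev_id]))
                internal.append((lo, hi))
    if not internal:
        return [[rev_id for _, rev_id in revs]]

    # Cutting just before index i breaks every pair with lo < i <= hi.
    best = max(range(1, len(revs)),
               key=lambda i: sum(1 for lo, hi in internal if lo < i <= hi))
    return (split_changeset(revs[:best], depends_on)
            + split_changeset(revs[best:], depends_on))
```

Every cut breaks at least one dependency and produces two strictly
smaller pieces, so the recursion terminates.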

Analogously, the CVSSymbol items from CVS_SYMBOLS_SORTED_DATAFILE are
grouped into symbol changesets.  (Symbol changesets cannot have
internal dependencies, so there is no need to break them up at this
stage.)

Finally, this pass writes a CVSItem database with the CVSItems written
in order grouped by the preliminary changeset to which they belong.
Even though the preliminary changesets still have to be split up to
form final changesets, grouping the CVSItems this way improves the
locality of disk accesses and thereby speeds up later passes.

The result of this pass is three sets of database files:

   - CVS_ITEM_TO_CHANGESET, which maps CVSItem ids to the id of the
     changeset containing the item, and

   - CHANGESETS_STORE and CHANGESETS_INDEX, which contain the
     changeset objects themselves, indexed by changeset id.

   - CVS_ITEMS_SORTED_STORE and CVS_ITEMS_SORTED_INDEX_TABLE, which
     contain the pickled CVSItems ordered by changeset.


BreakRevisionChangesetCyclesPass
================================

There can still be cycles in the dependency graph of
RevisionChangesets caused by:

   - Interleaved commits.  Since CVS commits are not atomic, it can
     happen that two commits are in progress at the same time and each
     alters the same two files, but in different orders.  These should
     be small cycles involving only a few revision changesets.  To
     resolve these cycles, one or more of the RevisionChangesets have
     to be split up (eventually becoming separate svn commits).

   - Cycles involving a RevisionChangeset formed by the accidental
     combination of unrelated items within a short period of time that
     have the same author and log message.  These should also be small
     cycles involving only a few changesets.

The job of this pass is to break up such cycles (those involving only
CVSRevisions).

This pass works by building up the graph of revision changesets and
their dependencies in memory, then attempting a topological sort of
the changesets.  Whenever the topological sort stalls, that implies
the existence of a cycle, one of which can easily be determined.  This
cycle is broken through the use of heuristics that try to determine an
"efficient" way of splitting one or more of the changesets that are
involved.
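The way a stalled sort exposes a cycle can be sketched as follows.
This shows only the cycle detection, not the splitting heuristics, and
the function name and graph representation are assumptions of the
example.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of ids, or None if acyclic.

    deps: {node: set of nodes that it depends on}; every node must
    appear as a key.
    """
    remaining = {node: set(preds) for node, preds in deps.items()}

    # Peel off nodes with no unmet dependencies, as a topological sort would.
    progress = True
    while progress:
        progress = False
        for node in [n for n, preds in remaining.items() if not preds]:
            del remaining[node]
            for preds in remaining.values():
                preds.discard(node)
            progress = True
    if not remaining:
        return None

    # The sort has stalled: every remaining node still has a remaining
    # predecessor, so following predecessors must revisit some node.
    node = next(iter(remaining))
    seen = []
    while node not in seen:
        seen.append(node)
        node = min(remaining[node])
    return seen[seen.index(node):]
```

The returned list follows predecessor links, so each entry depends on
the entry after it and the last entry depends on the first.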

The new RevisionChangesets are written to
CVS_ITEM_TO_CHANGESET_REVBROKEN, CHANGESETS_REVBROKEN_STORE, and
CHANGESETS_REVBROKEN_INDEX, along with the unmodified
SymbolChangesets.  These files are in the same format as the analogous
files produced by InitializeChangesetsPass.


RevisionTopologicalSortPass
===========================

Topologically sort the RevisionChangesets, thereby picking the order
in which the RevisionChangesets will be committed.  (Since the
previous pass eliminated any dependency cycles, this sort is
guaranteed to succeed.)  Ambiguities in the topological sort are
resolved using the changesets' timestamps.  Then simplify the
changeset graph into a linear chain by converting each
RevisionChangeset into an OrderedChangeset that stores dependency
links only to its commit-order predecessor and successor.  This
simplified graph enforces the commit order that resulted from the
topological sort, even after the SymbolChangesets are added back into
the graph later.  Store the OrderedChangesets into
CHANGESETS_REVSORTED_STORE and CHANGESETS_REVSORTED_INDEX along with
the unmodified SymbolChangesets.


BreakSymbolChangesetCyclesPass
==============================

It is possible for there to be cycles in the graph of SymbolChangesets
caused by:

   - Split creation of branches.  It is possible that branch A depends
     on branch B in one file, but B depends on A in another file.
     These cycles can be large, but they only involve
     SymbolChangesets.

Break up such dependency loops.  Output the results to
CVS_ITEM_TO_CHANGESET_SYMBROKEN, CHANGESETS_SYMBROKEN_STORE, and
CHANGESETS_SYMBROKEN_INDEX.


BreakAllChangesetCyclesPass
===========================

The complete changeset graph (including both RevisionChangesets and
BranchChangesets) can still have dependency cycles caused by:

   - Split creation of branches.  The same branch tag can be added to
     different files at completely different times.  It is possible
     that the revision that was branched later depends on a
     RevisionChangeset that involves a file on the branch that was
     created earlier.  These cycles can be large, but they always
     involve a SymbolChangeset.  To resolve these cycles, the
     SymbolChangeset is split up into two changesets.

In fact, tag changesets do not have to be considered--CVSTags cannot
participate in dependency cycles because no other CVSItem can depend
on a CVSTag.

Since the input of this pass has been through
RevisionTopologicalSortPass, all revision cycles have already been
broken up and the order that the RevisionChangesets will be committed
has been determined.  In this pass, the complete changeset graph is
created in memory, including the linear list of OrderedChangesets from
RevisionTopologicalSortPass plus all of the symbol changesets.
Because this pass doesn't break up any OrderedChangesets, it is
constrained to finding places within the revision changeset sequence
in which the symbol changeset commits can be inserted.

The new changesets are written to CVS_ITEM_TO_CHANGESET_ALLBROKEN,
CHANGESETS_ALLBROKEN_STORE, and CHANGESETS_ALLBROKEN_INDEX, which are
in the same format as the analogous files produced by
InitializeChangesetsPass.


TopologicalSortPass
===================

Now that the earlier passes have broken up any dependency cycles among
the changesets, it is possible to order all of the changesets in such
a way that all of a changeset's dependencies are committed before the
changeset itself.  This pass does so by again building up the graph of
changesets in memory, then at each step picking a changeset that has
no remaining dependencies and removing it from the graph.  Whenever
more than one dependency-free changeset is available, symbol
changesets are chosen before revision changesets.  As changesets are
processed, the timestamp sequence is kept monotonic by the
simple expedient of adjusting retrograde timestamps to be later than
their predecessor.  Timestamps that lie in the future, on the other
hand, are assumed to be bogus and are adjusted backwards, also to be
just later than their predecessor.
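The timestamp adjustment can be sketched as follows.  The function
name is hypothetical, and the choice of exactly one second as "just
later" is an assumption of the example, as is leaving a first
timestamp untouched when it has no predecessor.

```python
import time


def make_monotonic(timestamps, now=None):
    """Return a copy of timestamps adjusted to be strictly increasing.

    timestamps: integer timestamps in commit order.
    now: anything later than this is treated as bogus (default: the
    current time).
    """
    if now is None:
        now = int(time.time())
    adjusted = []
    previous = None
    for ts in timestamps:
        # Retrograde and future timestamps are both moved to just after
        # the predecessor.
        if previous is not None and (ts <= previous or ts > now):
            ts = previous + 1
        adjusted.append(ts)
        previous = ts
    return adjusted
```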

This pass writes a line to CHANGESETS_SORTED_DATAFILE for each
RevisionChangeset, in the order that the changesets should be
committed.  Each line contains

    CHANGESET_ID TIMESTAMP

where CHANGESET_ID is the id of the changeset in the
CHANGESETS_ALLBROKEN_* databases and TIMESTAMP is the timestamp that
should be assigned to it when it is committed.  Both values are
written in hexadecimal.


CreateRevsPass (formerly called pass5)
==============

This pass generates SVNCommits from Changesets and records symbol
openings and closings.  (One Changeset can result in multiple
SVNCommits, for example if it causes symbols to be filled or copies to
a vendor branch.)

This pass does the following:

1. Creates a database file to map Subversion revision numbers to
   SVNCommit instances (SVN_COMMITS_STORE and
   SVN_COMMITS_INDEX_TABLE).  Creates another database file to map CVS
   Revisions to their Subversion Revision numbers
   (CVS_REVS_TO_SVN_REVNUMS).

2. When a file is copied to a symbolic name in cvs2svn, it is copied
   from a specific source: either a CVSRevision, or a copy created by
   a previous CVSBranch of the file.  The copy has to be made from an
   SVN revision that is during the lifetime of the source.  The SVN
   revision when the source was created is called the symbol's
   "opening", and the SVN revision when it was deleted or overwritten
   is called the symbol's "closing".  In this pass, the
   SymbolingsLogger class writes out a line to
   SYMBOL_OPENINGS_CLOSINGS for each symbol opening or closing.  Note
   that some openings do not have closings, namely if the
   corresponding source is still present at the HEAD revision.

   The format of each line is:

       SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID

   For example:

       1c 234 O 1a7
       34 245 O 1a9
       18a 241 C 1a7
       122 201 O 1b3

   Here is what the columns mean:

   SYMBOL_ID -- The id of the branch or tag that has an opening or
       closing in this SVN_REVNUM, in hexadecimal.

   SVN_REVNUM -- The Subversion revision number in which the opening
       or closing occurred.  (There can be multiple openings and
       closings per SVN_REVNUM).

   TYPE -- "O" for openings and "C" for closings.

   CVS_SYMBOL_ID -- The id of the CVSSymbol instance whose opening or
       closing is being described, in hexadecimal.

   Each CVSSymbol that tags a non-dead file has exactly one opening
   and either zero or one closing.  The closing, if it exists, always
   occurs in a later SVN revision than the opening.

   See SymbolingsLogger for more details.
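A line in this format could be parsed as sketched below.  The names
are hypothetical.  The notes state that the two ids are hexadecimal;
treating SVN_REVNUM as decimal is an assumption of the example.

```python
from collections import namedtuple

SymbolEvent = namedtuple(
    "SymbolEvent", "symbol_id svn_revnum event_type cvs_symbol_id")


def parse_symboling_line(line):
    """Parse one 'SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID' line."""
    symbol_id, svn_revnum, event_type, cvs_symbol_id = line.split()
    if event_type not in ("O", "C"):
        raise ValueError("unknown event type: %r" % event_type)
    return SymbolEvent(
        int(symbol_id, 16),      # hexadecimal
        int(svn_revnum),         # assumed decimal
        event_type,              # "O" (opening) or "C" (closing)
        int(cvs_symbol_id, 16),  # hexadecimal
    )
```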


SortSymbolOpeningsClosingsPass (formerly called pass6)
==============================

This pass sorts SYMBOL_OPENINGS_CLOSINGS into
SYMBOL_OPENINGS_CLOSINGS_SORTED.  This orders the file first by symbol
ID, and second by Subversion revision number, thus grouping all
openings and closings for each symbolic name together.


IndexSymbolsPass (formerly called pass7)
================

This pass iterates through all the lines in
SYMBOL_OPENINGS_CLOSINGS_SORTED, writing out a pickle file
(SYMBOL_OFFSETS_DB) mapping SYMBOL_ID to the file offset in
SYMBOL_OPENINGS_CLOSINGS_SORTED where SYMBOL_ID is first encountered.
This will allow us to seek to the various offsets in the file and
sequentially read only the openings and closings that we need.
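The offset index and its use can be sketched as follows.  The function
names are hypothetical, and the file is assumed to be opened in binary
mode so that tell() and seek() work in plain byte offsets.

```python
def index_symbol_offsets(f):
    """Map each symbol id to the offset of its first line in sorted file f."""
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        symbol_id = int(line.split()[0], 16)
        # Only the first occurrence of each symbol id is recorded.
        offsets.setdefault(symbol_id, offset)
    return offsets


def read_symbol_lines(f, offsets, symbol_id):
    """Return the lines for symbol_id, reading sequentially from its offset."""
    f.seek(offsets[symbol_id])
    lines = []
    for line in iter(f.readline, b""):
        if int(line.split()[0], 16) != symbol_id:
            break  # the sorted file has moved on to the next symbol
        lines.append(line)
    return lines
```

Because the file is sorted by symbol ID, all of the lines for one
symbol are contiguous, and reading stops at the first line that
belongs to a different symbol.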


OutputPass (formerly called pass8)
==========

This pass opens the svn-commits database and sequentially plays out
all the commits to either a Subversion repository or to a dumpfile.
It also decides what sources to use to fill symbols.

In --dumpfile mode, the result of this pass is a Subversion repository
dumpfile (suitable for input to 'svnadmin load').  The dumpfile is the
data's last static stage: last chance to check over the data, run it
through svndumpfilter, move the dumpfile to another machine, etc.

When not in --dumpfile mode, no full dumpfile is created.  Instead,
miniature dumpfiles, each representing a single revision, are created,
loaded into the repository, and then removed.

In both modes, the dumpfile revisions are created by walking through
the SVN_COMMITS_* database.

The database in MIRROR_NODES_STORE and MIRROR_NODES_INDEX_TABLE holds
a skeletal mirror of the repository structure at each SVN revision.
This mirror keeps track of which files existed on each LOD, but does
not record any file contents.  cvs2svn requires this information to
decide which paths to copy when filling branches and tags.

When .cvsignore files are modified, cvs2svn computes the corresponding
svn:ignore properties and applies the properties to the parent
directory.  The .cvsignore files themselves are not included in the
output unless the --keep-cvsignore option was specified.  But in
either case, the .cvsignore files are recorded within the repository
mirror as if they were being written to disk, to ensure that the
containing directory is not pruned if the directory in CVS still
contained a .cvsignore file.


                  ===============================
                      Branches and Tags Plan.
                  ===============================

This pass is also where tag and branch creation is done.  Since
subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation.  For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
revision 52 and 53.

Tags are created as soon as cvs2svn encounters the last CVS Revision
that is a source for that tag.  The whole tag is created in one
Subversion commit.

Branches are created as soon as all of their prerequisites are in
place.  If a branch creation had to be broken up due to dependency
cycles, then non-final parts are also created as soon as their
prerequisites are ready.  In such a case, the SymbolChangeset
specifies how much of the branch can be created in each step.

How just-in-time branch creation works:

In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:

   1. A skeleton mirror of the subversion repository, that is, a
      record of which file existed on which LOD for each SVN revision.

   2. A tree for each CVS symbolic name, and the svn file/directory
      revisions from which various parts of that tree could be copied.

Each LOD is recorded as a tree using the following schema: unique keys
map to marshal.dumps() representations of dictionaries, which in turn
map path component names to other unique keys:

   root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
   entrykey1 ==> { entrynameX : entrykeyX, ... }
   entrykey2 ==> { entrynameY : entrykeyY, ... }
   entrykeyX ==> { etc, etc ...}
   entrykeyY ==> { etc, etc ...}

(The leaf nodes -- files -- are represented by None.)
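The schema can be illustrated with a plain dictionary standing in for
the database.  The function names and the use of integer keys are
assumptions of the example; only the marshal.dumps() encoding and the
use of None for files come from the description above.

```python
import marshal


def store_tree(db, tree, next_key):
    """Store nested dict `tree` in db; return (its key, next free key).

    Directories are dicts and files are None.
    """
    key = next_key
    next_key += 1
    entries = {}
    for name, child in tree.items():
        if child is None:
            entries[name] = None  # leaf node: a file
        else:
            entries[name], next_key = store_tree(db, child, next_key)
    db[key] = marshal.dumps(entries)
    return key, next_key


def path_exists(db, root_key, path):
    """Return True if path (e.g. 'trunk/src/main.c') exists under root_key."""
    key = root_key
    components = path.split("/")
    for i, name in enumerate(components):
        entries = marshal.loads(db[key])
        if name not in entries:
            return False
        key = entries[name]
        if key is None:
            # Reached a file; it must be the last path component.
            return i == len(components) - 1
    return True
```

Looking up a path costs one database read per path component, and
unchanged subtrees can be shared between revisions simply by reusing
their keys.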

The repository mirror allows cvs2svn to remember what paths exist in
what revisions.

For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).