How cvs2svn Works
=================
A cvs2svn run consists of nine passes. Each pass saves the data it
produces to files on disk, so that a) we don't hold huge amounts of
state in memory, and b) the conversion process is resumable.
CollectRevsPass (formerly called pass1)
===============
The goal of this pass is to write a summary of each CVS file as a
pickled CVSFile to 'cvs2svn-cvs-files.db', and a summary of each CVS
file's revisions as a list of pickled CVSRevisions to
'cvs2svn-cvs-items.pck'. In each case, items are assigned an
arbitrary key that is used to refer to them.
We walk over the repository, collecting data about the RCS files into
an instance of CollectData. Each RCS file is processed with
rcsparse.parse(), which invokes callbacks from an instance of
cvs2svn's _FileDataCollector class (which is a subclass of
rcsparse.Sink).
For each RCS file, the first thing the parser encounters is the
administrative header, including the head revision, the principal
branch, symbolic names, RCS comments, etc. The main thing that
happens here is that _FileDataCollector.define_tag() is invoked on
each symbolic name and its attached revision, so all the tags and
branches of this file get collected. When this stage is done, the
parser invokes admin_completed(), which writes the CVSFile to the
database.
Next, the parser hits the revision summary section. That's the part
of the RCS file that looks like this:
    1.6
    date 2002.06.12.04.54.12; author captnmark; state Exp;
    branches
        1.6.2.1;
    next 1.5;

    1.5
    date 2002.05.28.18.02.11; author captnmark; state Exp;
    branches;
    next 1.4;

    [...]
For each revision summary, _FileDataCollector.define_revision() is
invoked, recording that revision's metadata in various variables of
the _FileDataCollector class instance.
After finishing the revision summaries, the parser invokes
_FileDataCollector.tree_completed(), which loops over the revision
information stored, determining if there are instances where a higher
revision was committed "before" a lower one (rare, but it can happen
when there was clock skew on the repository machine). If there are
any, it "resyncs" the timestamp of the earlier rev to be just before
that of the later rev, but saves the original timestamp in
self._rev_data[blah].original_timestamp, so we can later write out a
record to the resync file indicating that an adjustment was made (this
makes it possible to catch the other parts of this commit and resync
them similarly; more details below).
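The timestamp adjustment described above can be sketched roughly as
follows. This is only an illustration, not the actual
tree_completed() code: the RevData class and the resync_chain()
function are hypothetical names, and treating equal timestamps as
out of order is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RevData:
    """Hypothetical stand-in for the per-revision data kept in memory."""
    rev: str
    timestamp: int
    original_timestamp: Optional[int] = None


def resync_chain(chain: List[RevData]) -> List[RevData]:
    """Fix clock skew along one line of development.

    `chain` is ordered from the lowest revision to the highest
    (e.g. 1.4, 1.5, 1.6).  Whenever a lower revision is not older
    than its successor, its timestamp is moved to just before the
    successor's, and the original value is remembered.
    """
    adjusted = []
    # Walk from the highest revision down, so that one adjustment can
    # trigger another adjustment further down the chain.
    for i in range(len(chain) - 1, 0, -1):
        later, earlier = chain[i], chain[i - 1]
        if earlier.timestamp >= later.timestamp:  # ties: an assumption
            if earlier.original_timestamp is None:
                earlier.original_timestamp = earlier.timestamp
            earlier.timestamp = later.timestamp - 1
            adjusted.append(earlier)
    return adjusted
```

Each entry in the returned list corresponds to one record that would
later be written to the resync file.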
Next, the parser encounters the *real* revision data, which has the
log messages and file contents. For each revision, it invokes
_FileDataCollector.set_revision_info(), which writes a record to
'cvs2svn-cvs-items.pck'.
Also, for resync'd revisions, a line like this is written out to
'cvs2svn-resync.txt':
3d6c1329 18a 3d6c1328
The fields are:
NEW_TIMESTAMP METADATA_ID OLD_TIMESTAMP
(The resync file will be explained later.)
That's it -- the RCS file is done.
When every CVS file is done, CollectRevsPass is complete, and:
- 'cvs2svn-cvs-files.db' contains a record of every CVS file.
- 'cvs2svn-cvs-items.pck' contains a summary of every revision to
every CVS file, including a reference to the corresponding CVS
file record in 'cvs2svn-cvs-files.db'. The revisions are sorted
in groups, one per CVSFile. But a multi-file commit will still
be scattered all over the place.
- 'cvs2svn-resync.txt' contains a small amount of resync data, in
no particular order.
- 'cvs2svn-symbol-stats.pck' contains a pickled list of symbol
statistics entries (instances of
cvs2svn_lib.symbol_statistics._Stats) for each symbol that was
seen in the CVS repository. This includes the following
information:
ID NAME TAG_COUNT BRANCH_COUNT BRANCH_COMMIT_COUNT BLOCKERS
where ID is a unique integer identifying this symbol, NAME is the
symbol name, TAG_COUNT and BRANCH_COUNT are the number of CVS
files on which this symbol was used as a tag or branch
respectively, and BRANCH_COMMIT_COUNT is the number of files for
which commits were made on a branch with the given name.
BLOCKERS is a list of other symbols that were defined on branches
named NAME. (A symbol cannot be excluded if it has any blockers
that are not also being excluded.) These data are used to look
for inconsistencies in the use of symbols under CVS and to decide
which symbols can be excluded or forced to be branches and/or
tags.
- 'cvs2svn-metadata.db' contains information that will help
determine what CVSRevisions are allowed to be combined into a
single SVNCommit. This class maps each CVSRevision to an SHA
digest that is constructed so that CVSRevisions that can be
combined are all mapped to the same digest.
CVSRevisions that were part of a single CVS commit always have a
common author and log message, therefore these fields are always
included in the digest. Moreover, if ctx.cross_project_commits
is False, we avoid combining CVS revisions from separate projects
by including the project.id in the digest. This database
contains two mappings for each digest:
digest (40-byte string) -> metadata_id (int)
metadata_id (int as hex) -> (project_id, author, log_msg,) (tuple)
The first mapping is used to locate the metadata_id for the
metadata record having a specific digest, and the second is used
as a key to locate the actual metadata. CVSRevision records
include the metadata_id.
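As an illustration of the two mappings in 'cvs2svn-metadata.db', here
is a minimal sketch. It assumes SHA-1 (whose hex digest is 40
characters long) and uses plain dictionaries in place of the database
file; the MetadataIndex class and its method names are hypothetical,
and the exact bytes fed to the digest are an assumption.

```python
import hashlib


class MetadataIndex:
    """Hypothetical in-memory stand-in for 'cvs2svn-metadata.db'."""

    def __init__(self, cross_project_commits):
        self.cross_project_commits = cross_project_commits
        self._digest_to_id = {}    # digest (40-char string) -> int
        self._id_to_metadata = {}  # id as hex string -> tuple

    def get_id(self, project_id, author, log_msg):
        """Return the metadata_id for this combination of fields."""
        h = hashlib.sha1()
        if not self.cross_project_commits:
            # Keep revisions from separate projects apart.
            h.update(('%x' % project_id).encode('ascii') + b'\0')
        h.update(author.encode('utf-8') + b'\0')
        h.update(log_msg.encode('utf-8'))
        digest = h.hexdigest()

        metadata_id = self._digest_to_id.get(digest)
        if metadata_id is None:
            metadata_id = len(self._digest_to_id) + 1
            self._digest_to_id[digest] = metadata_id
            self._id_to_metadata['%x' % metadata_id] = (
                project_id, author, log_msg,
            )
        return metadata_id

    def lookup(self, metadata_id):
        """Return the (project_id, author, log_msg) tuple."""
        return self._id_to_metadata['%x' % metadata_id]
```

With cross_project_commits set to False, two revisions with the same
author and log message but different projects get different ids; with
it set to True they share one.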
CollateSymbolsPass
==================
Use the symbol statistics collected in CollectRevsPass and any
command-line options to determine which symbols should be treated as
branches, which as tags, and which symbols should be excluded from the
conversion altogether.
Create 'cvs2svn-symbols.pck', which contains a pickle of a list of
BranchSymbol, TagSymbol, and ExcludedSymbol objects indicating how
each symbol should be processed in the conversion.
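The exclusion rule stated for BLOCKERS in CollectRevsPass (a symbol
cannot be excluded if it has any blockers that are not also being
excluded) can be checked along these lines. This is a sketch only;
the function name and the dictionary layout are hypothetical, not
the interface of the symbol statistics classes.

```python
def find_blocked_exclusions(blockers_by_symbol, excluded):
    """Return {symbol: [blockers]} for exclusions that are not allowed.

    blockers_by_symbol maps a symbol name to the names of the symbols
    defined on branches with that name.  excluded is the set of
    symbol names the user wants to exclude.
    """
    excluded = set(excluded)
    problems = {}
    for name in sorted(excluded):
        remaining = set(blockers_by_symbol.get(name, ())) - excluded
        if remaining:
            problems[name] = sorted(remaining)
    return problems
```

An empty result means that the requested exclusions are consistent.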
ResyncRevsPass (formerly called pass2)
==============
This is where the resync file is used. The goal of this pass is to
output the information from 'cvs2svn-cvs-items.pck' to a new file,
'cvs2svn-cvs-items-resync.pck' (resynched items) with its
corresponding index file, 'cvs2svn-cvs-items-resync-index.dat'. It
has the same content as the original file, except that the timestamps
of some CVSRevisions have been resynced.
First, read the whole resync file into a hash table that maps each
metadata_id to a list of lists. Each sublist represents one of the
timestamp adjustments from CollectRevsPass, and looks like this:
[old_time_lower, old_time_upper, new_time]
The reason to map each metadata_id to a list of sublists, instead of
to one list, is that sometimes you'll get the same metadata for
unrelated commits (for example, the same author commits many times
using the empty log message, or a log message that just says "Doc
tweaks."). So each metadata_id may need to "fan out" to cover
multiple commits, but without accidentally unifying those commits.
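A rough sketch of building that hash table from the resync lines
follows. The notes do not say how old_time_lower and old_time_upper
are derived; the sketch assumes a window of COMMIT_THRESHOLD / 2 on
either side of the old timestamp, and the value used for
COMMIT_THRESHOLD is likewise an assumption. The read_resync() name
is hypothetical.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def read_resync(lines):
    """Map metadata_id -> list of [old_lower, old_upper, new_time].

    Each input line has the form
        NEW_TIMESTAMP METADATA_ID OLD_TIMESTAMP
    with the timestamps written in hexadecimal.
    """
    resync = {}
    for line in lines:
        new_hex, metadata_id, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        # One sublist per adjustment, so that unrelated commits that
        # share the same metadata are not merged into one window.
        resync.setdefault(metadata_id, []).append([
            old_time - COMMIT_THRESHOLD // 2,  # assumed window
            old_time + COMMIT_THRESHOLD // 2,
            new_time,
        ])
    return resync
```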
Now we loop over the CVSRevisions in 'cvs2svn-cvs-items.pck', and for
each CVSRevision write a line to 'cvs2svn-revs-resync.txt'. Each line
of this file looks like this:
3dc32955 5a 12ab
The fields are:
1. a fixed-width timestamp
2. the metadata_id of the metadata (project, log message, author)
associated with this CVSRevision, as a hexadecimal string.
3. the integer unique ID for this CVSRevision, as a hexadecimal
string.
Any CVSRevision record in 'cvs2svn-cvs-items.pck' whose metadata_id
matches some resync entry and appears to be part of the same commit as
one of the sublists in that entry, gets tweaked. The tweak is to
adjust the commit time of the line to the new_time, which is taken
from the resync hash and results from the adjustment described in
CollectRevsPass.
The way we figure out whether a given line needs to be tweaked is to
loop over all the sublists, seeing if this commit's original time
falls within the old<-->new time range for the current sublist. If it
does, we tweak the line before writing it out, and then conditionally
adjust the sublist's range to account for the timestamp we just
adjusted (since it could be an outlier). Note that this could, in
theory, result in separate commits being accidentally unified, since
we might gradually adjust the two sides of the range such that they are
eventually more than COMMIT_THRESHOLD seconds apart. However, this is
really a case of CVS not recording enough information to disambiguate
the commits; we'd know we have a time range that exceeds the
COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
up. We could try some clever heuristic, but for now it's not
important -- after all, we're talking about commits that weren't
important enough to have a distinctive log message anyway, so does it
really matter if a couple of them accidentally get unified? Probably
not.
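The tweak loop described above might look roughly like this. It is a
sketch under assumptions, not the pass's real code: the function name
is hypothetical, and the way the window is widened after a match
(COMMIT_THRESHOLD / 2 around the timestamp just adjusted) is a guess
at what "conditionally adjust the sublist's range" means.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def resync_timestamp(resync, metadata_id, timestamp):
    """Return the (possibly adjusted) timestamp for one CVSRevision.

    resync maps metadata_id -> list of
    [old_time_lower, old_time_upper, new_time] sublists.
    """
    for record in resync.get(metadata_id, []):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            # This revision appears to belong to the resynced commit.
            # Widen the window in case this timestamp was an outlier
            # (assumed rule).
            record[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            record[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp
```

Because the window can grow with every match, repeated calls can
stretch it beyond COMMIT_THRESHOLD, which is the accidental
unification risk discussed above.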
SortRevsPass (formerly called pass3)
============
This is where we deduce the changesets, that is, the grouping of file
changes into single commits.
It's very simple -- run 'sort' on 'cvs2svn-revs-resync.txt',
converting it to 'cvs2svn-revs-resync-s.txt'. Because of the way the
data is laid out, this causes commits with the same metadata_id (that
is, the same author, log message, and optionally the same project) to
be grouped together. Poof! We now have the CVS changes grouped by
logical commit.
In some cases, the changes in a given commit may be interleaved with
other commits that went on at the same time, because the sort gives
precedence to date before metadata_id. However, CreateDatabasesPass
detects this by seeing that the metadata_id is different, and
re-separates the commits.
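To illustrate why a plain sort groups the revisions, and how
interleaved commits can be pulled apart again, here is a small
sketch. The line format follows the ResyncRevsPass description
(fixed-width hexadecimal timestamp first). The function names are
hypothetical, and splitting a commit when the gap exceeds
COMMIT_THRESHOLD is an assumption of the sketch rather than a
description of the later pass.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def format_line(timestamp, metadata_id, cvs_rev_id):
    """Format one line; the fixed-width timestamp makes a plain
    textual sort equivalent to a chronological sort."""
    return '%08x %x %x' % (timestamp, metadata_id, cvs_rev_id)


def split_commits(sorted_lines, threshold=COMMIT_THRESHOLD):
    """Group sorted lines into commits keyed by metadata_id.

    Lines of two commits made at the same time may be interleaved;
    they are separated again because their metadata_ids differ.
    """
    open_commits = {}  # metadata_id -> most recent commit record
    commits = []
    for line in sorted_lines:
        ts_hex, metadata_id, rev_id = line.split()
        ts = int(ts_hex, 16)
        current = open_commits.get(metadata_id)
        if current is None or ts - current['last'] > threshold:
            current = {'metadata_id': metadata_id, 'revs': []}
            open_commits[metadata_id] = current
            commits.append(current)
        current['revs'].append(rev_id)
        current['last'] = ts
    return [(c['metadata_id'], c['revs']) for c in commits]
```

For example, format_line(0x3dc32955, 0x5a, 0x12ab) reproduces the
sample line '3dc32955 5a 12ab' shown in ResyncRevsPass.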
CreateDatabasesPass (formerly called pass4)
===================
Find and create a database containing the last CVS revision that is a
source (also referred to as an "opening" revision) for each symbol.
This will result in a database containing key-value pairs whose key is
the id for a CVSRevision, and whose value is a list of symbol ids for
which that CVSRevision is the last "opening."
The format for this file is:
'cvs2svn-symbol-last-cvs-revs.db':
    Key                 Value
    CVS Revision ID     array of symbol ids
For example:
    5c --> [3, 8]
    62 --> [15]
    4d --> [29, 5]
    f  --> [18, 12]
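A minimal sketch of how such a mapping could be derived: remember,
for each symbol, the most recent CVSRevision that can act as a source
for it, then invert the result. The input format and the function
name are hypothetical, and a plain dictionary stands in for the
database file.

```python
def last_openings(revisions):
    """Build {cvs_rev_id as hex: [symbol ids]}.

    revisions is an iterable of (cvs_rev_id, symbol_ids) pairs in
    processing order, where symbol_ids lists the symbols for which
    that CVSRevision is a possible source ("opening").
    """
    last_rev_for_symbol = {}
    for cvs_rev_id, symbol_ids in revisions:
        for symbol_id in symbol_ids:
            # A later opening overrides any earlier one.
            last_rev_for_symbol[symbol_id] = cvs_rev_id

    db = {}
    for symbol_id, cvs_rev_id in last_rev_for_symbol.items():
        db.setdefault('%x' % cvs_rev_id, []).append(symbol_id)
    return db
```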
AggregateRevsPass (formerly called pass5)
=================
Primarily, this pass gathers CVS revisions into Subversion revisions
(a Subversion revision is comprised of one or more CVS revisions)
before we actually begin committing (where "committing" means either
to a Subversion repository or to a dump file).
This pass does the following:
1. Creates a database file to map Subversion revision numbers to
SVNCommit instances ('cvs2svn-svn-commits.db'). Creates another
database file to map CVS Revisions to their Subversion Revision
numbers ('cvs2svn-cvs-revs-to-svn-revnums.db').
2. When a file is copied to a symbolic name in cvs2svn, there is a
range of valid Subversion revisions that we can copy the file from.
The first valid Subversion revision number for a symbolic name is
called the "Opening", and the first *invalid* Subversion revision
number encountered after the "Opening" is called the "Closing". In
this pass, the SymbolingsLogger class writes out a line (for each
symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
the first possible source revision (the "opening" revision) for a
copy to create a branch or tag, or if it is the last possible
revision (the "closing" revision) for a copy to create a branch or
tag. Not every opening will have a corresponding closing.
The format of each line is:
SYMBOL_ID SVN_REVNUM TYPE BRANCH_ID CVS_FILE_ID
For example:
1c 234 O * 1a7
34 245 O * 1a9
18a 241 C 34 1a7
122 201 O 7e 1b3
Here is what the columns mean:
SYMBOL_ID: The id of the branch or tag that starts or ends in this
CVS Revision (there can be multiples per CVS rev).
SVN_REVNUM: The Subversion revision number that is the opening or
closing for this SYMBOL_ID.
TYPE: "O" for Openings and "C" for Closings.
BRANCH_ID: The id of the branch where this opening or closing
happened. '*' denotes the default branch.
CVS_FILE_ID: The ID of the CVS file where this opening or closing
happened, in hexadecimal.
See SymbolingsLogger for more details.
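For illustration, a line of this file could be parsed as shown below.
This is a sketch and not part of SymbolingsLogger: the Symboling
tuple and parse_symboling() are hypothetical names. The notes state
only that CVS_FILE_ID is hexadecimal, so reading SVN_REVNUM as
decimal and keeping the other ids as strings are assumptions.

```python
from collections import namedtuple

Symboling = namedtuple(
    'Symboling', 'symbol_id svn_revnum type branch_id cvs_file_id')


def parse_symboling(line):
    """Parse 'SYMBOL_ID SVN_REVNUM TYPE BRANCH_ID CVS_FILE_ID'."""
    symbol_id, revnum, type_, branch_id, cvs_file_id = line.split()
    if type_ not in ('O', 'C'):
        raise ValueError('unknown TYPE %r' % type_)
    return Symboling(
        symbol_id=symbol_id,
        svn_revnum=int(revnum),          # assumed to be decimal
        type=type_,
        # '*' denotes the default branch; represented here as None.
        branch_id=None if branch_id == '*' else branch_id,
        cvs_file_id=int(cvs_file_id, 16),
    )
```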
SortSymbolsPass (formerly called pass6)
===============
This pass merely sorts 'cvs2svn-symbolic-names.txt' into
'cvs2svn-symbolic-names-s.txt'. This orders the file first by
symbolic name, and second by Subversion revision number, thus grouping
all openings and closings for each symbolic name together.
IndexSymbolsPass (formerly called pass7)
================
This pass iterates through all the lines in
'cvs2svn-symbolic-names-s.txt', writing out a database file
('cvs2svn-symbolic-name-offsets.db') mapping SYMBOL_ID to the file
offset in 'cvs2svn-symbolic-names-s.txt' where SYMBOL_ID is first
encountered. This will allow us to seek to the various offsets in the
file and sequentially read only the openings and closings that we
need.
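The offset index can be sketched as follows, with a dictionary
standing in for 'cvs2svn-symbolic-name-offsets.db' and any seekable
binary file object standing in for the sorted file. Both function
names are hypothetical.

```python
def index_symbol_offsets(sorted_file):
    """Map each SYMBOL_ID to the offset of its first line.

    sorted_file must be opened in binary mode, so that tell() and
    seek() work on plain byte offsets.
    """
    offsets = {}
    while True:
        offset = sorted_file.tell()
        line = sorted_file.readline()
        if not line:
            break
        symbol_id = line.split(None, 1)[0]
        # Only the first occurrence of each symbol is recorded.
        offsets.setdefault(symbol_id, offset)
    return offsets


def read_symbol(sorted_file, offsets, symbol_id):
    """Read only the openings and closings of one symbol."""
    sorted_file.seek(offsets[symbol_id])
    records = []
    for line in iter(sorted_file.readline, b''):
        if line.split(None, 1)[0] != symbol_id:
            break  # the file is sorted, so the group has ended
        records.append(line.decode('ascii').split())
    return records
```

The sketch relies on the file being sorted, so that all lines for a
symbol are contiguous and reading can stop at the first line that
belongs to another symbol.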
OutputPass (formerly called pass8)
==========
This pass has very little "thinking" to do--it basically opens the
svn-nums-to-cvs-revs.db and, starting with Subversion revision 2
(revision 1 creates /trunk, /tags, and /branches), sequentially plays
out all the commits to either a Subversion repository or to a
dumpfile.
In --dumpfile mode, the result of this pass is a Subversion repository
dumpfile (suitable for input to 'svnadmin load'). The dumpfile is the
data's last static stage: last chance to check over the data, run it
through svndumpfilter, move the dumpfile to another machine, etc.
When not in --dumpfile mode, no full dumpfile is created. Instead,
miniature dumpfiles representing a single revision are created, loaded
into the repository, and then removed.
In both modes, the dumpfile revisions are created by walking through
'cvs2svn-data.s-revs.txt'.
The databases 'cvs2svn-svn-nodes.db' and 'cvs2svn-svn-revisions.db'
form a skeletal (metadata only, no content) mirror of the repository
structure that cvs2svn is creating. They provide data about previous
revisions that cvs2svn requires while constructing the dumpstream.
===============================
Branches and Tags Plan.
===============================
This pass is also where tag and branch creation is done. Since
subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation. For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
revision 52 and 53.
Also, since CVS does not version symbolic names, there is the
secondary question of *when* to create a particular tag or branch.
For example, a tag might have been made at any time after the youngest
commit included in it, or might even have been made piecemeal; and the
same is true for a branch, with the added constraint that for any
particular file, the branch must have been created before the first
commit on the branch.
Answering the second question first: cvs2svn creates tags as soon as
possible and branches as late as possible.
Tags are created as soon as cvs2svn encounters the last CVS Revision
that is a source for that tag. The whole tag is created in one
Subversion commit.
For branches, this is "just in time" creation -- the moment it sees
the first commit on a branch, it snaps the entire branch into
existence (or as much of it as possible), and then outputs the branch
commit.
The reason we say "as much of it as possible" is that it's possible to
have a branch where some files have branch commits occurring earlier
than the other files even have the source revisions from which the
branch sprouts (this can happen if the branch was created piecemeal,
for example). In this case, we create as much of the branch as we
can, that is, as much of it as there are source revisions available to
copy, and leave the rest for later. "Later" might mean just until
other branch commits come in, or else during a cleanup stage that
happens at the end of this pass (about which more later).
How just-in-time branch creation works:
In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:
1. A skeleton mirror of the subversion repository, that is, an
array of revisions, with a tree hanging off each revision. (The
"array" is actually implemented as an anydbm database itself,
mapping string representations of numbers to root keys.)
2. A tree for each CVS symbolic name, and the svn file/directory
revisions from which various parts of that tree could be copied.
Both tree sets live in anydbm databases, using the same basic schema:
unique keys map to marshal.dumps() representations of dictionaries,
which in turn map entry names to other unique keys:
root_key ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
entrykey1 ==> { entrynameX : entrykeyX, ... }
entrykey2 ==> { entrynameY : entrykeyY, ... }
entrykeyX ==> { etc, etc ...}
entrykeyY ==> { etc, etc ...}
(The leaf nodes -- files -- are also dictionaries, for simplicity.)
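The schema can be illustrated with a small sketch in which a plain
dictionary stands in for the anydbm database. The TreeStore class,
its methods, and the key-generation scheme are hypothetical; the
sketch also edits nodes in place and does not show how separate
revisions share or copy subtrees.

```python
import marshal


class TreeStore:
    """Unique keys map to marshal.dumps() of {entry name: key}."""

    def __init__(self):
        self.db = {}        # stand-in for an anydbm database
        self._next_key = 0

    def new_node(self):
        """Create an empty node (directory or file); return its key."""
        key = '%x' % self._next_key
        self._next_key += 1
        self.db[key] = marshal.dumps({})
        return key

    def add_path(self, root_key, path):
        """Create any missing nodes along path; return the leaf key."""
        key = root_key
        for name in path.split('/'):
            entries = marshal.loads(self.db[key])
            if name not in entries:
                entries[name] = self.new_node()
                self.db[key] = marshal.dumps(entries)
            key = entries[name]
        return key

    def lookup(self, root_key, path):
        """Return the key for path, or None if it does not exist."""
        key = root_key
        for name in path.split('/'):
            key = marshal.loads(self.db[key]).get(name)
            if key is None:
                return None
        return key
```

As in the notes, a leaf node (a file) is stored as an empty
dictionary, so files and directories are handled by the same code.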
The repository mirror allows cvs2svn to remember what paths exist in
what revisions.
For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).
-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
Some older notes and ideas about cvs2svn. Not deleted, because they
may contain suggestions for future improvements in design.
-----------------------------------------------------------------------
An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.
------
From: John Gardiner Myers <jgmyers@speakeasy.net>
Subject: Thoughts on CVS to SVN conversion
To: gstein@lyra.org
Date: Sun, 15 Apr 2001 17:47:10 -0700
Some things you may want to consider for a CVS to SVN conversion utility:
If converting a CVS repository to SVN takes days, it would be good for
the conversion utility to keep its progress state on disk. If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.
It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository. This allows
the more relaxed conversion procedure:
1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
6) Repeat step (2).
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decommission the CVS repository.
You may forward this message or parts of it as you see fit.
------
-----------------------------------------------------------------------
Further design thoughts from Greg Stein <gstein@lyra.org>
* timestamp the beginning of the process. ignore any commits that
occur after that timestamp; otherwise, you could miss portions of a
commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
revision for items in B; we missed A)
* the above timestamp can also be used for John's "grab any updates
that were missed in the previous pass."
* for each file processed, watch out for simultaneous commits. this
may cause a problem during the reading/scanning/parsing of the file,
or the parse succeeds but the results are garbaged. this could be
fixed with a CVS lock, but I'd prefer read-only access.
algorithm: get the mtime before opening the file. if an error occurs
during reading, and the mtime has changed, then restart the file. if
the read is successful, but the mtime changed, then restart the
file.
* use a separate log to track unique branches and non-branched forks
of revision history (Q: is it possible to create, say, 1.4.1.3
without a "real" branch?). this log can then be used to create a
/branches/ directory in the SVN repository.
Note: we want to determine some way to coalesce branches across
files. It can't be based on name, though, since the same branch name
could be used in multiple places, yet they are semantically
different branches. Given files R, S, and T with branch B, we can
tie those files' branch B into a "semantic group" whenever we see
commit groups on a branch touching multiple files. Files that
have a (named) branch but no commits on it are simply ignored. For
each "semantic group" of a branch, we'd create a branch based on
their common ancestor, then make the changes on the children as
necessary. For single-file commits to a branch, we could use
heuristics (pathname analysis) to add these to a group (and log what
we did), or we could put them in a "reject" kind of file for a human
to tell us what to do (the human would edit a config file of some
kind to instruct the converter).
* if we have access to the CVSROOT/history, then we could process tags
properly. otherwise, we can only use heuristics or configuration
info to group up tags (branches can use commits; there are no
commits associated with tags)
* ideally, we store every bit of data from the ,v files to enable a
complete restoration of the CVS repository. this could be done by
storing properties with CVS revision numbers and stuff (i.e. all
metadata not already embodied by SVN would go into properties)
* how do we track the "states"? I presume "dead" is simply deleting
the entry from SVN. what are the other legal states, and do we need
to do anything with them?
* where do we put the "description"? how about locks, access list,
keyword flags, etc.
* note that using something like the SourceForge repository will be an
ideal test case. people *move* their repositories there, which means
that all kinds of stuff can be found in those repositories, from
wherever people used to run them, and under whatever development
policies may have been used.
For example: I found one of the projects with a "permissions 644;"
line in the "gnuplot" repository. Most RCS releases issue warnings
about that (although they properly handle/skip the lines), and CVS
ignores RCS newphrases altogether.
# vim:tw=70