This is a description of how symbols (tags and branches) are handled
by cvs2svn, determined by reading the code.
CVSFile -- a single file within the CVS repository. This object
basically only records the filename of the corresponding RCS
file, and the relative filename that this file will have
within the SVN repository. A single CVSFile object is used
for all of the CVSItems on all lines of development related to
The following terms and the corresponding classes represent
project-wide concepts. For example, a project will only have a single
Branch named "foo" even if many files appear on that branch. Each of
these objects is assigned a unique integer ID during CollectRevsPass
which is preserved during the entire conversion (even if, say, a
Branch is mutated into a Tag).
Trunk -- the main line of development for a particular Project in
CVS. The Trunk class inherits from LineOfDevelopment.
Symbol -- a Branch or a Tag within a particular Project (see
below). Instances of this class are also used to represent
symbols early in the conversion, before it has been decided
whether to convert the symbol as a Branch or as a Tag. A
Symbol contains an id, a Project, and a name.
Branch -- a symbol within a particular Project that will be
treated as a branch in SVN. Usually corresponds to a branch
tag in CVS, but might be a non-branch tag that was mutated in
CollateSymbolsPass. In SVN, this will correspond to a
subdirectory of the project's "branches" directory. The
Branch class inherits from Symbol and from LineOfDevelopment.
Tag -- a symbol within a particular Project that will be treated
as a tag in SVN. Usually corresponds to a non-branch tag in
CVS, but might be a branch tag that was mutated in
CollateSymbolsPass. In SVN, this will correspond to a
subdirectory of the project's "tags" directory. The Tags
class inherits from Symbol and from LineOfDevelopment.
ExcludedSymbol -- a CVS symbol that will be excluded from the
LineOfDevelopment -- a Trunk, Branch, or Tag.
The following terms and the corresponding classes represent particular
CVS events in particular CVS files. For example, the CVSBranch
representing the creation of Branch "foo" in one file will be distinct
from the CVSBranch representing the creation of branch "foo" in
another file, even if the two files are in the same Project. Each
CVSItem is assigned a unique integer ID during CollectRevsPass which
is preserved during the entire conversion (even if, say, a CVSBranch
is mutated into a CVSTag).
CVSItem -- abstract base class representing any discernible event
within a single RCS file, for example the creation of revision
1.6, or the tagging of the file with tag "bar". Each CVSItem
has a unique integer ID.
CVSRevision -- a particular revision within a particular file
(e.g., file.txt:1.6). A CVSRevision occurs on a particular
LineOfDevelopment. CVSRevision inherits from CVSItem.
CVSSymbol -- a CVSBranch or CVSTag (see below). CVSSymbol
inherits from CVSItem.
CVSBranch -- the creation of a particular Branch on a particular
file. A CVSBranch has a Symbol instance telling the Symbol
associated with the branch, and also records the
LineOfDevelopment from which the branch was created. In the
SVN repository, a CVSBranch corresponds to an "svn copy" of a
file to a subdirectory of the project's "branches" directory.
CVSBranch inherits from CVSSymbol.
CVSTag -- the creation of a particular Tag on a particular file.
A CVSTag has a Symbol instance telling the Symbol associated
with the tag, and also records the LineOfDevelopment from
which the tag was created. In the SVN repository, a CVSTag
corresponds to an "svn copy" of a file to a subdirectory of
the project's "tags" directory. CVSTag inherits from
Collect all information about CVS tags and branches from the CVS
For each project, create a Trunk object to represent the trunk line of
development for that project. The Trunk object for one Project is
distinct from the Trunk objects for other Projects. For each symbol
name seen in each project, create a Symbol object. The Symbol object
contains its id, project, and name.
The very first thing that is done when a symbol is read is that the
Project's symbol transform rules are given a chance to transform the
symbol name (or even cause it to be discarded). The result of the
transformation is used as the symbol name in the rest of the program.
Because this transformation process is so low-level, it is capable of
making a more fundamental kind of change than the symbol strategy
rules that come later:
* Symbols can be renamed.
* Symbols can be fully discarded, as if they never appeared in the
CVS repository. This can even be done for a malformed symbol or
for a branch symbol that refers to the same branch as another
branch symbol (which would otherwise be a fatal error).
* Two distinct symbols in different files within the same project
can be transformed to the same name, in which case they are
treated as a single symbol.
* Two distinct symbols within a single file can be transformed to
the same name, provided they refer to the same revision number.
This effectively discards one of the symbols.
* Two symbols with the same name in different files can be given
distinct names, in which case they are treated as completely
For each Symbol object, collect the following statistics:
* In how many files was the symbol used as a branch and in how many
was it used as a tag.
* In how many files was there a commit on a branch with that name.
* Which other symbols branched off of a branch with that name.
* In how many files could each other line of development have served
as the source of this symbol. These are called the "possible
parents" of the symbol.
These statistics are used in CollateSymbolsPass to determine which
symbols can be excluded or converted from tags to branches or vice
The possible parents information is important because CVS is ambiguous
about what line of development was the source of a branch. A branch
numbered 1.3.6 might have been created from trunk (revision 1.3), from
branch 1.3.2, or from branch 1.3.4; it is simply impossible to tell
based on the information in a single RCS file.
[Actually, the situation is even more confusing. If a branch tag is
deleted from CVS, the branch number is recycled. So it is even
possible that branch 1.3.6 was created from branch 1.3.8 or 1.3.10 or
... We address this confusion by noting the order that the branches
were listed in the RCS file header. It appears that CVS lists
branches in the header in reverse chronological order of creation.]
For each tag seen within each file, create a CVSTag object recording
its id, CVSFile, Symbol, and the id of the CVSRevision being tagged.
For each branch seen within each file, create a CVSBranch object
recording an id, CVSFile, Symbol, the branch number (e.g., '1.4.2'),
the id of the CVSRevision from which the branch sprouts, and the id of
the first CVSRevision on the branch (if any).
For each revision seen within each file, create a CVSRevision object
recording (among other things) and id, the line of development (trunk
or branch) on which the revision appeared, a list of ids of CVSTags
tagging the revision, and a list of ids of CVSBranches sprouting from
This pass also adjusts the CVS dependency tree to work around some CVS
quirks. (See design-notes.txt for the details.) These adjustments
can result in CVSBranches being deleted, for example, if a file was
added on a branch. In such a case, any CVSRevisions that were
previously on the branch will be created by adding the file to the
branch directory, rather than copying the file from the source
directory to the branch directory.
Allow the project's symbol strategy rules to affect how symbols are
* A symbol can be excluded from the conversion output (as long as
there are no other non-excluded symbols that depend on it). In
this case, the Symbol will be converted into an ExcludedSymbol
* A tag symbol can be converted as a branch. In this case, the
Symbol will be converted into a Branch instance.
* A branch symbol can be converted as a tag, provided there were
never any commits on the branch. In this case, the Symbol will be
converted into a Tag instance.
* The SVN path where a symbol will be placed is determined.
Typically, symbols are laid out in the standard
trunk/branches/tags Subversion repository layout, but strategy
rules can in fact place symbols arbitrarily.
* The preferred parent of each symbol is determined. The preferred
parent of a Symbol is chosen to be the line of development that
appeared as a possible parent of this symbol in the most CVSFiles.
This pass creates the symbol database, SYMBOL_DB, which is accessed in
later passes via the SymbolDatabase class. The SymbolDatabase
contains TypedSymbol (Branch, Tag, or ExcludedSymbol) instances
indicating how each symbol should be processed in the conversion. The
ID used for a TypedSymbol is the same as the ID allocated to the
corresponding symbol in CollectRevsPass, so references in CVSItems do
not have to be updated.
Iterate through all of the CVSItems, mutating CVSTags to CVSBranches
and vice versa and excluding other CVSSymbols as specified by the
types of the TypedSymbols in the SymbolDatabase. Additionally, filter
out any CVSRevisions that reside on excluded CVSBranches.
Write a line of text to CVS_SYMBOLS_DATAFILE for each surviving
CVSSymbol, containing its Symbol id and a pickled version of the
CVSSymbol. (This file will be sorted in SortSymbolsPass then used in
InitializeChangesetsPass to create SymbolChangesets.)
Also adjust the file's dependency tree by grafting CVSSymbols onto
their preferred parents. This is not always possible; if not, leave
the CVSSymbol where it was.
Finally, record symbol "openings" and "closings". A CVSSymbol is
considered "opened" by the CVSRevision or CVSBranch from which the
CVSSymbol sprouts. A CVSSymbol is considered "closed" by the
CVSRevision that overwrites or deletes the CVSSymbol's opening.
(Every CVSSymbol has an opening, but not all of them have closings;
for example, the opening CVSRevision might still exist at HEAD.)
Record in each CVSRevision and CVSBranch a list of all of the
CVSSymbols that it opens. Record in each CVSRevision a list of all of
the CVSSymbols that it closes (CVSBranches cannot close CVSSymbols).
Sort CVS_SYMBOLS_DATAFILE, creating CVS_SYMBOLS_SORTED_DATAFILE. The
sort groups together symbol items that might be added to the same
Read CVS_SYMBOLS_SORTED_DATAFILE, grouping CVSSymbol items with the
same Symbol id into SymbolChangesets.
Read in the changeset graph consisting only of SymbolChangesets and
break up symbol changesets as necessary to break any cycles that are
Read in the entire changeset graph and break up symbol changesets as
necessary to break any cycles that are found.
Update the conversion statistics with excluded symbols omitted.
Create SVNCommits and assign svn revision numbers to each one. Create
a database (SVN_COMMITS_*) to map svn revision numbers to SVNCommits
and another (CVS_REVS_TO_SVN_REVNUMS) to map each CVSRevision id to
the number of the svn revision containing it.
Also, SymbolingsLogger writes a line to SYMBOL_OPENINGS_CLOSINGS for
each opening or closing for each CVSSymbol, noting in what SVN
revision the opening or closing occurred.
This pass sorts SYMBOL_OPENINGS_CLOSINGS into
SYMBOL_OPENINGS_CLOSINGS_SORTED. This orders the file first by symbol
ID, and second by Subversion revision number, thus grouping all
openings and closings for each symbolic name together.
Iterate through all the lines in SYMBOL_OPENINGS_CLOSINGS_SORTED,
writing out a pickled map to SYMBOL_OFFSETS_DB telling at what offset
in SYMBOL_OPENINGS_CLOSINGS_SORTED the lines corresponding to each
Symbol begin. This will allow us to seek to the various offsets in
the file and sequentially read only the openings and closings that we
The filling of a symbol is triggered when SVNSymbolCommit.commit()
calls SVNRepositoryMirror.fill_symbol(). The SVNSymbolCommit contains
the list of CVSSymbols that have to be copied to a symbol directory in
this revision. However, we still have to do a lot of work to figure
out what SVN revision number to use as the source of these copies, and
also to group file copies together into directory copies when
The SYMBOL_OPENINGS_CLOSINGS_SORTED file lists the opening and closing
SVN revision of each revision that has to be copied to the symbol
directory. We use this information to try to find SVN revision
numbers that can serve as the source for as many files as possible, to
avoid having to pick and choose sources from many SVN revisions.
Furthermore, when a bunch of files in a directory have to be copied at
the same time, it is cheaper to copy the directory as a whole. But if
not *all* of the files within the directory had to be copied, then the
unneeded files have to be deleted again from the copied directory. Or
if some of the files have to be copied from different source SVN
revision numbers, then those files have to be overwritten in the
copied directory with the correct versions.
Finally, it can happen that a single Symbol has to be filled multiple
times (because the initial SymbolChangeset had to be broken up). In
this case, the first fill can copy the source directory to the
destination directory (maybe with fixups), but subsequent copies have
to copy individual files to avoid overwriting content that is already
present in the destination directory.
To figure all of this out, we need to know all of the files that
existed in every previous SVN revision, in every line of development.
This is done using the SVNRepositoryMirror class, which keeps a
skeleton record of the entire SVN history in a database using data
structures similar to those used by SVN itself.