\import{mcx.zmm}

\begin{pud::man}{

   {name}{clm dist}
   {html_title}{The clm dist manual}
   {author}{Stijn van Dongen}
   {section}{1}
   {synstyle}{long}
   {defstyle}{long}

   \man_share
}

\${html}{\"pud::man::maketoc"}

\def{p1}{\bf{P1}}
\def{p2}{\bf{P2}}
\def{D}{\bf{D}}
\def{N}{\bf{N}}
\def{d1}{\bf{d1}}
\def{d2}{\bf{d2}}
\def{sjd}{\bf{sjd}}

\sec{name}{NAME}
\NAME{clm dist}{compute the distance between two or more partitions\
 (clusterings).}

\par{
   The distance that is computed can be any of
   \it{split/join distance}, \it{variance of information},
   or \it{Mirkin metric}.}

\disclaim_clm{dist}

\sec{synopsis}{SYNOPSIS}
\par{
   \clm{dist} [options] <file name> <file name>+}

\par{
   \clm{dist}
      \synoptopt{-mode}{<sj|vi|mk>}{distance type}
      \synoptopt{-o}{fname}{output file}
      \synoptopt{--chain}{only compare consecutive clusterings}
      \synoptopt{--one-to-many}{compare first clustering to all others}
      \synoptopt{--sort}{sort clusterings based on coarseness}
      \synoptopt{--index}{output Rand, adjusted Rand and Jaccard indices}
      \synoptopt{-digits}{k}{output decimals}
      \stdsynopt
      <file name> <file name>+}

\sec{description}{DESCRIPTION}

\par{
   \clm{dist} computes distances between clusterings. It can compute the
   \it{split/join distance} (described below), the \it{variance of information
   measure}, and the \it{Mirkin metric}.  By default it computes the chosen
   distance for all pairs of the clusterings provided.  Clusterings must be in
   the mcl matrix format (cf. \mysib{mcxio}), and are supplied on the command
   line as the names of the files in which they are stored.
   It is possible to compare only consecutive clusterings by using
   the \genopt{--chain} option.
   }
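
\par{
   For instance, with three hypothetical clustering files \v{out.I14},
   \v{out.I20} and \v{out.I30} (say, produced by \mcl at increasing
   inflation), all pairwise split/join distances are obtained with:}

\verbatim{
clm dist out.I14 out.I20 out.I30}

\car{
   Adding \genopt{--chain} would restrict the comparison to the pairs
   (\v{out.I14}, \v{out.I20}) and (\v{out.I20}, \v{out.I30}).}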

\par{
   Currently, \clm{dist} cannot compute different distance types simultaneously.}

\par{
   The output is linewise, each line giving information about
   the distance between a pair of clusterings. A line has the
   following format:}

\verbatim{\:/
d  d1  d2  N  name1  name2  [v50,v75,v90,v95,v99]}

\car{
   where \v{d} is the distance between the two clusterings, \v{d1} is the
   distance from the first clustering to the greatest common subclustering
   (alternatively called GCS, intersection, or meet) of the two clusterings,
   \v{d2} is similarly the distance from the second clustering to the GCS,
   \v{N} is the number of nodes in the set over which the clusterings are
   defined, \v{name1} is the name of the file containing the first clustering,
   \v{name2} is the name of the file containing the second clustering, and
   \v{vXX} is the number of \it{volatile nodes} at stringency factor \v{0.XX}
   (i.e. 0.5 for \v{v50}).  Refer to \clmref{vol} for a definition of
   \it{volatile node}.}
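
\par{
   As an illustration: with the file names \v{out12} and \v{out16} and the
   constituent split/join distances 2096 and 38, all taken from the
   \secref{examples} section below, a line of split/join output might look
   like this (with the bracketed volatility fields omitted):}

\verbatim{
2134  2096  38  9058  out12  out16}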

\sec{options}{OPTIONS}

\begin{itemize}{\mcx_itemopts}

\item{\defopt{-mode}{<sj|vi|mk>}{distance type}}
\car{
   Use \bf{sj} for the \it{split/join distance} (described below), \bf{vi} for
   the \it{variance of information measure} and \bf{mk} for the \it{Mirkin metric}.}


\item{\defopt{--chain}{only compare consecutive clusterings}}
\car{
   This option can be used if you know that the clusterings are nested
   clusterings (or approximately so) and ordered from coarse to fine-grained
   or vice versa. An example of this is the set of clusterings resulting
   from applying \mcl with a range of inflation parameters.
}

\item{\defopt{--one-to-many}{compare first clustering to all others}}
\car{
   Use this option for example to compare a gold standard classification
   to a collection of clusterings.
   Bear in mind that sub-clustering and super-clustering are also
   ways for a clustering to be compatible with a gold standard.
   This means that the simple numerical criterion of distance between
   clusterings (by whatever method) is only partially informative.
   For the Mirkin, variance of information and split/join metrics
   it pays to take into account the constituent distances \v{d1}
   and \v{d2} (see above). Assuming that the first clustering
   given as argument represents a gold standard, a small value
   for \v{d1} implies that the second clustering is (nearly) a superclustering,
   and similarly a small value for \v{d2} implies that it is (nearly)
   a subclustering.
}


\item{\defopt{--sort}{sort clusterings based on coarseness}}
\car{
   This option can be useful in conjunction with the \genopt{--chain}
   option, in case the list of clusterings supplied is not necessarily
   ordered by granularity.
}

\item{\defopt{--index}{output Rand, adjusted Rand and Jaccard indices}}
\car{
   The Rand, adjusted Rand and Jaccard indices are printed in addition to
   the regular output. An example invocation is given below, after this
   option list.
   }

\item{\defopt{-o}{fname}{output file}}
\car{
   The name of the file to write output to.}

\item{\defopt{-digits}{k}{output decimals}}
\car{
   The number of decimals printed when using the variance of information measure.}

\stddefopt

\end{itemize}
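
\par{
   The following sketch combines a few of these options. All file names are
   hypothetical; \v{gold.cls} stands for a reference classification and
   \v{cl1}, \v{cl2}, \v{cl3} for clusterings to be evaluated.}

\verbatim{
# compare a gold standard to each of three clusterings, additionally
# printing the Rand, adjusted Rand and Jaccard indices.
clm dist --one-to-many --index gold.cls cl1 cl2 cl3

# variance of information for all pairs, four decimals,
# written to the file vi.txt.
clm dist -mode vi -digits 4 -o vi.txt cl1 cl2 cl3}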

\sec{}{SPLIT/JOIN DISTANCE}
\par{
   For each pair of clusterings \bf{C1}, \bf{C2}, two numbers are given,
   say \bf{d1} and \bf{d2}. Then \bf{d1} + \bf{d2} equals the number
   of nodes that have to be exchanged in order to transform either of the two
   clusterings into the other, and you can think of (\bf{d1}+\bf{d2})/\bf{2N}
   as the fraction by which the two clusterings differ. The split/join
   distance has a linearity property with respect to the meet of \bf{C1} and
   \bf{C2}, see below.}

\par{
   The split/join distance \sjd is very handy in computing the consistency of
   two or more clusterings of the same graph, or comparing clusterings made
   with different resource (but otherwise identical) parameters. The latter is
   for finding out whether you can settle for cheaper mcl settings, or whether
   you need to switch to more expensive settings. The former is for finding out
   whether clusterings are identical, conflicting, or whether one is (almost) a
   subclustering of the other - mostly for comparing a set of clusterings of
   different granularity, made by letting the mcl parameter \genopt{-I} vary.
   The \secref{examples} section contains examples of all these \clm{dist} uses,
   and the use of \clm{info} and \clm{meet} is also discussed there.}

\par{
   \sjd is a metric distance on the space of partitions of
   a set of a given fixed cardinality. It has the following linearity
   property. Let \p1 and \p2 be partitions, then}

\par{
   \sjd(\p1, \p2) = \sjd(\p1, \D) + \sjd(\p2, \D)}

\par{
   where \D (for Dutch Doorsnede)
   is the intersection of \p1 and \p2, i.e. the unique clustering
   that is a subclustering of both \p1 and \p2 \it{and} a superclustering of
   every other common subclustering of \p1 and \p2. Sloppily worded, \D is the
   largest common subclustering of \p1 and \p2.  See the \secref{references}
   section for a pointer to the technical report in which \sjd was first
   defined (and in which the non-trivial triangle inequality is proven).}
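
\par{
   A small worked example, with bracketed blocks denoting clusters. Take
   \N=5 and the two partitions below. Node 3 is split off on the way to
   the meet and joined again on the way back up, so it is counted once in
   each constituent:}

\verbatim{
P1 = [1,2,3] [4,5]       P2 = [1,2] [3,4,5]
D  = [1,2] [3] [4,5]     (the meet of P1 and P2)
sjd(P1, D)  = 1          (split node 3 off from [1,2,3])
sjd(P2, D)  = 1          (join node 3 into [4,5])
sjd(P1, P2) = 1 + 1 = 2}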

\par{
   Because it is useful to know whether one partition (or clustering)
   is almost a subclustering of the other, \clm{dist} returns the
   two constituents \sjd(\p1,\D) and \sjd(\p2,\D).}

\par{
   Let \p1 and \p2 be two clusterings of a graph of cardinality \N,
   and suppose \clm{dist} returns the integers \d1 and \d2.  You can think of
   \bf{100 * (d1 + d2) / N} as the percentage by which \p1 and \p2 differ.

   This interpretation is in fact slightly conservative.
   The quantity \bf{d1 + d2} is the number of nodes that need to be exchanged
   in order to transform one clustering into the other. This number may grow
   as large as \bf{2*N - 2*sqrt(N)}, so relative to the worst case it would
   be justified to take 50 as a scaling factor rather than 100.}

\par{
   For example, if \bf{A} and \bf{B} are both clusterings of a graph
   on a set of 9058 nodes and \clm{dist} returns [38, 2096], this conveys
   that \bf{A} is almost a subclustering of \bf{B} (by splitting 38 nodes
   in \bf{A} we obtain a clustering \bf{D} that is a subclustering of \bf{B}),
   and that \bf{B} is much less granular than \bf{A}. The latter is
   because we can obtain \bf{B} from \bf{D} by \it{joining} 2096 nodes
   in some way.}
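
\par{
   In terms of the percentage discussed above, these constituent distances
   amount to the following overall difference:}

\verbatim{
d1 + d2 = 38 + 2096 = 2134
100 * 2134 / 9058 = 23.6 (percent, rounded)}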


\sec{examples}{EXAMPLES}
   
\: todo: different resource levels on huge for -I 1.4
\: comparing different clusterings of huge.
\: comparing the latter with \clm{meet}.

\par{
   The following is an example of several mcl validation tools
   applied to a set of clusterings on a protein graph of 9058 nodes.
   In the first experiment, seven
   different clusterings were generated for different values of the inflation
   parameter, which was respectively set to 1.2, 1.6, 2.0, 2.4, 2.8, 3.2,
   and 3.6.
   It should be noted that protein graphs seem somewhat special in that an
   inflation parameter setting as low as 1.2 still produces a very acceptable
   clustering. These clusterings are scrutinized using \clm{dist},
   \clm{info}, and \clm{meet}.
   In the second experiment, four different clusterings were generated
   with identical flow (i.e. inflation) parameter, but
   with different resource parameters. \clm{dist} is used to choose
   a sufficient resource level.}

\par{
   High \genopt{-P/-S/-R} values make \mcl more accurate, but also more
   expensive in time and memory.  Run \mcl with different settings for these
   parameters, holding other parameters fixed. If the expensive and supposedly
   more accurate clusterings are very similar to the clusterings resulting from
   cheaper settings, the cheaper setting is sufficient.  If the distances
   between cheaper clusterings and more expensive clusterings are large, this
   is an indication that you need the expensive settings. In that case, you may
   want to increase the \genopt{-P/-S/-R} parameters (or simply the
   \genopt{-scheme} parameter) until associated
   clusterings at nearby resource levels are very similar. A sketch of this
   procedure is given below.}
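
\par{
   A minimal sketch of this procedure, assuming a hypothetical input graph
   \v{graph.mci}:}

\verbatim{
# same graph and inflation, increasingly expensive preset schemes.
mcl graph.mci -I 1.4 -scheme 1 -o out.s1
mcl graph.mci -I 1.4 -scheme 2 -o out.s2
mcl graph.mci -I 1.4 -scheme 3 -o out.s3
mcl graph.mci -I 1.4 -scheme 4 -o out.s4

# compare the clusterings at consecutive resource levels.
clm dist --chain out.s1 out.s2 out.s3 out.s4}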

\par{
   In this particular example, the validation tools do not single out one
   clustering as 'best', because all clusterings
   seem at least acceptable.  They do aid however in showing the relative
   merits of each clustering.  The most important issue in this respect is
   cluster granularity. The table below shows the output of \clm{info}
   for the six clusterings up to inflation 3.2.}

\verbatim{
     Efficiency  Mass frac  Area frac  Cl weight  Mx link weight
1.2   0.42364     0.98690    0.02616    52.06002    50.82800
1.6   0.58297     0.95441    0.01353    55.40282    50.82800
2.0   0.63279     0.92386    0.01171    58.09409    50.82800
2.4   0.65532     0.90702    0.01091    59.58283    50.82800
2.8   0.66854     0.84954    0.00940    63.19183    50.82800
3.2   0.67674     0.82275    0.00845    66.10831    50.82800}

\car{
   This data shows that there is exceptionally strong cluster structure present
   in the input graph. The 1.2 clustering captures almost all edge mass using
   only 2.6 percent of 'area'. The 3.2 clustering still captures 82 percent of
   the mass using less than 1 percent of area.  We continue by looking at the
   mutual consistency of the clusterings. Below is a table that shows all
   pairwise distances between the clusterings.}

\verbatim{
    |   1.6  |   2.0  |   2.4  |   2.8  |   3.2  |   3.6
-----------------------------------------------------------.
1.2 |2096,38 |2728,41 |3045,48 |3404,45 |3621,43 |3800, 42 |
-----------------------------------------------------------|
1.6 |        | 797,72 |1204,76 |1638,78 |1919,70 |2167, 69 |
-----------------------------------------------------------|
2.0 |        |        | 477,68 | 936,78 |1235,85 |1504, 88 |
-----------------------------------------------------------|
2.4 |        |        |        | 498,64 | 836,91 |1124,103 |
-----------------------------------------------------------|
2.8 |        |        |        |        | 384,95 | 688,119 |
-----------------------------------------------------------|
3.2 |        |        |        |        |        | 350,110 |
-----------------------------------------------------------.
}

\par{
   The table shows that the different clusterings are pretty consistent with
   each other, because for two different clusterings it is generally true that
   one is almost a subclustering of the other. The interpretation of the
   distance between the 1.2 and the 3.2 clustering, for example, is that by
   rearranging 43 nodes in the 3.2 clustering, we obtain a subclustering of the
   1.2 clustering. The table shows that for any pair of clusterings, at most
   119 nodes need to be rearranged in order to make one a subclustering of
   the other.}

\par{
   The overall consistency becomes all the more clear by looking at the meet of
   the six clusterings up to inflation 3.2:}

\verbatim{
clm meet -o meet out12 out16 out20 out24 out28 out32
clm dist meet out12 out16 out20 out24 out28 out32}

\car{
   results in the following distances between the respective clusterings
   and their meet.}

\verbatim{
     |   1.2  |   1.6  |   2.0  |   2.4  |   2.8  |   3.2  |
------------------------------------------------------------.
meet | 0,3663 | 0,1972 | 0,1321 | 0,958  | 0,559  | 0,283  |
------------------------------------------------------------.}

\car{
   This shows that by rearranging only 283 nodes in the 3.2 clustering,
   one obtains a subclustering of each of the five other clusterings.}

\par{
   In the last experiment, \mcl was run with inflation parameter 1.4,
   for each of the four different preset pruning schemes \v{k=1,2,3,4}.
   The \clm{dist} distances between the different clusterings
   are shown below.}

\verbatim{
    |  k=2   |   k=3  |   k=4  |
-------------------------------.
k=1 |  17,17 |  16,16 |  16,16 |
-------------------------------.
k=2 |        |   3,3  |   5,5  |
-------------------------------.
k=3 |        |        |   4,4  |
-------------------------------.}

\car{
   This example is a little boring in that the cheapest scheme seems adequate.
   If anything, the gaps between the \v{k=1} scheme and the rest are a little
   larger than the three gaps between the \v{k=2}, \v{k=3}, and \v{k=4}
   clusterings. Had all distances been much larger, then such an observation
   would be reason to choose the \v{k=2} setting.}

\par{
   Note that you need not feel uncomfortable with the clusterings
   still being different at high resource levels, if ever so slightly.
   In all likelihood, there are in any case nodes which are not in any core of
   attraction, and that are on the boundary between two or more clusters.
   They may go one way or another, and these are the nodes which
   will go different ways even at high resource levels.
   Such nodes may be stable in clusterings obtained for lower inflation
   values (i.e. coarser clusterings), in which the different clusters
   to which they are attracted are merged.}

\sec{author}{AUTHOR}
\par{
   Stijn van Dongen.}


\sec{seealso}{SEE ALSO}
\par{
   \mysib{mclfamily} for an overview of all the documentation
   and the utilities in the mcl family.}

\sec{references}{REFERENCES}

\par{
   Stijn van Dongen. \it{Performance criteria for graph clustering and Markov
   cluster experiments}.  Technical Report INS-R0012, National Research
   Institute for Mathematics and Computer Science in the Netherlands,
   Amsterdam, May 2000.\|
   \httpref{http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z}}

\par{
   Marina Meila. \it{Comparing Clusterings \- An Axiomatic View}.
   In \it{Proceedings of the 22nd International Conference on Machine Learning},
   Bonn, Germany, 2005.}

\par{
   Marina Meila. \it{Comparing Clusterings},
   UW Statistics Technical Report 418.\|
   \httpref{http://www.stat.washington.edu/www/research/reports/2002/tr418.ps}}

\end{pud::man}