
--------------------------------------------------------------------------------
dnadiff is a wrapper for nucmer and analysis utilities that provides
detailed information on the differences between two genomes, and also
provides a high level report file that quantifies the differences
between the two inputs.
Use Cases:
+ diff'ing two strains of the same species
+ diff'ing two assemblies of the same organism
+ diff'ing a draft assembly and a closely related finished genome
If any of this code is used in any publication, please cite the following:
Versatile and open software for comparing large genomes.
S. Kurtz, A. Phillippy, A.L. Delcher,
M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
Genome Biology (2004), 5:R12.
--------------------------------------------------------------------------------
This manual is also available as HTML documentation included in this
distribution, or at:
http://mummer.sourceforge.net
http://mummer.sourceforge.net/manual
http://mummer.sourceforge.net/examples
-- DESCRIPTION --
dnadiff is a wrapper around nucmer that builds an alignment using
default parameters, and runs many of nucmer's helper scripts to
process the output and report alignment statistics, SNPs, breakpoints,
etc. It is designed for evaluating the sequence and structural
similarity of two highly similar sequence sets, for example two
different assemblies of the same organism or two strains of
the same species.
-- dnadiff EXAMPLE --
To compare two strains of the same species, type:
"dnadiff genome1.fna genome2.fna"
Output will be...
out.report - Summary of alignments, differences and SNPs
out.delta - Standard nucmer alignment output
out.1delta - 1-to-1 alignment from delta-filter -1
out.mdelta - M-to-M alignment from delta-filter -m
out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
out.snps - SNPs from show-snps -rlTHC .1delta
out.rdiff - Classified ref breakpoints from show-diff -rH .mdelta
out.qdiff - Classified qry breakpoints from show-diff -qH .mdelta
out.unref - Unaligned reference IDs and lengths (if applicable)
out.unqry - Unaligned query IDs and lengths (if applicable)
For more information on the formats and meanings of all the files
produced, please see the documentation for the corresponding
utility. This document serves to describe running the dnadiff script
and interpreting the produced .report file.
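As a convenience, the file list above can be turned into a small
check that a run left behind everything it should have. The following
is a minimal Python sketch, not part of MUMmer: the function names
expected_outputs and missing_outputs are hypothetical, and the only
facts taken from this manual are the suffixes and the default "out"
prefix. The .unref and .unqry files are treated as optional because
they are listed as "(if applicable)".

```python
from pathlib import Path
from typing import Dict, List

# Suffixes copied from the output list in this manual.
SUFFIXES = ("report", "delta", "1delta", "mdelta", "1coords", "mcoords",
            "snps", "rdiff", "qdiff", "unref", "unqry")

# Listed as "(if applicable)", so their absence is not reported.
OPTIONAL = ("unref", "unqry")


def expected_outputs(prefix: str = "out", directory: str = ".") -> Dict[str, Path]:
    """Map each suffix to the path dnadiff is expected to write."""
    return {s: Path(directory) / f"{prefix}.{s}" for s in SUFFIXES}


def missing_outputs(prefix: str = "out", directory: str = ".") -> List[str]:
    """Return the non-optional suffixes whose files do not exist."""
    paths = expected_outputs(prefix, directory)
    return [s for s in SUFFIXES if s not in OPTIONAL and not paths[s].is_file()]
```

After a run with the default prefix, calling missing_outputs() in the
working directory should return an empty list.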
-- RUNNING 'dnadiff' --
USAGE: dnadiff [options] <Reference> <Query>
or dnadiff [options] -d <Delta File>
DESCRIPTION:
Run comparative analysis of two sequence sets using nucmer and its
associated utilities with recommended parameters. See MUMmer
documentation for a more detailed description of the
output. Produces the following output files:
.delta - Standard nucmer alignment output
.1delta - 1-to-1 alignment from delta-filter -1
.mdelta - M-to-M alignment from delta-filter -m
.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
.snps - SNPs from show-snps -rlTHC .1delta
.rdiff - Classified alignment breakpoints from show-diff -rH .mdelta
.qdiff - Classified alignment breakpoints from show-diff -qH .mdelta
.report - Summary of alignments, differences and SNPs
MANDATORY:
Reference Set the input reference multi-FASTA filename
Query Set the input query multi-FASTA filename
or
Delta File Unfiltered .delta alignment file from nucmer
OPTIONS:
-d|delta Provide precomputed delta file for analysis
-h
--help Display help information and exit
-p|prefix Set the prefix of the output files (default "out")
-V
--version Display the version information and exit
-- NOTES --
The -p option is recommended to avoid overwriting previous
output. A simple naming convention for files A.fna and B.fna is to
set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
avoid using the -d option unless the delta file was already generated
with "nucmer --maxmatch" and has not been filtered.
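The naming convention and the two usage forms can be scripted. The
sketch below only builds argument lists; it does not run anything.
The helper names (default_prefix, dnadiff_command,
dnadiff_delta_command) are hypothetical, and the only options used
are the -p and -d options documented above.

```python
from pathlib import Path
from typing import List, Optional


def default_prefix(reference: str, query: str) -> str:
    """Apply the suggested convention: A.fna and B.fna give "A_B"."""
    return f"{Path(reference).stem}_{Path(query).stem}"


def dnadiff_command(reference: str, query: str,
                    prefix: Optional[str] = None) -> List[str]:
    """Arguments for: dnadiff [options] <Reference> <Query>."""
    prefix = prefix or default_prefix(reference, query)
    return ["dnadiff", "-p", prefix, reference, query]


def dnadiff_delta_command(delta_file: str, prefix: str) -> List[str]:
    """Arguments for: dnadiff [options] -d <Delta File>.

    Per the note above, the delta file should be unfiltered output of
    "nucmer --maxmatch"; this helper cannot verify that.
    """
    return ["dnadiff", "-p", prefix, "-d", delta_file]
```

The resulting list can be handed to subprocess.run(..., check=True)
on a machine where dnadiff is on the PATH.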
-- OUTPUT FILES --
dnadiff produces many outputs, but all except one are produced by
other utilities in the MUMmer package. Please see their corresponding
documentation for more information. This section will only describe
the .report file generated by dnadiff and tips on interpreting it.
*** .report OUTPUT ***
Report statistics are broken into two columns - reference and
query. Rows are grouped by themed alignment metrics and are described
here. Summary counts are estimates and do not represent the exact
number of occurrences of a particular evolutionary event. Read each
value in the reference column as the number of XYZ in the reference
with regard to the query, and each value in the query column as the
number of XYZ in the query with regard to the reference.
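For scripted use, the two-column layout can be read into a list of
rows. The sketch below is not part of MUMmer and rests on an assumed
file layout: a section header is a line holding only a bracketed name
such as "[Bases]", and a data row is a metric name followed by exactly
two whitespace-separated values (reference, then query). Check this
against a real .report file before relying on it. Rows are kept in a
list rather than a dictionary because metric names such as TotalLength
repeat within a section. The SAMPLE text is made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class ReportRow(NamedTuple):
    section: str  # e.g. "Bases"
    name: str     # e.g. "TotalBases"
    ref: str      # raw reference-column value
    qry: str      # raw query-column value


# A line consisting solely of one bracketed word (assumed header form).
SECTION_RE = re.compile(r"^\[([^\]\s]+)\]$")


def parse_report(text: str) -> List[ReportRow]:
    """Collect (section, name, ref, qry) rows from report text."""
    rows: List[ReportRow] = []
    section: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        match = SECTION_RE.match(stripped)
        if match:
            section = match.group(1)
            continue
        fields = stripped.split()
        # Values stay as strings so any annotation they carry survives.
        if section is not None and len(fields) == 3:
            rows.append(ReportRow(section, *fields))
    return rows


# Made-up input, used only to exercise the parser.
SAMPLE = """\
                     [REF]      [QRY]
[Sequences]
TotalSeqs                2          3
AlignedSeqs              2          2

[Bases]
TotalBases            1000       1200
"""
```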
[Sequences] - Sequence-centric stats.
TotalSeqs - Total number of input sequences.
AlignedSeqs - Number of input sequences with at least one alignment.
UnalignedSeqs - Number of input sequences with no alignment.
[Bases] - Base-pair-centric stats.
TotalBases - Total number of bases in the input sequences.
AlignedBases - Total number of bases contained within an alignment.
UnalignedBases - Total number of unaligned bases. This is a rough
measure for the amount of "unique" sequence in the
reference and query.
[Alignments] - Alignment-centric stats.
1-to-1 - Number of alignment blocks comprising the 1-to-1
mapping of reference to query. This is a subset of
the M-to-M mapping, with repeats removed.
TotalLength - Total length of 1-to-1 alignment blocks.
AvgLength - Average length of 1-to-1 alignment blocks.
AvgIdentity - Average identity of 1-to-1 alignment blocks.
M-to-M - Number of alignment blocks comprising the
many-to-many mapping of reference to query. The
M-to-M mapping represents the smallest set of
alignments that maximize the coverage of both
reference and query. This is a superset of the 1-to-1
mapping.
TotalLength - Total length of M-to-M alignment blocks.
AvgLength - Average length of M-to-M alignment blocks.
AvgIdentity - Average identity of M-to-M alignment blocks.
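The count, TotalLength and AvgLength rows are related by simple
arithmetic, sketched below in Python. This is an illustration, not
dnadiff's own code. It assumes block coordinates are 1-based and
inclusive, with the end coordinate smaller than the start for
reverse-strand blocks; the names block_length and summarize_blocks are
hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class BlockStats(NamedTuple):
    count: int
    total_length: int
    avg_length: float


def block_length(start: int, end: int) -> int:
    """Length of one block, assuming 1-based inclusive coordinates."""
    return abs(end - start) + 1


def summarize_blocks(blocks: Iterable[Tuple[int, int]]) -> BlockStats:
    """Count the blocks and compute their total and average length."""
    lengths = [block_length(start, end) for start, end in blocks]
    count = len(lengths)
    total = sum(lengths)
    return BlockStats(count, total, total / count if count else 0.0)
```

Because the 1-to-1 mapping is a subset of the M-to-M mapping, the
1-to-1 count and TotalLength should never exceed the M-to-M values
for the same column.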
[Features] - Structural alignment features, such as
rearrangements. These counts are rough estimates
based on an automated analysis of the
alignments. Features are identified by scanning the
reference (or query) from low to high, and noting the
positions where the query alignments are
inconsistently ordered or oriented with respect to
the reference.
Breakpoints - Number of non-maximal alignment endpoints,
i.e. endpoints that do not occur at the beginning or
end of a sequence.
Relocations - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are in the same sequence, but
not consistently ordered. A separate feature is
recorded for each end of a relocation, so this is
really a count of relocation endpoints.
Translocations - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are in different sequences. A
separate feature is recorded for each end of a
translocation, so this is really a count of
translocation endpoints.
Inversions - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are inverted with respect to
one another. A separate feature is recorded for each
end of an inversion, so this is really a count of
inversion endpoints.
Insertions - Rough count of insertion events. Note that this is
slightly different from "UnalignedBases" because it
counts duplications as insertions, whereas
UnalignedBases does not. Also, this count does not
include sequences that have no alignments as
insertions, whereas UnalignedBases does. Note that
insertions in R can be viewed as deletions from Q.
InsertionSum - Rough sum of inserted sequence.
InsertionAvg - Average length of insertion.
TandemIns - Rough count of tandem duplication insertion
events. Note that expansions in R can be viewed as
collapses in Q.
TandemInsSum - Rough sum of tandem duplication insertions.
TandemInsAvg - Average length of tandem duplications.
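The definitions of Relocations, Translocations and Inversions can be
illustrated with a toy model. The Python sketch below is NOT the
algorithm used by show-diff or dnadiff; it is a simplified reading of
the definitions above, for a single reference sequence. The Block
fields and function names are hypothetical. It walks the 1-to-1 blocks
in reference order and labels each adjacent pair, which also shows why
a single inverted segment contributes two inversion endpoints.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Block(NamedTuple):
    ref_start: int  # position on the reference, used for ordering
    qry_id: str     # query sequence the block aligns to
    qry_start: int  # lowest query coordinate of the block
    strand: int     # +1 forward, -1 reverse


def classify_adjacency(a: Block, b: Block) -> Optional[str]:
    """Label the break between two blocks adjacent on the reference."""
    if a.qry_id != b.qry_id:
        return "translocation"  # adjacent blocks in different sequences
    if a.strand != b.strand:
        return "inversion"      # inverted with respect to one another
    # Same sequence and orientation: query positions should rise on the
    # forward strand and fall on the reverse strand.
    consistent = (b.qry_start - a.qry_start) * a.strand > 0
    return None if consistent else "relocation"


def count_breaks(blocks: Iterable[Block]) -> Counter:
    """Count break labels over all adjacent pairs in reference order."""
    ordered = sorted(blocks, key=lambda blk: blk.ref_start)
    counts: Counter = Counter()
    for a, b in zip(ordered, ordered[1:]):
        label = classify_adjacency(a, b)
        if label is not None:
            counts[label] += 1
    return counts
```

For three blocks on one query sequence with strands +1, -1, +1, both
adjacencies are labelled as inversions, so one inverted segment is
counted twice, once per endpoint.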
[SNPs] - Single Nucleotide Polymorphism counts.
TotalSNPs - Total number of SNPs, same for both sequences.
AC - A-to-C SNP. For reference, this means reference 'A'
to query 'C'. For query, this means query 'A' to
reference 'C'. The same convention applies below.
AG - A-to-G SNP.
AT - A-to-T SNP.
CA - C-to-A SNP.
CG - C-to-G SNP.
CT - C-to-T SNP.
TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact
base-pair matches on both sides.
AC - A-to-C SNP.
AG - A-to-G SNP.
AT - A-to-T SNP.
CA - C-to-A SNP.
CG - C-to-G SNP.
CT - C-to-T SNP.
TotalIndels - Single Nucleotide Insertions/Deletions.
A. - A InDel. For reference, 'A' insertion in
reference. For query, 'A' insertion in query. The
same convention applies below.
C. - C InDel.
G. - G InDel.
T. - T InDel.
TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20
exact base-pair matches on both sides.
A. - A InDel.
C. - C InDel.
G. - G InDel.
T. - T InDel.
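The column convention for SNP and InDel rows can be sketched in
Python as follows. This is an illustration of the convention described
above, not dnadiff's implementation. The Call record, its flank field
(the number of exact matches on each side of the difference) and the
use of "." for the absent base of an InDel are assumptions made for
the example.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple

GAP = "."  # assumed marker for the absent base of a single-base InDel


class Call(NamedTuple):
    ref_base: str  # base in the reference, or GAP
    qry_base: str  # base in the query, or GAP
    flank: int     # exact matches on each side (hypothetical field)


def tally(calls: Iterable[Call]) -> Tuple[Counter, Counter]:
    """Return (reference column, query column) counts keyed like the report."""
    ref: Counter = Counter()
    qry: Counter = Counter()
    for call in calls:
        r, q = call.ref_base.upper(), call.qry_base.upper()
        if r == GAP:
            qry[q + GAP] += 1  # extra base in the query: query "X." row
        elif q == GAP:
            ref[r + GAP] += 1  # extra base in the reference: reference "X." row
        else:
            ref[r + q] += 1    # reference column: reference base to query base
            qry[q + r] += 1    # query column: query base to reference base
    return ref, qry


def well_supported(calls: Iterable[Call], min_flank: int = 20) -> List[Call]:
    """Keep calls bounded by at least min_flank exact matches on both sides."""
    return [call for call in calls if call.flank >= min_flank]
```

Applying tally to the result of well_supported gives the breakdown
for the TotalGSNPs and TotalGIndels groups under these assumptions.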