--------------------------------------------------------------------------------
dnadiff is a wrapper for nucmer and analysis utilities that provides
detailed information on the differences between two genomes, and also
provides a high level report file that quantifies the differences
between the two inputs.
Use Cases:
+ diff'ing two strains of the same species
+ diff'ing two assemblies of the same organism
+ diff'ing a draft assembly and a closely related finished genome
If any of this code is used in any publication, please cite the following:
Versatile and open software for comparing large genomes.
S. Kurtz, A. Phillippy, A.L. Delcher,
M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
Genome Biology (2004), 5:R12.
--------------------------------------------------------------------------------
This manual is also available as HTML documentation included in this
distribution, or at:
http://mummer.sourceforge.net
http://mummer.sourceforge.net/manual
http://mummer.sourceforge.net/examples
-- DESCRIPTION --
dnadiff is a wrapper around nucmer that builds an alignment using
default parameters, and runs many of nucmer's helper scripts to
process the output and report alignment statistics, SNPs, breakpoints,
etc. It is designed for evaluating the sequence and structural
similarity of two highly similar sequence sets, for example two
different assemblies of the same organism or two strains of
the same species.
-- dnadiff EXAMPLE --
To compare two strains of the same species, type:
"dnadiff genome1.fna genome2.fna"
Output will be...
out.report - Summary of alignments, differences and SNPs
out.delta - Standard nucmer alignment output
out.1delta - 1-to-1 alignment from delta-filter -1
out.mdelta - M-to-M alignment from delta-filter -m
out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
out.snps - SNPs from show-snps -rlTHC .1delta
out.rdiff - Classified ref breakpoints from show-diff -rH .mdelta
out.qdiff - Classified qry breakpoints from show-diff -qH .mdelta
out.unref - Unaligned reference IDs and lengths (if applicable)
out.unqry - Unaligned query IDs and lengths (if applicable)
For more information on the formats and meanings of all the files
produced, please see the documentation for the corresponding
utility. This document serves to describe running the dnadiff script
and interpreting the produced .report file.
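As a quick sanity check after a run, the file list above can be turned
into a small script. This is only a sketch: the helper names
expected_outputs and missing_outputs are hypothetical, and the suffixes
are copied from the list above. The .unref and .unqry files are treated
as optional because they are only written "if applicable".

```python
from pathlib import Path

# Suffixes copied from the output list above.
REQUIRED_SUFFIXES = ["report", "delta", "1delta", "mdelta", "1coords",
                     "mcoords", "snps", "rdiff", "qdiff"]
# Only written when there are unaligned sequences.
OPTIONAL_SUFFIXES = ["unref", "unqry"]


def expected_outputs(prefix="out"):
    """Return the file names dnadiff is documented to write for a prefix."""
    return [f"{prefix}.{suffix}" for suffix in REQUIRED_SUFFIXES]


def missing_outputs(prefix="out", directory="."):
    """Return the documented (non-optional) outputs not found in directory."""
    return [name for name in expected_outputs(prefix)
            if not (Path(directory) / name).exists()]
```

An empty list from missing_outputs("out") suggests the run completed
and wrote every non-optional file.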
-- RUNNING 'dnadiff' --
USAGE: dnadiff [options] <Reference> <Query>
or dnadiff [options] -d <Delta File>
DESCRIPTION:
Run comparative analysis of two sequence sets using nucmer and its
associated utilities with recommended parameters. See MUMmer
documentation for a more detailed description of the
output. Produces the following output files:
.delta - Standard nucmer alignment output
.1delta - 1-to-1 alignment from delta-filter -1
.mdelta - M-to-M alignment from delta-filter -m
.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
.snps - SNPs from show-snps -rlTHC .1delta
.rdiff - Classified alignment breakpoints from show-diff -rH .mdelta
.qdiff - Classified alignment breakpoints from show-diff -qH .mdelta
.report - Summary of alignments, differences and SNPs
MANDATORY:
Reference Set the input reference multi-FASTA filename
Query Set the input query multi-FASTA filename
or
Delta File Unfiltered .delta alignment file from nucmer
OPTIONS:
-d|delta Provide precomputed delta file for analysis
-h
--help Display help information and exit
-p|prefix Set the prefix of the output files (default "out")
-V
--version Display the version information and exit
-- NOTES --
The -p option is recommended to avoid overwriting previous
output. A simple naming convention for files A.fna and B.fna is to
set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
avoid using the -d option unless the delta file was already generated
with "nucmer --maxmatch" and has not been filtered.
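The naming convention above is easy to automate. The following is a
minimal sketch, not part of MUMmer: the function name dnadiff_command is
hypothetical, and it only assembles the argument list using the -p
option described in this manual. It does not run anything by itself.

```python
from pathlib import Path


def dnadiff_command(reference, query):
    """Build a dnadiff argument list with an 'A_B' style output prefix."""
    # Path.stem drops the directory and the final extension: A.fna -> A
    prefix = f"{Path(reference).stem}_{Path(query).stem}"
    return ["dnadiff", "-p", prefix, str(reference), str(query)]
```

Assuming dnadiff is installed and on the PATH, the returned list can be
passed to subprocess.run(cmd, check=True) to start the comparison.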
-- OUTPUT FILES --
dnadiff produces many outputs; however, all but one are produced by
other utilities in the MUMmer package. Please see their corresponding
documentation for more information. This section will only describe
the .report file generated by dnadiff and tips on interpreting it.
*** .report OUTPUT ***
Report statistics are broken into two columns - reference and
query. Rows are grouped by themed alignment metrics and are described
here. Summary counts are estimates and do not represent the exact
number of occurrences of a particular evolutionary event. Read a
reference column value as the number of XYZ in the reference with
regard to the query, and a query column value as the number of XYZ in
the query with regard to the reference.
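For scripted use, the two-column layout can be read with a few lines of
Python. The sketch below makes assumptions that should be checked
against a real .report file: that section names appear alone on a line
in square brackets, and that each statistic is a line of three
whitespace-separated fields (label, reference value, query value). The
function name parse_report is hypothetical and the SAMPLE text is
made-up illustration data, not real dnadiff output.

```python
# Made-up text in the ASSUMED layout; not real dnadiff output.
SAMPLE = """\
                     [REF]      [QRY]
[Sequences]
TotalSeqs                2          3
AlignedSeqs              2          2
UnalignedSeqs            0          1

[Bases]
TotalBases            1000       1200
"""


def parse_report(text):
    """Return {section: [(label, ref_value, qry_value), ...]}."""
    sections = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A single bracketed token starts a new section, e.g. [Bases].
        if len(fields) == 1 and fields[0].startswith("[") and fields[0].endswith("]"):
            current = fields[0][1:-1]
            sections[current] = []
        # Label plus two columns; values are kept as strings untouched.
        elif current is not None and len(fields) == 3:
            sections[current].append(tuple(fields))
    return sections
```

Rows are kept as a list per section rather than a dictionary because
labels such as TotalLength occur more than once (under both 1-to-1 and
M-to-M).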
[Sequences] - Sequence-centric stats.
TotalSeqs - Total number of input sequences.
AlignedSeqs - Number of input sequences with at least one alignment.
UnalignedSeqs - Number of input sequences with no alignment.
[Bases] - Base-pair-centric stats.
TotalBases - Total number of bases in the input sequences.
AlignedBases - Total number of bases contained within an alignment.
UnalignedBases - Total number of unaligned bases. This is a rough
measure for the amount of "unique" sequence in the
reference and query.
[Alignments] - Alignment-centric stats.
1-to-1 - Number of alignment blocks comprising the 1-to-1
mapping of reference to query. This is a subset of
the M-to-M mapping, with repeats removed.
TotalLength - Total length of 1-to-1 alignment blocks.
AvgLength - Average length of 1-to-1 alignment blocks.
AvgIdentity - Average identity of 1-to-1 alignment blocks.
M-to-M - Number of alignment blocks comprising the
many-to-many mapping of reference to query. The
M-to-M mapping represents the smallest set of
alignments that maximize the coverage of both
reference and query. This is a superset of the 1-to-1
mapping.
TotalLength - Total length of M-to-M alignment blocks.
AvgLength - Average length of M-to-M alignment blocks.
AvgIdentity - Average identity of M-to-M alignment blocks.
[Features] - Structural alignment features, such as
rearrangements. These counts are rough estimates
based on an automated analysis of the
alignments. Features are identified by scanning the
reference (or query) from low to high, and noting the
positions where the query alignments are
inconsistently ordered or oriented with respect to
the reference.
Breakpoints - Number of non-maximal alignment endpoints,
i.e. endpoints that do not occur at the beginning or
end of a sequence.
Relocations - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are in the same sequence, but
not consistently ordered. A separate feature is
recorded for each end of a relocation, so this is
really a count of relocation endpoints.
Translocations - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are in different sequences. A
separate feature is recorded for each end of a
translocation, so this is really a count of
translocation endpoints.
Inversions - Number of breaks in the alignment where adjacent
1-to-1 alignment blocks are inverted with respect to
one another. A separate feature is recorded for each
end of an inversion, so this is really a count of
inversion endpoints.
Insertions - Rough count of insertion events. Note that this is
slightly different from "UnalignedBases" because it
counts duplications as insertions, whereas
UnalignedBases does not. Also, this count does not
include sequences that have no alignments as
insertions, whereas UnalignedBases does. Note that
insertions in R can be viewed as deletions from Q.
InsertionSum - Rough sum of inserted sequence.
InsertionAvg - Average length of insertion.
TandemIns - Rough count of tandem duplication insertion
events. Note that expansions in R can be viewed as
collapses in Q.
TandemInsSum - Rough sum of tandem duplication insertions.
TandemInsAvg - Average length of tandem duplications.
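Because relocations, translocations and inversions are counted per
endpoint, a reader may want a rough event estimate. The sketch below
assumes two endpoints per event, which is an interpretation of the text
above rather than something dnadiff reports; it also shows the average
as sum divided by count, as the InsertionAvg and TandemInsAvg
definitions imply. Both function names are hypothetical.

```python
def approx_events(endpoint_count):
    """Rough event estimate, ASSUMING two recorded endpoints per event."""
    return endpoint_count / 2


def average_length(total_length, count):
    """Average length as sum / count; 0.0 when there are no events."""
    return total_length / count if count else 0.0
```

For instance, an Inversions value of 6 would correspond to about 3
inversion events under this assumption. Since the counts themselves are
rough estimates, the result should be treated the same way.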
[SNPs] - Single Nucleotide Polymorphism counts.
TotalSNPs - Total number of SNPs, same for both sequences.
AC - A-to-C SNP. For reference, this means reference 'A'
to query 'C'. For query, this means query 'A' to
reference 'C'. The same convention applies below.
AG - A-to-G SNP.
AT - A-to-T SNP.
CA - C-to-A SNP.
CG - C-to-G SNP.
CT - C-to-T SNP.
TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact
base-pair matches on both sides.
AC - A-to-C SNP.
AG - A-to-G SNP.
AT - A-to-T SNP.
CA - C-to-A SNP.
CG - C-to-G SNP.
CT - C-to-T SNP.
TotalIndels - Single Nucleotide Insertions/Deletions.
A. - A InDel. For reference, 'A' insertion in
reference. For query, 'A' insertion in query. The
same convention applies below.
C. - C InDel.
G. - G InDel.
T. - T InDel.
TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20
exact base-pair matches on both sides.
A. - A InDel.
C. - C InDel.
G. - G InDel.
T. - T InDel.
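The two-column SNP convention (reference 'A' to query 'C' is AC in the
reference column and CA in the query column) can be illustrated with a
small tally. This is a sketch of the convention only, not of how
dnadiff computes its counts; the function name tally_snps and the input
format (a list of reference base, query base pairs) are assumptions
made for the example. Indels are left out.

```python
from collections import Counter


def tally_snps(pairs):
    """Tally substitutions as seen from the reference and the query.

    pairs: iterable of (reference_base, query_base) tuples.
    """
    pairs = list(pairs)
    # Reference column: reference base first, e.g. ('A', 'C') -> 'AC'.
    ref_counts = Counter(ref + qry for ref, qry in pairs)
    # Query column: query base first, e.g. ('A', 'C') -> 'CA'.
    qry_counts = Counter(qry + ref for ref, qry in pairs)
    return ref_counts, qry_counts
```

Under this convention the AC count in one column equals the CA count in
the other, and the totals agree, which matches the note that TotalSNPs
is the same for both sequences.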