# Usage
## summary
|Category |Command|Function |Input |In.sorted |In.flag-consistency|Output |Out.sorted |Out.unique |
|:-------------------|:------|:----------------------------------------------------------------------------|:-----------|:----------------|:------------------|:--------|:-----------|:-----------|
|Counting |count |Generate k-mers (sketch) from FASTA/Q sequences |fastx |/ |/ |.unik |optional |optional |
|Information |info |Information of binary files |.unik |optional |no need |tsv |/ |/ |
| |num |Quickly inspect the number of k-mers in binary files |.unik |optional |no need |tsv |/ |/ |
|Format conversion |view |Read and output binary format to plain text |.unik |optional |required |tsv |/ |/ |
| |dump |Convert plain k-mer text to binary format |tsv |optional |/ |.unik |optional |follow input|
| |encode |Encode plain k-mer texts to integers |tsv |/ |/ |tsv |/ |/ |
| |decode |Decode encoded integers to k-mer texts |tsv |/ |/ |tsv |/ |/ |
|Set operations |concat |Concatenate multiple binary files without removing duplicates |.unik |optional |required |.unik |optional |no |
| |inter |Intersection of k-mers in multiple binary files |.unik |required |required |.unik |yes |yes |
| |common |Find k-mers shared by most of the binary files |.unik |required |required |.unik |yes |yes |
| |union |Union of k-mers in multiple binary files |.unik |optional |required |.unik |optional |yes |
| |diff |Set difference of k-mers in multiple binary files |.unik |1st file required|required |.unik |optional |yes |
|Split and merge |sort |Sort k-mers to reduce the file size and accelerate downstream analysis |.unik |optional |required |.unik |yes |optional |
| |split |Split k-mers into sorted chunk files |.unik |optional |required |.unik |yes |optional |
| |tsplit |Split k-mers according to TaxId |.unik |required |required |.unik |yes |yes |
| |merge |Merge k-mers from sorted chunk files |.unik |required |required |.unik |yes |optional |
|Subset |head |Extract the first N k-mers |.unik |optional |required |.unik |follow input|follow input|
| |sample |Sample k-mers from binary files |.unik |optional |required |.unik |follow input|follow input|
| |grep |Search k-mers from binary files |.unik |optional |required |.unik |follow input|optional |
| |filter |Filter out low-complexity k-mers |.unik |optional |required |.unik |follow input|follow input|
| |rfilter|Filter k-mers by taxonomic rank |.unik |optional |required |.unik |follow input|follow input|
|Searching on genomes|locate |Locate k-mers in genome |.unik, fasta|optional |required |tsv |/ |/ |
| |map |Mapping k-mers back to the genome and extract successive regions/subsequences|.unik, fasta|optional |required |bed/fasta|/ |/ |
## unikmer
```text
unikmer - a versatile toolkit for k-mers with taxonomic information
unikmer is a toolkit for nucleic acid k-mer analysis, providing functions
including set operations on k-mers, optionally with TaxIds but without count
information.
K-mers are either encoded (k<=32) or hashed (k<=64) into 'uint64',
and serialized in binary files with the extension '.unik'.
TaxIds can be assigned when counting k-mers from genome sequences,
and LCA (Lowest Common Ancestor) is computed during set operations
including computing union, intersection, set difference, unique and
repeated k-mers.
Version: v0.20.0
Author: Wei Shen <shenwei356@gmail.com>
Documents : https://bioinf.shenwei.me/unikmer
Source code: https://github.com/shenwei356/unikmer
Dataset (optional):
Manipulating k-mers with TaxIds needs taxonomy files, e.g., from the
NCBI Taxonomy database. Please extract "nodes.dmp", "names.dmp",
"delnodes.dmp" and "merged.dmp" from the link below into ~/.unikmer/ ,
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz ,
or some other directory, which you can later refer to using the flag
--data-dir or the environment variable UNIKMER_DB.
For GTDB, use 'taxonkit create-taxdump' to create NCBI-style
taxonomy dump files, or download from:
https://github.com/shenwei356/gtdb-taxonomy
Note that TaxIds are represented using uint32 and stored in 4 or
fewer bytes; all TaxIds should be in the range of [1, 4294967295].
Usage:
unikmer [command]
Available Commands:
autocompletion Generate shell autocompletion script (bash|zsh|fish|powershell)
common Find k-mers shared by most of the binary files
concat Concatenate multiple binary files without removing duplicates
count Generate k-mers (sketch) from FASTA/Q sequences
decode Decode encoded integer to k-mer text
diff Set difference of k-mers in multiple binary files
dump Convert plain k-mer text to binary format
encode Encode plain k-mer texts to integers
filter Filter out low-complexity k-mers (experimental)
grep Search k-mers from binary files
head Extract the first N k-mers
info Information of binary files
inter Intersection of k-mers in multiple binary files
locate Locate k-mers in genome
map Mapping k-mers back to the genome and extract successive regions/subsequences
merge Merge k-mers from sorted chunk files
num Quickly inspect the number of k-mers in binary files
rfilter Filter k-mers by taxonomic rank
sample Sample k-mers from binary files
sort Sort k-mers to reduce the file size and accelerate downstream analysis
split Split k-mers into sorted chunk files
tsplit Split k-mers according to taxid
union Union of k-mers in multiple binary files
version Print version information and check for update
view Read and output binary format to plain text
Flags:
-c, --compact write compact binary file with little loss of speed
--compression-level int compression level (default -1)
--data-dir string directory containing NCBI Taxonomy files, including nodes.dmp,
names.dmp, merged.dmp and delnodes.dmp (default "/home/shenwei/.unikmer")
-h, --help help for unikmer
-I, --ignore-taxid ignore taxonomy information
-i, --infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
--max-taxid uint32 for smaller TaxIds, we can use less space to store TaxIds. default value
is 1<<32-1, that's enough for NCBI Taxonomy TaxIds (default 4294967295)
-C, --no-compress do not compress binary file (not recommended)
--nocheck-file do not check binary file, when using process substitution or named pipe
-j, --threads int number of CPUs to use (default 4)
--verbose print verbose information
Use "unikmer [command] --help" for more information about a command.
```
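The TaxId note and the `--max-taxid` flag above imply that a smaller maximum TaxId needs fewer bytes per TaxId. The arithmetic can be sketched as below, in Python purely for illustration. The function name `taxid_bytes` is hypothetical, and the real on-disk layout of `.unik` files is not described in this document, so treat this as an assumption about the idea rather than the tool's implementation.

```python
def taxid_bytes(max_taxid: int) -> int:
    """Smallest number of whole bytes able to hold any TaxId in [1, max_taxid].

    Hypothetical helper: it only illustrates why a lower --max-taxid
    can save space; it does not read or write .unik files.
    """
    if not 1 <= max_taxid <= 0xFFFFFFFF:
        raise ValueError("TaxIds must be in the range [1, 4294967295]")
    # Round the bit length up to a whole number of bytes.
    return (max_taxid.bit_length() + 7) // 8


if __name__ == "__main__":
    for m in (255, 65535, 3_000_000, 4294967295):
        print(m, taxid_bytes(m))
```

Under this reading, the default of 4294967295 corresponds to the full 4 bytes of a uint32.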
## count
```text
Generate k-mers (sketch) from FASTA/Q sequences
K-mer:
1. K-mer code (k<=32)
2. Hashed k-mer (ntHash, k<=64)
K-mer sketches:
1. Scaled MinHash
2. Minimizer
3. Closed Syncmer
Usage:
unikmer count [flags] -K -k <k> -u -s [-t <taxid>] <seq files> -o <out prefix>
Flags:
-K, --canonical only keep the canonical k-mers
--circular circular genome
-H, --hash save hash of k-mer, automatically on for k>32. This flag overrides
global flag -c/--compact
-h, --help help for count
-k, --kmer-len int k-mer length
-l, --linear output k-mers in linear order, duplicate k-mers are not removed
-W, --minimizer-w int minimizer window size
-V, --more-verbose print extra verbose information
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-T, --parse-taxid parse taxid from FASTA/Q header
-r, --parse-taxid-regexp string regular expression for parsing taxid
-d, --repeated only count duplicate k-mers, for removing singleton in FASTQ
-D, --scale int scale/down-sample factor (default 1)
-B, --seq-name-filter strings list of regular expressions for filtering out sequences by
header/name, case ignored.
-s, --sort sort k-mers, this significantly reduces file size for k<=25. This
flag overrides global flag -c/--compact
-S, --syncmer-s int closed syncmer length
-t, --taxid uint32 global taxid
-u, --unique only count unique k-mers, which are not duplicate
```
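The `-K/--canonical` flag keeps only canonical k-mers. A minimal sketch of the idea follows, assuming the common definition in which the canonical form is the lexicographically smaller of a k-mer and its reverse complement. The helper names are hypothetical, and skipping windows that contain non-ACGT bases is an assumption of this sketch, not documented behaviour of `unikmer count`.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer: str) -> str:
    """Reverse complement of an upper-case ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int):
    """Yield the canonical form of every k-mer in seq, in linear order.

    Assumption: windows containing bases other than A/C/G/T are skipped.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, revcomp(kmer))


if __name__ == "__main__":
    print(list(canonical_kmers("ACGTT", 3)))
```

Because a k-mer and its reverse complement map to the same canonical form, both strands of a genome contribute the same set of k-mers.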
## info
```text
Information of binary files
Tips:
1. For lots of small files (especially on SSD), use a large value of '-j' to
parallelize counting.
Usage:
unikmer info [flags]
Aliases:
info, stats
Flags:
-a, --all all information, including number of k-mers
-b, --basename only output basename of files
-h, --help help for info
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
-e, --skip-err skip error, only show warning message
--symbol-false string symbol for false (default "✕")
--symbol-true string symbol for true (default "✓")
-T, --tabular output in machine-friendly tabular format
```
## num
```text
Quickly inspect the number of k-mers in binary files
Attention:
1. This command is designed to quickly inspect the number of k-mers in a binary file.
2. For a non-sorted file, it returns '-1' unless flag '-f/--force' is switched on.
Usage:
unikmer num [flags]
Flags:
-b, --basename only output basename of files
-n, --file-name show file name
-f, --force read the whole file and count k-mers
-h, --help help for num
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
```
## view
```text
Read and output binary format to plain text
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Usage:
unikmer view [flags]
Flags:
-a, --fasta output in FASTA format, with encoded integer as FASTA header
-q, --fastq output in FASTQ format, with encoded integer as FASTQ header
-g, --genome strings genomes in (gzipped) fasta file(s) for decoding hashed k-mers
-h, --help help for view
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
-n, --show-code show encoded integer along with k-mer
-N, --show-code-only only show encoded integers, faster than cutting from result of -n/--show-code
-t, --show-taxid show taxid
-T, --show-taxid-only show taxid only
```
## dump
```text
Convert plain k-mer text to binary format
Attentions:
1. Input should be one k-mer per line, or tab-delimited two columns
with a k-mer and its taxid.
2. You can also assign a global taxid with flag -t/--taxid.
Usage:
unikmer dump [flags]
Flags:
-K, --canonical save the canonical k-mers
-O, --canonical-only only save the canonical k-mers. This flag overrides -K/--canonical
-H, --hash save hash of k-mer, automatically on for k>32. This flag overrides global
flag -c/--compact
--hashed giving hash values of k-mers. This flag overrides global flag -c/--compact
-h, --help help for dump
-k, --kmer-len int k-mer length
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-s, --sorted input k-mers are sorted
-t, --taxid uint32 global taxid
-u, --unique remove duplicate k-mers
```
## encode
```text
Encode plain k-mer texts to integers
Usage:
unikmer encode [flags]
Flags:
-a, --all output all data: original k-mer, parsed k-mer, encoded integer, encode bits
-K, --canonical keep the canonical k-mers
-H, --hash save hash of k-mer, automatically on for k>32
-h, --help help for encode
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
```
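To make "encode plain k-mer texts to integers" concrete, here is a minimal sketch of a 2-bit packing scheme. It assumes the widely used mapping A=0, C=1, G=2, T=3 with the first base in the most significant bits; this document does not spell out the exact scheme, so the mapping and the function name `encode_kmer` are assumptions of the example.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an unsigned 64-bit integer, 2 bits per base."""
    if not 0 < len(kmer) <= 32:
        raise ValueError("k must be in the range [1, 32]")
    code = 0
    for base in kmer.upper():
        # Shift previous bases left and append the new base's 2 bits.
        code = (code << 2) | _CODE[base]
    return code


if __name__ == "__main__":
    for kmer in ("A", "ACGT", "T" * 32):
        print(kmer, encode_kmer(kmer))
```

With 2 bits per base, 32 bases fill exactly 64 bits, which is consistent with the k<=32 limit for encoded (non-hashed) k-mers.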
## decode
```text
Decode encoded integers to k-mer texts
Usage:
unikmer decode [flags]
Flags:
-a, --all output all data: encoded integer, decoded k-mer
-h, --help help for decode
-k, --kmer-len int k-mer length
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
```
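Decoding needs `-k/--kmer-len` because an integer alone does not record how many leading bases there were. The sketch below shows this under the assumption of a 2-bit encoding with A=0, C=1, G=2, T=3 (first base in the most significant bits); the mapping and the function name `decode_kmer` are illustrative assumptions, not taken from the tool's source.

```python
def decode_kmer(code: int, k: int) -> str:
    """Unpack a 2-bit encoded integer into a k-mer of length k (k <= 32)."""
    if not 0 < k <= 32:
        raise ValueError("k must be in the range [1, 32]")
    if code < 0 or code >> (2 * k):
        raise ValueError("code does not fit into k bases")
    bases = []
    for _ in range(k):
        bases.append("ACGT"[code & 3])  # lowest 2 bits hold the last base
        code >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    # The same integer decodes differently for different k.
    print(decode_kmer(27, 4), decode_kmer(27, 5))
```

Under this mapping, leading `A` bases are zero bits, so the same code is a valid k-mer for several values of k.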
## concat
```text
Concatenate multiple binary files without removing duplicates
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Usage:
unikmer concat [flags]
Flags:
-h, --help help for concat
-n, --number int number of k-mers (default -1)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-s, --sorted input k-mers are sorted
-t, --taxid uint32 global taxid
```
## inter
```text
Intersection of k-mers in multiple binary files
Attentions:
0. All input files should be sorted, and output file is sorted.
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Taxid information could be inconsistent when using flag --mix-taxid.
Tips:
1. For comparing TWO files with really huge number of k-mers,
you can use 'unikmer sort -u -m 100M' for each file,
and then 'unikmer merge -' from them.
2. Put the smallest file in the beginning to reduce memory usage.
Usage:
unikmer inter [flags]
Flags:
-h, --help help for inter
-m, --mix-taxid allow some of the files to be without taxids
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
```
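The requirement that all inputs to `inter` be sorted is what makes a streaming intersection possible. A minimal two-pointer sketch over sorted integer lists is shown below. It illustrates the general technique only; the function name is hypothetical and nothing here reflects how unikmer reads `.unik` files or handles TaxIds.

```python
def intersect_sorted(a, b):
    """Intersection of two ascending lists of encoded k-mers.

    The output is sorted and free of duplicates, matching the
    'Out.sorted = yes, Out.unique = yes' row of the summary table.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out


if __name__ == "__main__":
    print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))
```

Each input is walked once, so the work is linear in the total number of k-mers.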
## common
```text
Find k-mers shared by most of the binary files
This command is similar to "unikmer inter" but with looser restriction,
k-mers shared by some number/proportion of multiple files are outputted.
Attentions:
0. All input files should be sorted, and output file is sorted.
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Taxid information could be inconsistent when using flag --mix-taxid.
3. At most 65535 input files allowed.
Tips:
1. For comparing TWO files with really huge number of k-mers,
you can use 'unikmer sort -u -m 100M' for each file,
and then 'unikmer merge -' from them.
2. Put the smallest file in the beginning to reduce memory usage.
Usage:
unikmer common [flags]
Flags:
-h, --help help for common
-m, --mix-taxid allow some of the files to be without taxids
-n, --number int minimum number of files that share a k-mer (overrides -p/--proportion)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-p, --proportion float minimum proportion of files that share a k-mer (default 1)
```
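The relationship between `-n/--number` and `-p/--proportion` can be sketched as follows. This is an in-memory illustration with hypothetical names; in particular, rounding the proportion up with `ceil` is an assumption, since the help text does not say how a fractional file count is handled.

```python
import math
from collections import Counter


def common_kmers(files, number=0, proportion=1.0):
    """K-mers present in at least `number` files, or in at least
    `proportion` of the files when `number` is not given.

    Assumption: the proportion is converted to a file count with ceil().
    """
    threshold = number if number > 0 else math.ceil(proportion * len(files))
    counts = Counter()
    for kmers in files:
        counts.update(set(kmers))  # count each k-mer once per file
    return sorted(k for k, c in counts.items() if c >= threshold)


if __name__ == "__main__":
    files = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(common_kmers(files))            # default proportion of 1
    print(common_kmers(files, number=2))  # -n takes precedence over -p
```

With the default proportion of 1 the result equals the plain intersection, which is why `common` is described as a looser form of `inter`.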
## union
```text
Union of k-mers in multiple binary files
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Tips:
1. 'unikmer sort -u' is slightly faster at the cost of more memory usage.
2. For a really huge number of k-mers, you can use 'unikmer sort -m 100M -u'.
3. For a large number of sorted .unik files, you can use 'unikmer merge'.
Usage:
unikmer union [flags]
Flags:
-h, --help help for union
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-s, --sort sort k-mers, this significantly reduces file size for k<=25. This flag
overrides global flag -c/--compact
```
## diff
```text
Set difference of k-mers in multiple binary files
Attentions:
0. The first file should be sorted.
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. By default taxids in the 2nd and later files are ignored.
3. You can switch on flag -t/--compare-taxid, in which case input
files should either ALL have or ALL lack taxid information.
If the same k-mer is found but the query taxid equals the target taxid,
or the query taxid is an ancestor of the target taxid, the k-mer is kept.
Tips:
1. Increase the number of threads (-j/--threads) to accelerate computation
when dealing with lots of files, at the cost of more memory usage.
Usage:
unikmer diff [flags]
Flags:
-t, --compare-taxid take taxid into consideration. type "unikmer diff -h" for details
-h, --help help for diff
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-s, --sort sort k-mers, this significantly reduces file size for k<=25. This flag
overrides global flag -c/--compact
```
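The `-t/--compare-taxid` rule in the `diff` help is easier to follow with a toy example. The sketch below assumes that "query" means the TaxId in the first file and "target" the TaxId of the same k-mer in a later file; that reading, the dictionary-based inputs, and all names are assumptions made for illustration.

```python
def is_ancestor(parent, ancestor, node):
    """True if `ancestor` lies strictly above `node` in a child -> parent map."""
    while node in parent and parent[node] != node:
        node = parent[node]
        if node == ancestor:
            return True
    return False


def diff_with_taxid(first, others, parent):
    """Keep k-mers of `first` that are absent from `others`, or whose taxid
    equals, or is an ancestor of, the taxid found in the other file."""
    kept = {}
    for kmer, query in first.items():
        removed = False
        for other in others:
            target = other.get(kmer)
            if target is None:
                continue  # k-mer not in this file
            if query == target or is_ancestor(parent, query, target):
                continue  # rule from the help text: the k-mer remains
            removed = True
            break
        if not removed:
            kept[kmer] = query
    return kept


if __name__ == "__main__":
    parent = {1: 1, 2: 1, 3: 2, 4: 1}  # toy taxonomy, root = 1
    first = {"AAA": 2, "CCC": 3, "GGG": 4, "TTT": 2}
    other = {"AAA": 3, "CCC": 2, "GGG": 4}
    print(diff_with_taxid(first, [other], parent))
```

In the toy data, `CCC` is the only k-mer removed: it is shared, and its taxid in the first file (3) is neither equal to nor an ancestor of the taxid in the other file (2).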
## sort
```text
Sort k-mers in binary files to reduce file size
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Notes:
1. When sorting from large number of files, this command is equivalent to
'unikmer split' + 'unikmer merge'.
Tips:
1. You can use '-m/--chunk-size' to limit memory usage, and chunk file size
depends on k-mers and file save mode (sorted/compact/normal).
2. Increasing the value of -j/--threads can accelerate the splitting stage,
at the cost of more memory usage.
3. For sorted input files, the memory usage is very low and speed is fast.
Usage:
unikmer sort [flags]
Flags:
-m, --chunk-size string split input into chunks of N k-mers, supports K/M/G suffix, type "unikmer
sort -h" for detail
--force overwrite tmp dir
-h, --help help for sort
-k, --keep-tmp-dir keep tmp dir
-M, --max-open-files int max number of open files (default 400)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-d, --repeated only print duplicate k-mers
-t, --tmp-dir string directory for intermediate files (default "./")
-u, --unique remove duplicate k-mers
```
## split
```text
Split k-mers into sorted chunk files
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Tips:
1. You can use '-m/--chunk-size' to limit memory usage, and chunk file size
depends on k-mers and file save mode (sorted/compact/normal).
2. Increasing the value of -j/--threads can accelerate the splitting stage,
at the cost of more memory usage.
3. For sorted input files, the memory usage is very low and speed is fast.
Usage:
unikmer split [flags]
Flags:
-m, --chunk-size string split input into chunks of N k-mers, supports K/M/G suffix, type "unikmer
sort -h" for detail
--force overwrite output directory
-h, --help help for split
-O, --out-dir string output directory
-d, --repeated split for further printing duplicate k-mers
-u, --unique split for further removing duplicate k-mers
```
## tsplit
```text
Split k-mers according to taxid
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should ALL have taxid information.
3. Input files should be sorted using 'unikmer sort'.
4. All k-mers will be loaded into RAM. For big input files,
you can 'split' them first, 'tsplit' and then 'concat'
for every taxid.
Tips:
1. Increasing the value of -j/--threads can accelerate the splitting stage,
at the cost of more memory usage.
Usage:
unikmer tsplit [flags]
Flags:
--force overwrite output directory
-h, --help help for tsplit
-O, --out-dir string output directory
-o, --out-prefix string out file prefix (default "tsplit")
```
## merge
```text
Merge k-mers from sorted chunk files
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
3. Input files should be sorted.
Tips:
1. If you don't need to compute unique or repeated k-mers,
use 'unikmer concat -s', which is faster.
Usage:
unikmer merge [flags]
Flags:
--force overwrite tmp dir
-h, --help help for merge
-D, --is-dir input files are directories containing chunk files
-k, --keep-tmp-dir keep tmp dir
-M, --max-open-files int max number of open files (default 400)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-p, --pattern string chunk file pattern (regular expression) (default "^chunk_\\d+\\.unik$")
-d, --repeated only print duplicate k-mers
-t, --tmp-dir string directory for intermediate files (default "./")
-u, --unique remove duplicate k-mers
```
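Merging sorted chunk files is a classic k-way merge. The sketch below uses Python's `heapq.merge` on in-memory lists to show the technique and how `-u/--unique` and `-d/--repeated` could act on the merged stream. It does not read `.unik` files, the function name is hypothetical, and emitting each repeated k-mer once is an assumption about `-d`.

```python
import heapq
from itertools import groupby


def merge_chunks(chunks, unique=False, repeated=False):
    """K-way merge of ascending chunks.

    unique   -> drop duplicate k-mers
    repeated -> keep only k-mers seen at least twice (assumed: once each)
    """
    merged = heapq.merge(*chunks)  # lazy merge, one item per chunk in memory
    if not (unique or repeated):
        return list(merged)
    out = []
    for kmer, group in groupby(merged):
        n = sum(1 for _ in group)
        if repeated and n < 2:
            continue
        out.append(kmer)
    return out


if __name__ == "__main__":
    chunks = [[1, 4, 7], [2, 4, 9], [4, 7]]
    print(merge_chunks(chunks))
    print(merge_chunks(chunks, unique=True))
    print(merge_chunks(chunks, repeated=True))
```

Because equal k-mers are adjacent in the merged stream, duplicates can be detected without holding all k-mers in memory, which is why the inputs must be sorted.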
## head
```text
Extract the first N k-mers
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Usage:
unikmer head [flags]
Flags:
-h, --help help for head
-n, --number int number of k-mers to extract (default 10)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
```
## sample
```text
Sample k-mers from binary files.
The Sampling type is fixed sampling.
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Input files should either ALL have or ALL lack taxid information.
Usage:
unikmer sample [flags]
Flags:
-h, --help help for sample
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-s, --start int start location (default 1)
-w, --window int window size (default 1)
```
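The help text only says that the sampling type is "fixed sampling" with a start location and a window size. One plausible reading, sketched below, is that one k-mer is kept per window of `w` k-mers, beginning at the 1-based position `s`. This interpretation and the function name `fixed_sample` are assumptions.

```python
def fixed_sample(kmers, start=1, window=1):
    """Keep every `window`-th k-mer, beginning at 1-based position `start`.

    Assumed semantics of -s/--start and -w/--window; with the defaults
    (start=1, window=1) every k-mer is kept.
    """
    if start < 1 or window < 1:
        raise ValueError("start and window must be >= 1")
    return list(kmers)[start - 1::window]


if __name__ == "__main__":
    print(fixed_sample(range(1, 11), start=2, window=3))
```

Unlike random sampling, this kind of scheme is deterministic: the same input and flags always give the same subset.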
## grep
```text
Search k-mers from binary files
Attentions:
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Canonical k-mers are used and outputted.
3. Input files should either ALL have or ALL lack taxid information.
Tips:
1. Increase the value of '-j' for better performance when dealing with
lots of files, especially on SSD.
2. For searching using binary .unik file, use 'unikmer inter --mix-taxid',
which is faster than 'unikmer grep' in single-thread mode.
Usage:
unikmer grep [flags]
Flags:
-D, --degenerate query k-mers contain degenerate bases
--force overwrite output directory
-h, --help help for grep
-v, --invert-match invert the sense of matching, to select non-matching records
-m, --multiple-outfiles write results into separated files for multiple input files
-O, --out-dir string output directory (default "unikmer-grep")
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-S, --out-suffix string output suffix (default ".grep")
-q, --query strings query k-mers/taxids (multiple values delimited by commas are supported)
-f, --query-file strings query file (one k-mer/taxid per line)
-t, --query-is-taxid queries are taxids
-F, --query-unik-file strings query file in .unik format
-d, --repeated only print duplicate k-mers
-s, --sort sort k-mers, this significantly reduces file size for k<=25. This flag
overrides global flag -c/--compact
-u, --unique remove duplicate k-mers
```
## filter
```text
Filter out low-complexity k-mers (experimental)
Attentions:
1. This command currently only detects single-base repeats.
Usage:
unikmer filter [flags]
Flags:
-h, --help help for filter
-v, --invert invert result, i.e., output low-complexity k-mers
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-d, --penalty-d int penalty for different bases (default 1)
-s, --penalty-s int penalty for successive bases (default 3)
-t, --threshold int penalty threshold for filter, higher is stricter (default 15)
-w, --window int window size for checking penalty (default 7)
```
## rfilter
```text
Filter k-mers by taxonomic rank
Attentions:
1. Flags -L/--lower-than and -H/--higher-than are mutually exclusive; either can be
used along with -E/--equal-to, whose values can be different.
2. A list of pre-ordered ranks is in ~/.unikmer/ranks.txt, you can use
your list by -r/--rank-file, the format specification is below.
3. All ranks in taxonomy database should be defined in rank file.
4. Ranks can be removed with black list via -B/--black-list.
5. TaxIds with no rank can be optionally discarded by -N/--discard-noranks.
6. But when filtering with -L/--lower-than, you can use
-n/--save-predictable-norank to save some special ranks without order,
where rank of the closest higher node is still lower than rank cutoff.
Rank file:
1. Blank lines or lines starting with "#" are ignored.
2. Ranks are in descending order and case is ignored.
3. Ranks with same order should be in one line separated with comma (",", no space).
4. Ranks without order should be assigned a prefix symbol "!" for each rank.
Usage:
unikmer rfilter [flags]
Flags:
-B, --black-list strings black list of ranks to discard, e.g., '"no rank", "clade"'
-N, --discard-noranks discard ranks without order, type "unikmer rfilter --help" for details
-R, --discard-root discard root taxid, defined by --root-taxid
-E, --equal-to strings output taxIDs with rank equal to some ranks, multiple values can be
separated with comma "," (e.g., -E "genus,species"), or give multiple
times (e.g., -E genus -E species)
-h, --help help for rfilter
-H, --higher-than string output ranks higher than a rank, exclusive with --lower-than
--list-order list defined ranks in order
--list-ranks list ordered ranks in taxonomy database
-L, --lower-than string output ranks lower than a rank, exclusive with --higher-than
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-r, --rank-file string user-defined ordered taxonomic ranks, type "unikmer rfilter --help"
for details
--root-taxid uint32 root taxid (default 1)
-n, --save-predictable-norank do not discard some special ranks without order when using -L, where
rank of the closest higher node is still lower than rank cutoff
```
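The rank file rules listed in the `rfilter` help (comments and blank lines ignored, descending order, comma-separated ranks sharing one order, `!` marking ranks without order) can be turned into a small parser. The sketch below follows only those four rules; the function name and the returned data structures are illustrative choices, not unikmer's own.

```python
def parse_rank_file(text):
    """Parse rank-file text into (order, unordered).

    order     : rank name -> level, where 0 is the highest rank
    unordered : set of rank names prefixed with "!" (no defined order)
    """
    order = {}
    unordered = set()
    level = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # rule 1: blank lines and comments are ignored
        has_ordered = False
        for rank in line.split(","):  # rule 3: same order, one line
            rank = rank.strip().lower()  # rule 2: case is ignored
            if rank.startswith("!"):  # rule 4: rank without order
                unordered.add(rank[1:])
            else:
                order[rank] = level
                has_ordered = True
        if has_ordered:
            level += 1  # rule 2: later lines are lower ranks
    return order, unordered


if __name__ == "__main__":
    example = "# demo\n\nsuperkingdom\ngenus\nspecies\nsubspecies,strain\n!no rank,!clade\n"
    print(parse_rank_file(example))
```

With such a mapping, a filter like "lower than genus" reduces to comparing levels, while ranks in the unordered set need separate handling such as the `-N` and `-n` flags.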
## locate
```text
Locate k-mers in genome
Attention:
0. All files should have the 'canonical' flag.
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Output is BED6 format.
3. When using the experimental flag --circular, the leading subsequence of k-1 bp
is appended to the end of the sequence. The end position of k-mers that cross
the sequence end will be greater than the sequence length.
Usage:
unikmer locate [flags]
Flags:
--circular circular genome. type "unikmer locate -h" for details
-g, --genome strings genomes in (gzipped) fasta file(s)
-h, --help help for locate
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-B, --seq-name-filter strings list of regular expressions for filtering out sequences by
header/name, case ignored.
```
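The `--circular` note can be illustrated with a short sketch: the first k-1 bases are appended to the sequence, so a k-mer spanning the origin is found with an end coordinate beyond the sequence length. The function below is a simplified illustration with a hypothetical name. It searches one strand only and returns 0-based, half-open intervals; it does not produce BED6 records.

```python
def locate_kmer(seq, kmer, circular=False):
    """Return (start, end) intervals, 0-based and half-open, of kmer in seq.

    For circular genomes the leading k-1 bases are appended to the end,
    so intervals crossing the origin end beyond len(seq).
    """
    k = len(kmer)
    text = seq + seq[:k - 1] if circular else seq
    hits = []
    start = text.find(kmer)
    while start != -1:
        hits.append((start, start + k))
        start = text.find(kmer, start + 1)  # allow overlapping matches
    return hits


if __name__ == "__main__":
    genome = "ACGTAC"
    print(locate_kmer(genome, "ACA"))                 # linear genome
    print(locate_kmer(genome, "ACA", circular=True))  # circular genome
```

Appending only k-1 bases is enough because no k-mer that starts inside the original sequence can extend further than that past its end.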
## map
```text
Mapping k-mers back to the genome and extract successive regions/subsequences
Attention:
0. By default, only uniquely mapped k-mers are considered.
You can use -M/--allow-multiple-mapped-kmers to allow multiple-mapped k-mers.
1. The 'canonical/scaled/hashed' flags of all files should be consistent.
2. Default output is in BED3 format, with left-closed and right-open
0-based interval.
3. When using flag --circular, the end position of subsequences that
cross the genome sequence end will be greater than the sequence length.
Usage:
unikmer map [flags]
Aliases:
map, uniqs
Flags:
-M, --allow-multiple-mapped-kmers allow multiple mapped k-mers
--circular circular genome. type "unikmer uniqs -h" for details
-g, --genome strings genomes in (gzipped) fasta file(s)
-h, --help help for map
-X, --max-gap-num int max number of gaps (consecutive unmapped k-mers)
-x, --max-gap-size int max gap size (the number of consecutive unmapped k-mers)
-m, --min-len int minimum length of subsequence (default 200)
-o, --out-prefix string out file prefix ("-" for stdout) (default "-")
-a, --output-fasta output fasta format instead of BED3
-B, --seq-name-filter strings list of regular expressions for filtering out sequences by
header/name, case ignored
-W, --seqs-in-a-file-as-one-genome treat seqs in a genome file as one genome
```