% alignment-thin(1)
% Benjamin Redelings
% Feb 2018

# NAME

**alignment-thin** - Remove sequences or columns from an alignment.

# SYNOPSIS

**alignment-thin** _alignment-file_ [OPTIONS]

# DESCRIPTION

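Remove sequences or columns from an alignment. Sequences can be selected by name, by length, by similarity to other sequences, or by how many conserved sites they are missing. Columns can be selected by the number of letters they contain.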

# GENERAL OPTIONS:
**-h**, **--help**
: Print usage information.

**-V**, **--verbose**
: Output more log messages on stderr.


# SEQUENCE FILTERING OPTIONS:
**-p** _arg_, **--protect** _arg_
: Sequences that cannot be removed (comma-separated).

**-k** _arg_, **--keep** _arg_
: Remove sequences not in comma-separated list _arg_.

**-r** _arg_, **--remove** _arg_
: Remove sequences in comma-separated list _arg_.

**-l** _arg_, **--longer-than** _arg_
: Remove sequences not longer than _arg_.

**-s** _arg_, **--shorter-than** _arg_
: Remove sequences not shorter than _arg_.

**-c** _arg_, **--cutoff** _arg_
: Remove similar sequences with #mismatches < cutoff.

**-d** _arg_, **--down-to** _arg_
: Remove similar sequences down to _arg_ sequences.

**--remove-gappy** _arg_
: Remove _arg_ outlier sequences, defined as sequences that are missing too many conserved sites.

**--conserved** _arg_ (=0.75)
: Fraction of sequences that must contain a letter in a column for that column to be considered conserved.


# COLUMN FILTERING OPTIONS:
**-K** _arg_, **--keep-columns** _arg_
: Keep columns from this sequence.

**-m** _arg_, **--min-letters** _arg_
: Remove columns with fewer than _arg_ letters.

**-u** _arg_, **--remove-unique** _arg_
: Remove insertions in a single sequence if longer than _arg_ letters.

**-e**, **--erase-empty-columns**
: Remove columns with no characters (all gaps).


# OUTPUT OPTIONS:
**-S**, **--sort**
: Sort partially ordered columns to group similar gaps.

**-L**, **--show-lengths**
: Just print out sequence lengths.

**-N**, **--show-names**
: Just print out sequence names.

**-F** _arg_, **--find-dups** _arg_
: For each sequence, find the closest other sequence.


# EXAMPLES:
 
Remove columns without a minimum number of letters:
```
% alignment-thin --min-letters=5 file.fasta > file-thinned.fasta
```
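
To make the column filter concrete, here is a minimal Python sketch of what a minimum-letters filter does. It assumes the alignment is held as equal-length strings and that `-` is the only gap character. The function name `drop_sparse_columns` is hypothetical, and this illustrates the idea rather than the program's actual implementation.

```python
def drop_sparse_columns(seqs, min_letters, gap="-"):
    """Keep only columns with at least `min_letters` non-gap characters."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that contain enough letters to be kept.
    keep = [
        col for col in range(length)
        if sum(s[col] != gap for s in seqs) >= min_letters
    ]
    return ["".join(s[col] for col in keep) for s in seqs]


aligned = ["AC-T", "A--T", "ACGT"]
# The third column has only one letter, so it is dropped.
print(drop_sparse_columns(aligned, 2))  # ['ACT', 'A-T', 'ACT']
```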

Remove sequences by name:
```
% alignment-thin --remove=seq1,seq2 file.fasta > file2.fasta
```

Keep only the named sequences:
```
% alignment-thin --keep=seq1,seq2 file.fasta > file2.fasta
```

Remove short sequences:
```
% alignment-thin --longer-than=250 file.fasta > file-long.fasta
```

Remove similar sequences with <= 5 differences from the closest other sequence:
```
% alignment-thin --cutoff=5 file.fasta > more-than-5-differences.fasta
```
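
The similarity-based options compare each sequence with its closest other sequence. The Python sketch below shows one way such a comparison could be computed. It assumes that a mismatch is any aligned position where the two characters differ, including a letter opposite a gap; the program may treat gaps differently. The names `mismatches` and `closest_other` are hypothetical.

```python
def mismatches(a, b):
    """Count aligned positions where two sequences differ (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_other(seqs):
    """Map each name to (name of nearest other sequence, mismatch count)."""
    result = {}
    for name, seq in seqs.items():
        others = [(mismatches(seq, s), n) for n, s in seqs.items() if n != name]
        if others:  # nothing to compare against if there is only one sequence
            dist, nearest = min(others)
            result[name] = (nearest, dist)
    return result


example = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
for name, (nearest, dist) in closest_other(example).items():
    print(name, nearest, dist)
```

In this toy input, `a` and `b` differ at one position, so a similarity cutoff larger than 1 would flag them as near-duplicates, while `c` is at least 3 mismatches from everything else.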

Remove similar sequences until we have the right number of sequences:
```
% alignment-thin --down-to=30 file.fasta > file-30taxa.fasta
```

Remove dissimilar sequences that are missing conserved columns:
```
% alignment-thin --remove-gappy=10 file.fasta > file2.fasta
```
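
As an illustration of how conserved columns and gappy outliers relate, here is a Python sketch. It assumes that a column counts as conserved when the fraction of sequences with a letter is at least the `--conserved` value (0.75 by default), and that `-` is the only gap character. Whether the program uses `>=` or `>`, and how it ranks outliers, is not stated above, so treat those details as assumptions. The function name `missing_conserved` is hypothetical.

```python
def missing_conserved(seqs, conserved=0.75, gap="-"):
    """For each sequence, count conserved columns where it has a gap."""
    names = list(seqs)
    if not names:
        return {}
    length = len(seqs[names[0]])
    # A column is treated as conserved if enough sequences have a letter there.
    conserved_cols = [
        col for col in range(length)
        if sum(seqs[n][col] != gap for n in names) >= conserved * len(names)
    ]
    return {n: sum(seqs[n][col] == gap for col in conserved_cols) for n in names}


example = {"s1": "ACGT", "s2": "ACGT", "s3": "ACG-", "s4": "A---"}
print(missing_conserved(example))  # {'s1': 0, 's2': 0, 's3': 0, 's4': 2}
```

Here the last column is not conserved (only half the sequences have a letter), so `s3` is not penalized for its gap there, while `s4` is missing two conserved columns and would be the first candidate for removal.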

Protect some sequences from being removed:
```
% alignment-thin --down-to=30 file.fasta --protect=seq1,seq2 > file2.fasta
```

Protect sequences whose names are listed in a file:
```
% alignment-thin --down-to=30 file.fasta --protect=@filename > file2.fasta
```


# REPORTING BUGS:
BAli-Phy online help: <http://www.bali-phy.org/docs.php>.

Please send bug reports to <bali-phy-users@googlegroups.com>.