.TH MSA_SPLIT "1" "May 2016" "msa_split 1.4" "User Commands"
.SH NAME
msa_split \- partition a multiple sequence alignment at designated columns or by category labels
.SH DESCRIPTION
Partitions a multiple sequence alignment either at designated
columns, or according to specified category labels, and outputs
sub\-alignments for the partitions.  Optionally splits an
associated annotations file.
.SH EXAMPLE
.PP
(See below for details on options)
.PP
1. Read an alignment for a whole human chromosome from a MAF file
and extract sub\-alignments in 1Mb windows overlapping by 1kb.  Use
sufficient statistics (SS) format for output (can be used by
phyloFit, phastCons, or exoniphy).  Set window boundaries between
alignment blocks, if possible.
.IP
msa_split chr1.maf \fB\-\-refseq\fR chr1.fa \fB\-\-in\-format\fR MAF
\fB\-\-windows\fR 1000000,1000 \fB\-\-out\-format\fR SS
\fB\-\-between\-blocks\fR 5000 \fB\-\-out\-root\fR chr1
.PP
(Windows will be defined using the coordinate system of the first
sequence in the alignment, assumed to be the reference sequence;
output will be to chr1.1\-1000000.ss, chr1.999001\-1999000.ss, ...)
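.PP
The file names quoted above follow from simple window arithmetic.  The
Python sketch below is an illustration only, not code from msa_split: the
function names and the 2,500,000\-base reference length are hypothetical,
it assumes the last window is truncated at the end of the reference, and
it ignores any boundary shifts made by \fB\-\-between\-blocks\fR.
.PP
.nf
```python
def window_ranges(length, win_size, win_overlap):
    """Yield 1-based inclusive (start, end) windows over `length` bases."""
    if not 0 <= win_overlap < win_size:
        raise ValueError("overlap must be >= 0 and smaller than the window size")
    step = win_size - win_overlap  # distance between consecutive window starts
    start = 1
    while start <= length:
        # Assumption: the final window is cut off at the end of the reference.
        end = min(start + win_size - 1, length)
        yield start, end
        if end == length:
            break
        start += step


def output_names(root, length, win_size, win_overlap, suffix="ss"):
    """Build names of the form <root>.<start>-<end>.<suffix> for each window."""
    return ["%s.%d-%d.%s" % (root, s, e, suffix)
            for s, e in window_ranges(length, win_size, win_overlap)]


# Hypothetical 2.5 Mb reference, 1 Mb windows overlapping by 1 kb.
names = output_names("chr1", 2500000, 1000000, 1000)
print(names)
```
.fi
.PP
Under these assumptions the first two names match the ones listed above.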
.PP
2. As in (1), but report unordered sufficient statistics (much
more compact and adequate for use with phyloFit).
.IP
msa_split chr1.maf \fB\-\-refseq\fR chr1.fa \fB\-\-in\-format\fR MAF
\fB\-\-windows\fR 1000000,1000 \fB\-\-out\-format\fR SS
\fB\-\-between\-blocks\fR 5000 \fB\-\-out\-root\fR chr1 \fB\-\-unordered\-ss\fR
.PP
3. Extract sub\-alignments of sites in conserved elements and not
in conserved elements, as defined by a BED file (coordinates
assumed to be for 1st sequence).  Read multiple alignment in FASTA
format.
.IP
msa_split mydata.fa \fB\-\-features\fR conserved.bed \fB\-\-by\-category\fR
\fB\-\-out\-root\fR mydata
.PP
(Output will be to mydata.background\-0.fa and mydata.bed_feature\-1.fa;
the latter has sites of category number 1, defined by the BED file.)
.PP
4. Extract sub\-alignments of sites in each of the three codon
positions, as defined by a GFF file (coordinates assumed to be for
1st sequence).  Reverse complement genes on minus strand.
.IP
msa_split chr22.maf \fB\-\-in\-format\fR MAF \fB\-\-features\fR chr22.gff
\fB\-\-by\-category\fR \fB\-\-catmap\fR "NCATS = 3 ; CDS 1\-3" \fB\-\-do\-cats\fR CDS
\fB\-\-reverse\-compl\fR \fB\-\-out\-root\fR chr22 \fB\-\-out\-format\fR SS
.PP
(Output will be to chr22.cds\-1.ss, chr22.cds\-2.ss, chr22.cds\-3.ss)
.PP
5. Split an alignment into pieces corresponding to the genes in a
GFF file.  Assume genes are defined by the tag "transcript_id".
.IP
msa_split cftr.fa \fB\-\-features\fR cftr.gff \fB\-\-by\-group\fR transcript_id
.PP
6. Obtain a sub\-alignment for each of a set of regulatory regions,
as defined in a BED file.
.IP
msa_split chr22.maf \fB\-\-in\-format\fR MAF \fB\-\-refseq\fR chr22.fa
\fB\-\-features\fR chr22.reg.bed \fB\-\-for\-features\fR
\fB\-\-out\-root\fR chr22.reg
.SH OPTIONS
.SS Splitting options
.HP
\fB\-\-windows\fR, \fB\-w\fR <win_size,win_overlap>
.IP
Split the alignment into "windows" of size <win_size> bases,
overlapping by <win_overlap>.
.HP
\fB\-\-by\-category\fR, \fB\-L\fR
.IP
(Requires \fB\-\-features\fR) Split by category, as defined by
annotations file and (optionally) category map (see
\fB\-\-catmap\fR)
.HP
\fB\-\-by\-group\fR, \fB\-P\fR <tag>
.IP
(Requires \fB\-\-features\fR) Split by groups in annotation file,
as defined by specified tag.  Splits midway between every
pair of consecutive groups.  Features will be sorted by group.
There should be no overlapping features (see 'refeature
\fB\-\-unique\fR').
.HP
\fB\-\-for\-features\fR, \fB\-F\fR
.IP
(Requires \fB\-\-features\fR) Extract section of alignment
corresponding to every feature.  There will be no output for
regions not covered by features.
.HP
\fB\-\-by\-index\fR, \fB\-p\fR <indices>
.IP
List of explicit indices at which to split alignment
(comma\-separated).  If the list of indices is "10,20",
then sub\-alignments will be output for sites 1\-9, 10\-19, and
20\-<msa_len>.  Note that the indices are relative to the
input alignment, and not necessarily in genomic coordinates.
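.PP
The "10,20" example can be sketched in Python as follows.  This is an
illustration of the documented behaviour rather than msa_split's own code;
the function name, the alignment length of 50, and the bounds check are
assumptions made for the example.
.PP
.nf
```python
def split_by_index(msa_len, indices):
    """Return 1-based inclusive (start, end) ranges for the given split points."""
    points = sorted(set(indices))
    # Assumption: a split point must leave a non-empty piece on both sides.
    if any(p <= 1 or p > msa_len for p in points):
        raise ValueError("split indices must lie in 2..msa_len")
    starts = [1] + points                      # each split point starts a piece
    ends = [p - 1 for p in points] + [msa_len]  # and ends the previous one
    return list(zip(starts, ends))


# Hypothetical alignment of 50 columns split at "10,20".
pieces = split_by_index(50, [10, 20])
print(pieces)
```
.fi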
.HP
\fB\-\-npartitions\fR, \fB\-n\fR <number>
.IP
Split alignment equally into specified number of partitions.
.HP
\fB\-\-between\-blocks\fR, \fB\-B\fR <radius>
.IP
(Not for use with \fB\-\-by\-category\fR or \fB\-\-for\-features\fR) Try to
partition at sites between alignment blocks.  Assumes a
reference sequence alignment, with the first sequence as the
reference seq (as created by multiz).  Blocks of 30 sites with
gaps in all sequences but the reference seq are assumed to
indicate boundaries between alignment blocks.  Partition
indices will not be moved more than <radius> sites.
.HP
\fB\-\-features\fR, \fB\-g\fR <fname>
.IP
(For use with \fB\-\-by\-category\fR, \fB\-\-by\-group\fR, \fB\-\-for\-features\fR, or
\fB\-\-windows\fR) Annotations file.  May be GFF, BED, or genepred
format.  Coordinates are assumed to be in the coordinate frame of
the first sequence in the alignment (assumed to be the reference
sequence).
.HP
\fB\-\-catmap\fR, \fB\-c\fR <fname>|<string>
.IP
(Optionally use with \fB\-\-by\-category\fR) Mapping of feature types
to category numbers.  Can either give a filename or an
"inline" description of a simple category map, e.g.,
\fB\-\-catmap\fR "NCATS = 3 ; CDS 1\-3" or \fB\-\-catmap\fR "NCATS = 1 ; UTR
1".
.HP
\fB\-\-refidx\fR, \fB\-d\fR <frame_index>
.IP
(For use with \fB\-\-windows\fR or \fB\-\-by\-index\fR) Index of frame of
reference for split indices.  Default is 1 (1st sequence
assumed reference).
.SS File names & formats, type of output, etc.
.HP
\fB\-\-in\-format\fR, \fB\-i\fR FASTA|PHYLIP|MPM|MAF|SS
.IP
Input alignment file format.  Default is to guess format from
file contents.
.HP
\fB\-\-refseq\fR, \fB\-M\fR <fname>
.IP
(For use with \fB\-\-in\-format\fR MAF) Name of file containing
reference sequence, in FASTA format.
.HP
\fB\-\-out\-format\fR, \fB\-o\fR FASTA|PHYLIP|MPM|SS
.IP
Output alignment file format.  Default is FASTA.
.HP
\fB\-\-out\-root\fR, \fB\-r\fR <name>
.IP
Filename root for output files (default "msa_split").
.HP
\fB\-\-sub\-features\fR, \fB\-f\fR
.IP
(For use with \fB\-\-features\fR) Output subsets of features
corresponding to subalignments.  Features overlapping
partition boundaries will be discarded.  Not permitted with
\fB\-\-by\-category\fR.
.HP
\fB\-\-reverse\-compl\fR, \fB\-s\fR
.IP
Reverse complement all segments having at least one feature on
the reverse strand and none on the positive strand.  For use
with \fB\-\-by\-group\fR.  Can also be used with \fB\-\-by\-category\fR to ensure
all sites in a category are represented in the same strand
orientation.
.HP
\fB\-\-gap\-strip\fR, \fB\-G\fR ALL|ANY|<seqno>
.IP
Strip columns in output alignments containing all gaps, any
gaps, or gaps in the specified sequence (<seqno>; indexing
begins with one).  Default is not to strip any columns.
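.PP
The three stripping modes can be sketched in Python as below.  This is a
minimal illustration, not the program's implementation: the function name
is hypothetical, rows are plain strings of equal length, and "\-" is
assumed to be the only gap character.
.PP
.nf
```python
GAP = "-"  # assumption: "-" is the only gap character


def gap_strip(rows, mode):
    """Drop columns per mode: "ALL", "ANY", or a 1-based sequence number."""
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same length")

    def drop(col):
        if mode == "ALL":
            return all(c == GAP for c in col)
        if mode == "ANY":
            return any(c == GAP for c in col)
        return col[int(mode) - 1] == GAP  # gaps in one chosen sequence

    kept = [col for col in zip(*rows) if not drop(col)]
    if not kept:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


rows = ["AC-G-T", "A--G-A", "ACTG--"]
print(gap_strip(rows, "ALL"))
print(gap_strip(rows, "ANY"))
print(gap_strip(rows, 1))
```
.fi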
.HP
\fB\-\-seqs\fR, \fB\-l\fR <seq_list>
.IP
Include only specified sequences in output.  Indicate by
sequence number or name (numbering starts with 1 and is
evaluated *after* \fB\-\-order\fR is applied).
.HP
\fB\-\-exclude\fR, \fB\-x\fR
.IP
Exclude rather than include specified sequences.
.HP
\fB\-\-order\fR, \fB\-O\fR <name_list>
.IP
Change order of rows in alignment to match sequence names
specified in name_list.  If a name appears in name_list but
not in the alignment, a row of gaps will be inserted.
.HP
\fB\-\-min\-informative\fR, \fB\-I\fR <n>
.IP
Only output alignments having at least <n> informative sites
(sites at which at least two non\-gap, non\-N bases are present).
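.PP
The definition of an informative site can be sketched in Python as below.
This is an illustration under stated assumptions, not the program's code:
the function name is hypothetical, "\-" is taken as the gap character, and
"N" is matched case\-insensitively.
.PP
.nf
```python
def count_informative(rows, uninformative="-N"):
    """Count columns holding at least two characters that are neither gap nor N."""
    total = 0
    for col in zip(*rows):
        # Assumption: lower-case "n" is treated like "N".
        bases = sum(1 for c in col if c.upper() not in uninformative)
        if bases >= 2:
            total += 1
    return total


rows = ["ACN-", "A-NT", "-CGT"]
n_informative = count_informative(rows)
print(n_informative)
```
.fi
.PP
A sub\-alignment would then be written only if this count is at least <n>.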
.HP
\fB\-\-do\-cats\fR, \fB\-C\fR <cat_list>
.IP
(For use with \fB\-\-by\-category\fR) Output sub\-alignments for only the
specified categories (comma\-delimited list).
.HP
\fB\-\-tuple\-size\fR, \fB\-T\fR <tuple_size>
.IP
(for use with \fB\-\-by\-category\fR or \fB\-\-out\-format\fR SS) Size of tuples
of columns to consider in downstream analysis (e.g., with
context\-dependent phylogenetic models; see 'phyloFit').  With
\fB\-\-by\-category\fR, insert tuple_size\-1 columns of missing data
between sites that were not adjacent in the original alignment,
to avoid creating artificial context.  With \fB\-\-out\-format\fR SS,
express sufficient statistics in terms of tuples of specified size.
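.PP
The padding applied with \fB\-\-by\-category\fR can be sketched in Python as
below.  This is a rough illustration, not the program's implementation: the
function name is hypothetical, columns are given as strings with one
character per sequence, and "N" is assumed as the missing\-data character.
.PP
.nf
```python
def join_with_padding(columns, positions, tuple_size, missing="N"):
    """Join selected columns, padding wherever original positions are not adjacent.

    columns   -- column strings, one character per sequence
    positions -- 1-based coordinates of those columns in the original alignment
    """
    if not columns:
        return []
    nrows = len(columns[0])
    pad = [missing * nrows] * (tuple_size - 1)  # tuple_size-1 missing columns
    out = []
    prev = None
    for pos, col in zip(positions, columns):
        if prev is not None and pos != prev + 1:
            out.extend(pad)  # break the artificial context between the sites
        out.append(col)
        prev = pos
    return ["".join(row) for row in zip(*out)]


# Columns 5 and 6 were adjacent in the original alignment; column 9 was not.
padded = join_with_padding(["AA", "CC", "GG"], [5, 6, 9], 3)
print(padded)
```
.fi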
.HP
\fB\-\-unordered\-ss\fR, \fB\-z\fR
.IP
(For use with \fB\-\-out\-format\fR SS) Suppress the portion of the
sufficient statistics concerned with the order in which columns
appear.
.HP
\fB\-\-summary\fR, \fB\-S\fR
.IP
Output summary of each output alignment to a file with suffix
".sum" (includes base frequencies and numbers of gapped columns).
.SS Other
.HP
\fB\-\-quiet\fR, \fB\-q\fR
.IP
Proceed quietly.
.HP
\fB\-\-help\fR, \fB\-h\fR
.IP
Print this help message.