.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.47.8.
.TH GFFREAD "1" "June 2019" "gffread 0.11.2" "User Commands"
.SH NAME
gffread \- GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction
.SH SYNOPSIS
.B gffread
<input_gff> [\-g <genomic_seqs_fasta> | <dir>] [\-s <seq_info.fsize>]
[\-o <outfile.gff>] [\-t <trackname>] [\-r [[<strand>]<chr>:]<start>..<end> [\-R]]
[\-CTVNJMKQAFPGUBHZWOLE] [\-w <exons.fa>] [\-x <cds.fa>] [\-y <tr_cds.fa>]
[\-i <maxintron>] [\-\-sort\-by <refseq_list.txt>]
.SH DESCRIPTION
.IP
Filter and convert GFF3/GTF2 records, extract corresponding sequences etc.
By default (i.e. without \fB\-O\fR) only process transcripts, ignore other features.
.IP
<input_gff> is a GFF file, use '\-' for stdin
.SH OPTIONS
.TP
\fB\-i\fR
discard transcripts having an intron larger than <maxintron>
.TP
\fB\-l\fR
discard transcripts shorter than <minlen> bases
.TP
\fB\-r\fR
only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided)
.TP
\fB\-R\fR
for \fB\-r\fR option, discard all transcripts that are not fully
contained within the given range
.TP
\fB\-U\fR
discard single\-exon transcripts
.TP
\fB\-C\fR
coding only: discard mRNAs that have no CDS features
.HP
\fB\-\-nc\fR non\-coding only: discard mRNAs that have CDS features
.HP
\fB\-\-ignore\-locus\fR : discard locus features and attributes found in the input
.TP
\fB\-A\fR
use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record
.TP
\fB\-s\fR
<seq_info.fsize> is a tab\-delimited file providing this info
for each of the mapped sequences:
<seq\-name> <seq\-length> <seq\-description>
(useful for \fB\-A\fR option with mRNA/EST/protein mappings)
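.PP
A minimal sketch of how the filtering options above can be combined; the file
names annotation.gff and filtered.gff, the contig name chr1 and the
coordinates are hypothetical placeholders. It keeps only multi\-exon
transcripts that have CDS features and lie entirely within the given range:
.IP
.nf
gffread annotation.gff \-C \-U \-r chr1:1000..50000 \-R \-o filtered.gff
.fi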
.PP
Sorting: (by default, chromosomes are kept in the order they were found)
.HP
\fB\-\-sort\-alpha\fR : chromosomes (reference sequences) are sorted alphabetically
.HP
\fB\-\-sort\-by\fR : sort the reference sequences by the order in which their
.IP
names are given in the <refseq_list.txt> file
.SS "Misc options:"
.TP
\fB\-F\fR
attempt to preserve all GFF attributes
.HP
\fB\-\-keep\-exon\-attrs\fR : for \fB\-F\fR option, do not attempt to reduce redundant
.IP
exon/CDS attributes
.TP
\fB\-G\fR
do not keep exon attributes, move them to the transcript feature
(for GFF3 output)
.HP
\fB\-\-keep\-genes\fR : in transcript\-only mode (default), also preserve gene records
.HP
\fB\-\-keep\-comments\fR: for GFF3 input/output, try to preserve comments
.TP
\fB\-O\fR
process other non\-transcript GFF records (by default non\-transcript
records are ignored)
.TP
\fB\-V\fR
discard any mRNAs with CDS having in\-frame stop codons (requires \fB\-g\fR)
.TP
\fB\-H\fR
for \fB\-V\fR option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in\-frame stop codon
.TP
\fB\-B\fR
for \fB\-V\fR option, single\-exon transcripts are also checked on the
opposite strand (requires \fB\-g\fR)
.TP
\fB\-P\fR
add transcript level GFF attributes about the coding status of each
transcript, including partialness or in\-frame stop codons (requires \fB\-g\fR)
.HP
\fB\-\-add\-hasCDS\fR : add a "hasCDS" attribute with value "true" for transcripts
.IP
that have CDS features
.HP
\fB\-\-adj\-stop\fR stop codon adjustment: enables \fB\-P\fR and performs automatic
.IP
adjustment of the CDS stop coordinate if premature or downstream
.TP
\fB\-N\fR
discard multi\-exon mRNAs that have any intron with a non\-canonical
splice site consensus (i.e. not GT\-AG, GC\-AG or AT\-AC)
.TP
\fB\-J\fR
discard any mRNAs that lack either the initial START codon
or the terminal STOP codon, or that have an in\-frame stop codon
(i.e. only print mRNAs with a complete CDS)
.HP
\fB\-\-no\-pseudo\fR: filter out records matching the 'pseudo' keyword
.HP
\fB\-\-in\-bed\fR: input should be parsed as BED format (automatic if the input
.IP
filename ends with .bed*)
.HP
\fB\-\-in\-tlf\fR: input is in the GFF\-like one\-line\-per\-transcript format without exon/CDS
.IP
features (see the \fB\-\-tlf\fR option below); automatic if the input
filename ends with .tlf
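.PP
As an illustrative sketch of the CDS checks described above (annotation.gff,
genome.fa and checked.gff are hypothetical file names), the following discards
mRNAs whose CDS contains an in\-frame stop codon, after first trying to adjust
the starting CDS phase. The genomic sequence is supplied with \fB\-g\fR, as
\fB\-V\fR requires:
.IP
.nf
gffread annotation.gff \-g genome.fa \-V \-H \-o checked.gff
.fi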
.SS "Clustering:"
.HP
\fB\-M\fR/\-\-merge : cluster the input transcripts into loci, discarding
.IP
"duplicated" transcripts (those with the same exact introns
and fully contained or equal boundaries)
.HP
\fB\-d\fR <dupinfo> : for \fB\-M\fR option, write duplication info to file <dupinfo>
.HP
\fB\-\-cluster\-only\fR: same as \fB\-M\fR/\-\-merge but without discarding any of the
.IP
"duplicate" transcripts, only create "locus" features
.TP
\fB\-K\fR
for \fB\-M\fR option: also discard as redundant the shorter, fully contained
transcripts (intron chains matching a part of the container)
.TP
\fB\-Q\fR
for \fB\-M\fR option, no longer require boundary containment when assessing
redundancy (can be combined with \fB\-K\fR); only introns have to match for
multi\-exon transcripts, and >=80% overlap for single\-exon transcripts
.TP
\fB\-Y\fR
for \fB\-M\fR option, enforce \fB\-Q\fR but also discard overlapping single\-exon
transcripts, even on the opposite strand (can be combined with \fB\-K\fR)
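.PP
A hedged example of the clustering options above, assuming hypothetical file
names annotation.gff, dupinfo.txt and merged.gff. It clusters transcripts into
loci, also discards shorter, fully contained transcripts as redundant, and
records the duplication info in a separate file:
.IP
.nf
gffread annotation.gff \-M \-K \-d dupinfo.txt \-o merged.gff
.fi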
.SS "Output options:"
.HP
\fB\-\-force\-exons\fR: make sure that the lowest level GFF features are considered
.IP
"exon" features
.HP
\fB\-\-gene2exon\fR: for single\-line genes not parenting any transcripts, add an
.IP
exon feature spanning the entire gene (treat it as a transcript)
.TP
\fB\-D\fR
decode URL\-encoded characters within attributes
.TP
\fB\-Z\fR
merge very close exons into a single exon (when intron size<4)
.TP
\fB\-g\fR
full path to a multi\-fasta file with the genomic sequences
for all input mappings, OR a directory with single\-fasta files
(one per genomic sequence, with file names matching sequence names)
.TP
\fB\-w\fR
write a fasta file with spliced exons for each GFF transcript
.TP
\fB\-x\fR
write a fasta file with spliced CDS for each GFF transcript
.TP
\fB\-y\fR
write a protein fasta file with the translation of CDS for each record
.TP
\fB\-W\fR
for \fB\-w\fR and \fB\-x\fR options, write in the FASTA defline the exon
coordinates projected onto the spliced sequence;
for \fB\-y\fR option, write transcript attributes in the FASTA defline
.TP
\fB\-S\fR
for \fB\-y\fR option, use '*' instead of '.' as stop codon translation
.TP
\fB\-L\fR
Ensembl GTF to GFF3 conversion (implies \fB\-F\fR; should be used with \fB\-m\fR)
.TP
\fB\-m\fR
<chr_replace> is a name mapping table for converting reference
sequence names, having this 2\-column format:
<original_ref_ID> <new_ref_ID>
WARNING: all GFF records on reference sequences whose original IDs
are not found in the 1st column of this table will be discarded!
.TP
\fB\-t\fR
use <trackname> in the 2nd column of each GFF/GTF output line
.TP
\fB\-o\fR
print the GFF records to <outfile.gff> (those that passed any
given filters). Use \fB\-o\-\fR to enable printing to stdout
.TP
\fB\-T\fR
for \fB\-o\fR, output will be GTF instead of GFF3
.HP
\fB\-\-bed\fR for \fB\-o\fR, output BED format instead of GFF3
.HP
\fB\-\-tlf\fR for \fB\-o\fR, output "transcript line format" which is like GFF
.IP
but exons, CDS features and related data are stored as GFF
attributes in the transcript feature line, like this:
.IP
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
.IP
<exons> is a comma\-delimited list of exon_start\-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>;
.HP
\fB\-v\fR, \fB\-E\fR : expose (warn about) duplicate transcript IDs and other potential
.IP
problems with the given GFF/GTF records
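.PP
Two typical uses of the output options above, given as a sketch rather than a
definitive recipe; all file names (annotation.gff, annotation.gtf, genome.fa,
exons.fa, cds.fa, proteins.fa) are hypothetical. The first command converts
the input to GTF, the second extracts spliced exon, spliced CDS and protein
sequences using the genomic FASTA given with \fB\-g\fR:
.IP
.nf
# convert to GTF
gffread annotation.gff \-T \-o annotation.gtf
# extract transcript, CDS and protein sequences
gffread annotation.gff \-g genome.fa \-w exons.fa \-x cds.fa \-y proteins.fa
.fi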
.SH AUTHOR
This manpage was written by Andreas Tille for the Debian distribution and may be used by others.