.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.47.8.
.TH GFFREAD "1" "June 2019" "gffread 0.11.2" "User Commands"
.SH NAME
gffread \- GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction
.SH SYNOPSIS
.B gffread
<input_gff> [\-g <genomic_seqs_fasta> | <dir>] [\-s <seq_info.fsize>]
[\-o <outfile.gff>] [\-t <tname>] [\-r [[<strand>]<chr>:]<start>..<end> [\-R]]
[\-CTVNJMKQAFPGUBHZWTOLE] [\-w <exons.fa>] [\-x <cds.fa>] [\-y <tr_cds.fa>]
[\-i <maxintron>] [\-\-sort\-by <refseq_list.txt>]
.SH DESCRIPTION
.IP
Filter and convert GFF3/GTF2 records, extract corresponding sequences etc.
By default (i.e. without \fB\-O\fR), only transcripts are processed; other features are ignored.
.IP
<input_gff> is a GFF file; use '\-' to read from stdin
.SH OPTIONS
.TP
\fB\-i\fR
discard transcripts having an intron larger than <maxintron>
.TP
\fB\-l\fR
discard transcripts shorter than <minlen> bases
.TP
\fB\-r\fR
only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided)
.TP
\fB\-R\fR
for \fB\-r\fR option, discard all transcripts that are not fully
contained within the given range
.TP
\fB\-U\fR
discard single\-exon transcripts
.TP
\fB\-C\fR
coding only: discard mRNAs that have no CDS features
.HP
\fB\-\-nc\fR non\-coding only: discard mRNAs that have CDS features
.HP
\fB\-\-ignore\-locus\fR : discard locus features and attributes found in the input
.TP
\fB\-A\fR
use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record
.TP
\fB\-s\fR
<seq_info.fsize> is a tab\-delimited file providing this info
for each of the mapped sequences:
<seq\-name> <seq\-length> <seq\-description>
(useful for \fB\-A\fR option with mRNA/EST/protein mappings)
.PP
Sorting: (by default, chromosomes are kept in the order they were found)
.HP
\fB\-\-sort\-alpha\fR : chromosomes (reference sequences) are sorted alphabetically
.HP
\fB\-\-sort\-by\fR : sort the reference sequences by the order in which their
.IP
names are given in the <refseq_list.txt> file
.SS "Misc options:"
.TP
\fB\-F\fR
attempt to preserve all GFF attributes
.HP
\fB\-\-keep\-exon\-attrs\fR : for \fB\-F\fR option, do not attempt to reduce redundant
.IP
exon/CDS attributes
.TP
\fB\-G\fR
do not keep exon attributes, move them to the transcript feature
(for GFF3 output)
.HP
\fB\-\-keep\-genes\fR : in transcript\-only mode (default), also preserve gene records
.HP
\fB\-\-keep\-comments\fR: for GFF3 input/output, try to preserve comments
.TP
\fB\-O\fR
process other non\-transcript GFF records (by default non\-transcript
records are ignored)
.TP
\fB\-V\fR
discard any mRNAs with CDS having in\-frame stop codons (requires \fB\-g\fR)
.TP
\fB\-H\fR
for \fB\-V\fR option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in\-frame stop codon
.TP
\fB\-B\fR
for \fB\-V\fR option, single\-exon transcripts are also checked on the
opposite strand (requires \fB\-g\fR)
.TP
\fB\-P\fR
add transcript level GFF attributes about the coding status of each
transcript, including partialness or in\-frame stop codons (requires \fB\-g\fR)
.HP
\fB\-\-add\-hasCDS\fR : add a "hasCDS" attribute with value "true" for transcripts
.IP
that have CDS features
.HP
\fB\-\-adj\-stop\fR stop codon adjustment: enables \fB\-P\fR and performs automatic
.IP
adjustment of the CDS stop coordinate if premature or downstream
.TP
\fB\-N\fR
discard multi\-exon mRNAs that have any intron with a non\-canonical
splice site consensus (i.e. not GT\-AG, GC\-AG or AT\-AC)
.TP
\fB\-J\fR
discard any mRNAs that lack either the initial START codon
or the terminal STOP codon, or that have an in\-frame stop codon
(i.e. only print mRNAs with a complete CDS)
.HP
\fB\-\-no\-pseudo\fR: filter out records matching the 'pseudo' keyword
.HP
\fB\-\-in\-bed\fR: input should be parsed as BED format (automatic if the input
.IP
filename ends with .bed*)
.HP
\fB\-\-in\-tlf\fR: input GFF\-like one\-line\-per\-transcript format without exon/CDS
.IP
features (see \fB\-\-tlf\fR option below); automatic if the input
filename ends with .tlf
.SS "Clustering:"
.HP
\fB\-M\fR/\-\-merge : cluster the input transcripts into loci, discarding
.IP
"duplicated" transcripts (those with the same exact introns
and fully contained or equal boundaries)
.HP
\fB\-d\fR <dupinfo> : for \fB\-M\fR option, write duplication info to file <dupinfo>
.HP
\fB\-\-cluster\-only\fR: same as \fB\-M\fR/\-\-merge but without discarding any of the
.IP
"duplicate" transcripts, only create "locus" features
.TP
\fB\-K\fR
for \fB\-M\fR option: also discard as redundant the shorter, fully contained
.IP
transcripts (intron chains matching a part of the container)
.TP
\fB\-Q\fR
for \fB\-M\fR option, no longer require boundary containment when assessing
redundancy (can be combined with \fB\-K\fR); only introns have to match for
multi\-exon transcripts, and >=80% overlap for single\-exon transcripts
.TP
\fB\-Y\fR
for \fB\-M\fR option, enforce \fB\-Q\fR but also discard overlapping single\-exon
transcripts, even on the opposite strand (can be combined with \fB\-K\fR)
.SS "Output options:"
.HP
\fB\-\-force\-exons\fR: make sure that the lowest level GFF features are considered
.IP
"exon" features
.HP
\fB\-\-gene2exon\fR: for single\-line genes not parenting any transcripts, add an
.IP
exon feature spanning the entire gene (treat it as a transcript)
.TP
\fB\-D\fR
decode URL\-encoded characters within attributes
.TP
\fB\-Z\fR
merge very close exons into a single exon (when intron size<4)
.TP
\fB\-g\fR
full path to a multi\-fasta file with the genomic sequences
for all input mappings, OR a directory with single\-fasta files
(one per genomic sequence, with file names matching sequence names)
.TP
\fB\-w\fR
write a fasta file with spliced exons for each GFF transcript
.TP
\fB\-x\fR
write a fasta file with spliced CDS for each GFF transcript
.TP
\fB\-y\fR
write a protein fasta file with the translation of CDS for each record
.TP
\fB\-W\fR
for \fB\-w\fR and \fB\-x\fR options, write in the FASTA defline the exon
coordinates projected onto the spliced sequence;
for \fB\-y\fR option, write transcript attributes in the FASTA defline
.TP
\fB\-S\fR
for \fB\-y\fR option, use '*' instead of '.' as stop codon translation
.TP
\fB\-L\fR
Ensembl GTF to GFF3 conversion (implies \fB\-F\fR; should be used with \fB\-m\fR)
.TP
\fB\-m\fR
<chr_replace> is a name mapping table for converting reference
sequence names, having this 2\-column format:
<original_ref_ID> <new_ref_ID>
WARNING: all GFF records on reference sequences whose original IDs
are not found in the 1st column of this table will be discarded!
.TP
\fB\-t\fR
use <trackname> in the 2nd column of each GFF/GTF output line
.TP
\fB\-o\fR
print the GFF records to <outfile.gff> (those that passed any
given filters). Use \fB\-o\-\fR to print to stdout
.TP
\fB\-T\fR
for \fB\-o\fR, output will be GTF instead of GFF3
.HP
\fB\-\-bed\fR for \fB\-o\fR, output BED format instead of GFF3
.HP
\fB\-\-tlf\fR for \fB\-o\fR, output "transcript line format" which is like GFF
.IP
but exons, CDS features and related data are stored as GFF
attributes in the transcript feature line, like this:
.IP
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
.IP
<exons> is a comma\-delimited list of exon_start\-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
.HP
\fB\-v\fR, \fB\-E\fR : expose (warn about) duplicate transcript IDs and other potential
.IP
problems with the given GFF/GTF records
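.SH EXAMPLES
The following invocations are illustrative sketches assembled only from the
options described above. They have not been checked against a particular
gffread version, and all file names (annotation.gff3, genome.fa and so on),
the sequence name chr1 and the coordinates are hypothetical placeholders.
.PP
Convert a GFF3 file to GTF:
.IP
.nf
gffread annotation.gff3 \-T \-o annotation.gtf
.fi
.PP
Write spliced exon, spliced CDS and protein FASTA files, using the genomic
sequences given with \fB\-g\fR:
.IP
.nf
gffread annotation.gff3 \-g genome.fa \-w transcripts.fa \-x cds.fa \-y proteins.fa
.fi
.PP
Keep only transcripts fully contained within a range on the plus strand of chr1:
.IP
.nf
gffread annotation.gff3 \-r +chr1:10000..50000 \-R \-o region.gff3
.fi
.PP
Cluster transcripts into loci, discard duplicates and record what was discarded:
.IP
.nf
gffread annotation.gff3 \-M \-d dupinfo.txt \-o merged.gff3
.fi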
.SH AUTHOR
This manpage was written by Andreas Tille for the Debian distribution and may also be used elsewhere.