Data release policy and Guidelines and conditions on use of data


	`\| Info \| HGP \| Projects \| Database Searches \| Software \| Teams \| Search \|`
	`Home Software & Databases GFF`

GFF (Gene Finding Features) Specifications Document

Introduction
Version 2 GFF Update
Definition
Semantics
Ways to use GFF
- Complex Examples
  - Similarities to Other Sequences
- Cumulative Score Arrays
Mailing list
Edit History
Authors

Introduction

Essentially all current approaches to gene finding in higher organisms use a variety of recognition methods that give scores to likely signals (starts, splice sites, stops etc.) or to extended regions (exons, introns etc.), and then combine these to give complete gene structures. Normally the combination step is done in the same program as the feature detection, often using dynamic programming methods. We would like to enable these processes to be decoupled, by proposing a format called GFF (Gene-Finding Format) for the transfer of feature information. It would then be possible to take features from an outside source and add them in to an existing program, or in the extreme to write a dynamic programming system which only took external features.

In particular, establishing GFF would allow people to develop features and have them tested without having to maintain a complete gene-finding system. Equally, it would help those developing and applying integrated gene-finding programs to test new feature detectors developed by others, or even by themselves.

We want the GFF format to be easy to parse and process by a variety of programs in different languages. e.g. it would be useful if Unix tools like grep, sort and simple perl and awk scripts could easily extract information out of the file. For these reasons, for the primary format, we propose a record-based structure, where each feature is described on a single line, and line order is not relevant.

We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence. Systems such as Acedb, Genotator etc. that have much richer data representation semantics have been designed for that purpose. The disadvantages in using their formats for data exchange (or other richer formats such as ASN.1) are (1) they require more complexity in parsing/processing, (2) there is little hope on achieving consensus on how to capture all information. GFF is intentionally aiming for a low common denominator.

Here are some example records:

SEQ1	EMBL	atg	103	105	.	+	0
SEQ1	EMBL	exon	103	172	.	+	0
SEQ1	EMBL	splice5	172	173	.	+	.
SEQ1	netgene	splice5	172	173	0.94	+	.
SEQ1	genie	sp5-20	163	182	2.3	+	.
SEQ1	genie	sp5-10	168	177	2.1	+	.
SEQ2	grail	ATG	17	19	2.1	-	0

Back to Table of Contents

Version 2 GFF Update

ALERT 98/12/16: Following discussions with Lincoln Stein and others, we propose the Version 2 format of GFF, as specifically described in this document. The Version 2 specification has not yet been frozen and is presented as a "work-in-progress" at this time, open to user feedback on the proposed changes (plus other suggestions for improvement). The main change from Version 1 to Version 2 is the requirement for a tag-value type structure (essentially .ace format) for any additional material on the line, following the mandatory fields. We also now allow '.' as a score, for features for which there is no score. Dumping in version 2 format is implemented in ACEDB. Changes in the remainder of this document are described and marked as (Version 2 changes).

Back to Table of Contents

Definition

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [group]>[comments]

<seqname>

The name of the sequence. Having an explicit sequence name allows a feature file to be prepared for a data set of multiple sequences. Normally the seqname will be the identifier of the sequence in an accompanying fasta format file. An alternative is that 'seqname' is the identifier for a sequence in a public database, such as an EMBL/Genbank/DDBJ accession number. Which is the case, and which file or database to use, should be explained in accompanying information.

<source>

The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.

<feature>

The feature type name. We hope to suggest a standard set of features, to facilitate import/export, comparison etc.. Of course, people are free to define new ones as needed. For example, Genie splice detectors account for a region of DNA, and multiple detectors may be available for the same site, as shown above.

(Version 2 change: Standard Table of Features - we would like to enforce a standard nomenclature for common GFF features. This does not forbid the use of other features, rather, just that if the feature is obviously described in the standard list, that the standard label should be used. For this standard table we propose to fall back on the international public standards for genomic database feature annotation, specifically, the DDBJ/EMBL/GenBank feature table).

<start>, <end>

Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive. (Version 2 change: version 2 condones values of <start> and <end> that extend outside the reference sequence. This is often more natural when dumping from acedb, rather than clipping. It means that some software using the files may need to clip for itself.)

<score>

A floating point value. When there is no score (i.e. for a sensor that just records the possible presence of a signal, as for the EMBL features above) you should use '.'. (Version 2 change: in version 1 of GFF you had to write 0 in such circumstances.)

<strand>

One of '+', '-' or '.'. '.' should be used when strand is not relevant, e.g. for dinucleotide repeats.

<frame>

One of '0', '1', '2' or '.'. '0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand. As with <strand>, if the frame is not relevant then set <frame> to '.'. It has been pointed out that "phase" might be a better descriptor than "frame" for this field.

[group]

An optional string-valued field that can be used as a name to group together a set of records. Typical uses might be to group the introns and exons in one gene prediction (or experimentally verified gene structure), or to group multiple regions of match to another sequence, such as an EST or a protein. See below for examples.
Version 2 change: In version 2, the optional [group] field on the line must have an tag value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required). Examples of these would be:

seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

All strings (i.e. values of the <seqname>, <source> or <feature> fields) should be under 256 characters long, and should not include whitespace. The whole line should be under 32k long. A character limit is not very desirable, but helps write parsers in some languages. The slightly silly 32k limit is to allow plenty of space for comments/extra data. Version 2 change: field and line size limitations are removed; however, fields (except the optional [group] field above) must still not include whitespace.

All of the above described fields should be separated by TAB characters ('\t'). Version 2 note: previous Version 2 permission to use arbitrary whitespace as field delimiters is now revoked! (99/02/26)

Back to Table of Contents

Comments

Comments are allowed, starting with "#" as in Perl, awk etc. Everything following # until the end of the line is ignored. Effectively this can be used in two ways. Either it must be at the beginning of the line (after any whitespace), to make the whole line a comment, or the comment could come after all the required fields on the line.

We also permit extra information to be given on the line following the group field without a '#' character (Version 2 change: this extra information must be delimited by the '#' comment delimiter OR by another tab field delimiter character, following any and all [group] field tag-value pairs).

This allows extra method-specific information to be transferred with the line. However, we discourage overuse of this feature: better to find a way to do it with more true feature lines, and perhaps groups. (Version 2 change: we gave in and defined a structured way of passing additional information, as described above under [group]. But the sentiment of this paragraph still applies - don't overuse the tag-value syntax. The use of tag-value pairs (with whitespace) renders problematic the parsing of Version 1 style comments (following the group field, without a '#' character), so in Version 2, such [group] trailing comments must start with the "#", as noted above.

## comment lines for meta information

There is a set of standardised (i.e. parsable) ## line types that can be used optionally at the top of a gff file. The philosophy is a little like the special set of %% lines at the top of postscript files, used for example to give the BoundingBox for EPS files.

Current proposed ## lines are:

##gff-version 1: GFF version - in case it is a real success and we want to change it. The current version is 2. (Version 2 change!)
##source-version {source} {version text}: So that people can record what version of a program or package was used to make the data in this file. I suggest the version is text without whitespace. That allows things like 1.3, 4a etc.
##date {date}: The date the file was made, or perhaps that the prediction programs were run. We suggest to use astronomical format: 1997-11-08 for 8th November 1997, first because these sort properly, and second to avoid any US/European bias.
##DNA {seqname} ##acggctcggattggcgctggatgatagatcagacgac ##... ##end-DNA: To give a DNA sequence. Several people have pointed out that it may be convenient to include the sequence in the file. It should not become mandatory to do so. Often the seqname will be a well-known identifier, and the sequence can easily be retrieved from a database, or an accompanying file.
##sequence-region {seqname} {start} {end}: To indicate that this file only contains entries for the specified subregion of a sequence.

Please feel free to propose new ## lines. The ## line proposal came out of some discussions including Anders Krogh, David Haussler, people at the Newton Institute on 1997-10-29 and some email from Suzanna Lewis. Of course, naive programs can ignore all of these...

File Naming

We propose that the format is called "GFF", with conventional file name ending ".gff".

Back to Table of Contents

Semantics

We have intentionally avoided overspecifying the semantics of the format. For example, we have not restricted the items expressible in GFF to a specified set of feature types (splice sites, exons etc.) with defined semantics. Therefore, in order for the information in a gff file to be useful to somebody else, the person producing the features must describe the meaning of the features.

In the example given above the feature "splice5" indicates that there is a candidate 5' splice site between positions 172 and 173. The "sp5-20" feature is a prediction based on a window of 20 bp for the same splice site. To use either of these, you must know the position within the feature of the predicted splice site. This only needs to be given once, possibly in comments at the head of the file, or in a separate document.

Another example is the scoring scheme; we ourselves would like the score to be a log-odds likelihood score in bits to a defined null model, but that is not required, because different methods take different approaches. Avoiding a prespecified feature set also leaves open the possibility for GFF to be used for new feature types, such as CpG islands, hypersensitive sites, promoter/enhancer elements, etc.

Back to Table of Contents

Ways to use GFF

Here are a few suggestions on how the GFF format might be used.

Simple sharing of sensors. In this case, researcher A has a sensor, such as a 3' splice site sensor, and researcher B wants to test that sensor. They agree on a set of sequences, researcher A runs the sensor on these sequences and sends the resulting GFF file to researher B, who then evaluates the result.
Representing experimental results. GFF feature records can also be created for experimentally confirmed exons and other features. In these cases there will presumably be no score. Such "confirmed" GFF files will be useful for evaluating predictions, using the same software as you would to compare predictions.
Integrated gene parsing. Several GFF files from different researchers can be combined to provide the features used by an integrated genefinder. As mentioned above, this has the advantage that different combinations of sensors and dynamic programming methods for assembling sensor scores into consistent gene parses can be easily explored.
Reporting final predictions. GFF format can also be used to communicate finished gene predictions. One simply reports final predicted exons and other predicted gene features, either with their original scores. or with some sort of posterior scores, rather than, or in addition to, reporting all candidate gene features with their scores. To show that a set of the components belong to a single prediction, a "group" field can be added to all the accepted sites. This is useful for comparing the outputs of several integrated genefinders among themselves, and to "confirmed" GFF files. A particular advantage of having the same format for both raw sensor feature score files and final gene parse files is that one can easily explore the possibility of combining the final gene parses from several different genefinders, using another round of dynamic programming, into a single integrated predicted parse.
Visualisation. GFF will also provide a simple standard format for standardising input to visualisation programs, showing predicted and experimentally determined features, gene structures etc.

Back to Table of Contents

Complex Examples

Similarities to Other Sequences

A major source of information about a sequence comes from similarities to other sequences. For example, BLAST hits to protein sequences help identify potential coding regions. We can represent these as a set of "homology gene features", grouping hits to the same target as follows:

seq1	BLASTX	similarity	101	136	87.1	+	0	HBA_HUMAN
seq1	BLASTX	similarity	107	133	72.4	+	0	HBB_HUMAN
seq1	BLASTX	similarity	290	343	67.1	+	0	HBA_HUMAN

If further information is needed about where in the target protein each match occurs, it can be given after the protein name, e.g. as the start coordinate in the target.

Version 2 change: In version 2 this has been formalised using the tag Target which expects to be followed by the name of the target, followed (optionally) by start and end point in the target as integers, as in

seq1 BLASTX similarity 101 235 87.1 + 0    Target "HBA_HUMAN" 11 55 ; E_value 0.0003

We need to finalise on a tag model for gapped alignments...

Back to Table of Contents

Cumulative Score Arrays

One issue that comes up with a record-based format such as the GFF format is how to cope with large numbers of overlapping segments. For example, in a long sequence, if one tries to include a separate record giving the score of every candidate exon, where a candidate exon is defined as a segment of the sequence that begins and ends at candidate splice sites and consists of an open reading frame in between, then one can have an infeasibly large number of records. The problem is that there can be a huge number of highly overlapping exon candidates.

Let us assume that the score of an exon can be decomposed into three parts: the score of the 5' splice site, the score of the 3' splice site, and the sum of the scores of all the codons in between. In such a case it can be much more efficient to use the GFF format to report separate scores for the splice site sensors and for the individual codons in all three (or six, including reverse strand) frames, and let the program that interprets this file assemble the exon scores. The exon scores can be calculated efficiently by first creating three arrays, each of which contains in its [i]th position a value A[i] that is the partial sum of the codon scores in a particular frame for the entire sequence from position 1 up to position i. Then for any positions i < j, the sum of the scores of all codons from i to j can be obtained as A[j] - A[i]. Using these arrays, along with the candidate splice site scores, a very large number of scores for overlapping exons are implicitly defined in a data structure that takes only linear space with respect to the number of positions in the sequence, and such that the score for each exon can be retrieved in constant time.

When the GFF format is used to transmit scores that can be summed for efficient retrieval as in the case of the codon scores above, we ask that the provider of the scores indicate that these scores are summable in this manner, and provide a recipe for calculating the scores that are to be derived from these summable scores, such as the exon scores described above. We place no limit on the complexity of this recipe, nor do we provide a standard protocol for such assembly, other than providing examples. It behooves the sensor score provider to keep the recipe simple enough that others can easily implement it.

Back to Table of Contents

Mailing list

There is a mailing list to which you can send comments, enquiries, complaints etc. about GFF. If you want to be added to the mailing list, please send mail to Majordomo@sanger.ac.uk with the following command in the body of your email message:

subscribe gff-list

Back to Table of Contents

Edit History

971028 rd: I changed the comment initiator to '#' from '//' because a single symbol is easier for simple parsers.

971028 rd: We also now allow extra text after <group> without a comment character, because this immediately proved useful.

971028 rd: I considered switching from start-end notation to start-length notation, on the suggestion of Anders Krogh. This seems nicer in many cases, but is a debatable point. I then switched back!

971028 rd: I added the section about name space.

971108 rd: added ## line proposals - moved them into main text 971113.

971113 rd: added extra "source" field as discussed at Newton Institute meeting 971029. There are two main reasons. First, to help prevent name space clashes -- each program would have their own source designation. Second, to help reuse feature names, so one could have "exon" for exon predictions from each prediction program.

971113 rd: added section on mailing list.

980909 ihh: fixed some small things and put this page on the Sanger GFF site.

981216 rd: introduced version 2 changes.

990226 rbsk: incorporated amendments to the version 2 specification as follows:

Non-printing characters (e.g. newlines, tabs) in Version 2 double quoted "free text values" must be explicitly represented by their C (UNIX) style backslash escaped character (i.e. '\t' for tabs, '\n' for newlines, etc.)
Removed field (256) and line (32K) character size limitations for Version 2.
Removed arbitrary whitespace field delimiter permission from specification. TAB ('\t') field delimiters now enforced again, as in Version 1.

990317 rbsk:

End of line comments following Version 2 [group] field tag-value structures must be tab '\t' or hash '#' delimited.

Back to Table of Contents

Authors

GFF Protocol Specification initially proposed by: Richard Durbin and David Haussler

with amendments proposed by: Lincoln Stein, Anders Krogh and others.

The GFF specification now maintained at the Sanger Centre by Richard Bruskiewich

Back to Table of Contents

last modified : 25-Mar-1999, 01:59 PM

Richard Bruskiewich (rbsk@sanger.ac.uk)