1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>
The Sanger Centre : Gene-Finding Format - introduction and specification
</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TOPMARGIN="0">
<TABLE BORDER=0 WIDTH=100%>
<TR>
<TD ALIGN=LEFT>
<FONT FACE="Arial,Helvetica,sans-serif" SIZE="-1"> <I><A HREF="/Projects/release-policy.shtml">Data release policy</a>
and <A HREF="/Projects/use-policy.shtml">Guidelines and conditions on use of data</A></I></FONT>
</TD>
</TR>
</TABLE>
<TABLE BORDER=1 WIDTH=100%>
<TR>
<TD>
<TABLE BORDER="0" WIDTH="100%">
<TR>
<TD WIDTH="23" ALIGN=LEFT VALIGN=MIDDLE ROWSPAN=3>
<IMG WIDTH="23" HEIGHT="55" ALT="" BORDER="0" SRC="/header-icons/helix.gif">
</TD>
<TD ALIGN=CENTER VALIGN=TOP>
<A HREF=/><IMG WIDTH="236" HEIGHT="29" BORDER="0"
ALT="[The Sanger Centre]"
SRC="/header-icons/sanger-centre.gif"></A>
</TD>
<TD WIDTH="55" ALIGN=RIGHT VALIGN=MIDDLE ROWSPAN=3>
<IMG WIDTH="55" HEIGHT="55" ALT="" BORDER=0 SRC=/header-icons/sw.gif>
</TD>
</TR>
<TR>
<TD ALIGN=CENTER VALIGN=TOP NOWRAP>
<TT><FONT FACE=Arial,Helvetica,sans-serif SIZE=-1>
|
<A HREF=/Info/>Info</A>
|
<A HREF=/HGP/>HGP</A>
|
<A HREF=/Projects/>Projects</A>
|
<A HREF=/DataSearch/>Database Searches</A>
|
<A HREF=/Software/><B>Software</B></A>
|
<A HREF=/Teams/>Teams</A>
|
<A HREF=http://search.sanger.ac.uk>Search</A>
|
</FONT></TT>
</TD>
</TR>
<TR> <TD ALIGN=LEFT VALIGN=TOP NOWRAP>
<FONT FACE=Arial,Helvetica,sans-serif SIZE=-1><TT>
<A HREF=/><IMG WIDTH=11 HEIGHT=10 BORDER=0 HSPACE=0 ALIGN=TOP ALT="Home page" SRC=/icons/arrow.small.up.gif> Home</A>
<A HREF=/Software/><IMG WIDTH=11 HEIGHT=10 BORDER=0 HSPACE=0 ALIGN=TOP ALT="up to Software & Databases " SRC=/icons/arrow.small.left.gif> Software & Databases </A>
<A HREF=/Software/GFF/><IMG WIDTH=11 HEIGHT=10 BORDER=0 HSPACE=0 ALIGN=TOP ALT="up to GFF" SRC=/icons/arrow.small.left.gif> GFF</A>
</TT></FONT>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
<P>
<!-- open table cell holding the page content -->
<CENTER><TABLE BORDER="0" WIDTH="80%"><TR><TD ALIGN="LEFT" VALIGN="TOP">
<!-- page content starts here -->
<A NAME="TOC">
<H1 ALIGN="CENTER">GFF (Gene Finding Features) Specifications Document</H1>
<!-- INDEX BEGIN -->
<UL>
<LI><A HREF="#introduction">Introduction</A>
<LI><A HREF="#version_2_update">Version 2 GFF Update</A>
<LI><A HREF="#fields">Definition</A>
<UL>
<LI><A HREF="#standard_feature_table">Standard Table of Features</A>
<LI><A HREF="#group_field">Group Field</A>
<LI><A HREF="#comments">Comments</A>
<UL>
<LI><A HREF="#meta_info">Comments for Meta-Information</A>
</UL>
<LI><A HREF="#file_names">File Naming</A>
</UL>
<LI><A HREF="#semantics">Semantics</A>
<LI><A HREF="#GFF_use">Ways to use GFF</A>
<UL>
<LI><A HREF="#examples">Complex Examples</A>
<UL>
<LI><A HREF="#homology_feature">Similarities to Other Sequences</A>
</UL>
<LI><A HREF="#cum_score_array">Cumulative Score Arrays</A>
</UL>
<LI><A HREF="#mailing_list"> Mailing list</A>
<LI><A HREF="#edit_history">Edit History</A>
<LI><A HREF="#authors">Authors</A>
</UL>
<!-- INDEX END -->
<HR>
<A NAME="introduction"><h2>Introduction</h2></A>
<P>
Essentially all current approaches to gene finding in higher organisms
use a variety of recognition methods that give scores to likely
signals (starts, splice sites, stops etc.) or to extended regions
(exons, introns etc.), and then combine these to give complete gene
structures. Normally the combination step is done in the same program
as the feature detection, often using dynamic programming methods. We
would like to enable these processes to be decoupled, by proposing a
format called GFF (Gene-Finding Format) for the transfer of feature
information. It would then be possible to take features from an
outside source and add them in to an existing program, or in the
extreme to write a dynamic programming system which only took external
features.
<P>
In particular, establishing GFF would allow people to develop features
and have them tested without having to maintain a complete
gene-finding system. Equally, it would help those developing and
applying integrated gene-finding programs to test new feature
detectors developed by others, or even by themselves.
<P>
We want the GFF format to be easy to parse and process by a variety of
programs in different languages. e.g. it would be useful if Unix
tools like grep, sort and simple perl and awk scripts could easily
extract information out of the file. For these reasons, for the
primary format, we propose a record-based structure, where each
feature is described on a single line, and line order is not relevant.
<P>
We do not intend GFF format to be used for complete data management of
the analysis and annotation of genomic sequence. Systems such as
Acedb, Genotator etc. that have much richer data representation
semantics have been designed for that purpose. The disadvantages in
using their formats for data exchange (or other richer formats such as
ASN.1) are (1) they require more complexity in parsing/processing, (2)
there is little hope on achieving consensus on how to capture all
information. GFF is intentionally aiming for a low common
denominator. <P>
Here are some example records:
<pre>
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
</pre>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="version_2_update"><h2>Version 2 GFF Update</h2></A>
<P>
<FONT COLOR="#8F2020"><b>ALERT 98/12/16</b>: Following discussions with Lincoln Stein and others, we
propose the Version 2 format of GFF, as specifically described in
this document. The Version 2 specification has not yet been frozen and
is presented as a "work-in-progress" at this time, open to
user feedback on the proposed changes (plus other suggestions for improvement).
The main change from Version 1 to Version 2 is the requirement for a tag-value
type structure (essentially .ace format) for any additional material on the
line, following the mandatory fields. We also now
allow '.' as a score, for features for which there is no score. Dumping in version
2 format is implemented in ACEDB. Changes in the remainder of this
document are described and marked as (<b>Version 2 changes</b>).
</FONT>
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="fields"><h2>Definition</h2></A>
Fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [group]>[comments] <P>
<dl>
<dt><seqname>
<dd>The name of the sequence. Having an explicit sequence name
allows a feature file to be prepared for a data set of multiple
sequences. Normally the seqname will be the identifier of the
sequence in an accompanying fasta format file. An alternative is that
'seqname' is the identifier for a sequence in a public database, such
as an EMBL/Genbank/DDBJ accession number. Which is the case, and
which file or database to use, should be explained in accompanying
information.<P>
<dt><source>
<dd> The source of this feature. This field will normally be used to
indicate the program making the prediction, or if it comes from public
database annotation, or is experimentally verified, etc.<P>
<dt><feature>
<dd> The feature type name. We hope to suggest a standard set of
features, to facilitate import/export, comparison etc.. Of course,
people are free to define new ones as needed. For example, Genie
splice detectors account for a region of DNA, and multiple detectors
may be available for the same site, as shown above.<P>
<A name="standard_feature_table">
(<b>Version 2 change</b>: <u>Standard Table of Features</u> -
we would like to enforce a standard nomenclature for
common GFF features. This does not forbid the use of other features,
rather, just that if the feature is obviously described in the standard
list, that the standard label should be used. For this standard table
we propose to fall back on the international public standards for genomic
database feature annotation, specifically, the
<a href="http://www.ebi.ac.uk/ebi_docs/embl_db/ft/feature_key_ref.html">
DDBJ/EMBL/GenBank feature table</a>).<P>
<dt><start>, <end>
<dd> Integers. <start> must be less than or equal to
<end>. Sequence numbering starts at 1, so these numbers
should be between 1 and the length of the relevant sequence,
inclusive. (<b>Version 2 change</b>: version 2 condones values of
<start> and <end> that extend outside the
reference sequence. This is often more natural when dumping from
acedb, rather than clipping. It means that some software using the
files may need to clip for itself.)<P>
<dt><score>
<dd> A floating point value. When there is no score (i.e. for a
sensor that just records the possible presence of a signal, as for the
EMBL features above) you should use '.'. (<b>Version 2 change</b>: in
version 1 of GFF you had to write 0 in such circumstances.)<P>
<dt><strand>
<dd> One of '+', '-' or '.'. '.' should be used when
strand is not relevant, e.g. for dinucleotide repeats.<P>
<dt><frame>
<dd> One of '0', '1', '2' or '.'. '0' indicates that the specified
region is in frame, i.e. that its first base corresponds to the first
base of a codon. '1' indicates that there is one extra base,
i.e. that the second base of the region corresponds to the first base
of a codon, and '2' means that the third base of the region is the
first base of a codon. If the strand is '-', then the first base of
the region is value of <end>, because the corresponding
coding region will run from <end> to <start> on
the reverse strand. As with <strand>, if the frame is not
relevant then set <frame> to '.'.
It has been pointed out that "phase" might be a better descriptor than
"frame" for this field.<P>
<dt><A NAME="group_field">[group] </A>
<dd> An optional string-valued field that can be used as a name to
group together a set of records. Typical uses might be to group the
introns and exons in one gene prediction (or experimentally verified
gene structure), or to group multiple regions of match to another
sequence, such as an EST or a protein. See below for examples.<br>
<b>Version 2 change</b>: In version 2, the optional [group] field on the line
must have an tag value structure following the syntax used within
objects in a .ace file, flattened onto one line by semicolon
separators. Tags must be standard identifiers
([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double
quotes. <em>Note: all non-printing characters in such free text value strings
(e.g. newlines, tabs, control characters, etc)
must be explicitly represented by their C (UNIX) style backslash-escaped
representation (e.g. newlines as '\n', tabs as '\t').</em>
As in ACEDB, multiple values can follow a specific tag. The
aim is to establish consistent use of particular tags, corresponding
to an underlying implied ACEDB model if you want to think that way
(but acedb is not required). Examples of these would be:
<font size="3"><pre>
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"
</pre></font>
</dl>
All strings (i.e. values of the <seqname>,
<source> or <feature> fields) should be under 256
characters long, and should not include whitespace. The whole line
should be under 32k long. A character limit is not very desirable,
but helps write parsers in some languages. The slightly silly 32k
limit is to allow plenty of space for comments/extra data. <b>Version 2 change</b>:
field and line size limitations are removed; however, fields (except the optional
[group] field above) must still not include whitespace.
<P>
All of the above described fields should be separated by TAB characters ('\t').
<b>Version 2 note</b>: previous Version 2 permission to use arbitrary whitespace
as field delimiters is now <b>revoked</b>! (99/02/26)
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="comments"><h3> Comments </h3>
Comments are allowed, starting with "#" as in Perl, awk etc.
Everything following # until the end of the line is ignored.
Effectively this can be used in two ways. Either it must be at the
beginning of the line (after any whitespace), to make the whole line a
comment, or the comment could come after all the required fields on
the line.
<P>
We also permit extra information to be given on the line following the
group field without a '#' character (<b>Version 2 change</b>: this extra
information <B>must</B> be delimited by the '#' comment delimiter <B>OR</B>
by another tab field delimiter character, following
any and all [group] field tag-value pairs).
<P>
This allows extra method-specific information to be transferred with the line. However,
we discourage overuse of this feature: better to find a way to do it
with more true feature lines, and perhaps groups. (<b>Version 2
change</b>: we gave in and defined a structured way of passing
additional information, as described above under [group]. But the
sentiment of this paragraph still applies - don't overuse the
tag-value syntax. The use of tag-value pairs (with whitespace) renders problematic the parsing of
Version 1 style comments (following the group field, without a '#' character), so in Version 2,
such [group] trailing comments <B>must</B> start with the "#", as noted above.
<A NAME="meta_info"><h4> ## comment lines for meta information </h4>
There is a set of standardised (i.e. parsable) ## line types that can
be used optionally at the top of a gff file. The philosophy is a
little like the special set of %% lines at the top of postscript
files, used for example to give the BoundingBox for EPS files.<P>
Current proposed ## lines are:
<dl>
<dt><pre> ##gff-version 1 </pre>
<dd> GFF version - in case it is a real success and we want to
change it. The current version is 2. (<b>Version 2 change</b>!)
<dt><pre> ##source-version {source} {version text} </pre>
<dd> So that people can record what version of a program or package was
used to make the data in this file. I suggest the version is text
without whitespace. That allows things like 1.3, 4a etc.
<dt> <pre> ##date {date} </pre>
<dd> The date the file was made, or perhaps that the prediction
programs were run. We suggest to use astronomical format: 1997-11-08
for 8th November 1997, first because these sort properly, and second
to avoid any US/European bias.
<dt> <pre>
##DNA {seqname}
##acggctcggattggcgctggatgatagatcagacgac
##...
##end-DNA
</pre>
<dd> To give a DNA sequence. Several people have pointed out that it may
be convenient to include the sequence in the file. It should not
become mandatory to do so. Often the seqname will be a well-known
identifier, and the sequence can easily be retrieved from a
database, or an accompanying file.
<dt> <pre> ##sequence-region {seqname} {start} {end} </pre>
<dd> To indicate that this file only contains entries for the
specified subregion of a sequence.
</dl>
Please feel free to propose new ## lines.
The ## line proposal came out of some discussions including Anders
Krogh, David Haussler, people at the Newton Institute on 1997-10-29
and some email from Suzanna Lewis. Of course, naive programs can
ignore all of these...
<A NAME="file_names"><h3> File Naming </h3>
We propose that the format is called "GFF", with conventional file
name ending ".gff".
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="semantics"><h2> Semantics </h2>
We have intentionally avoided overspecifying the semantics of the
format. For example, we have not restricted the items expressible in
GFF to a specified set of feature types (splice sites, exons etc.)
with defined semantics. Therefore, in order for the information in a
gff file to be useful to somebody else, the person producing the
features must describe the meaning of the features. <P>
In the example given above the feature "splice5" indicates that there
is a candidate 5' splice site between positions 172 and 173. The
"sp5-20" feature is a prediction based on a window of 20 bp for the
same splice site. To use either of these, you must know the position
within the feature of the predicted splice site. This only needs to
be given once, possibly in comments at the head of the file, or in a
separate document. <P>
Another example is the scoring scheme; we ourselves would like the
score to be a log-odds likelihood score in bits to a defined null
model, but that is not required, because different methods take
different approaches.
Avoiding a prespecified feature set also leaves open the possibility
for GFF to be used for new feature types, such as CpG islands,
hypersensitive sites, promoter/enhancer elements, etc.
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="GFF_use"><h2> Ways to use GFF </h2>
Here are a few suggestions on how the GFF format might be used.
<ol>
<li> Simple sharing of sensors. In this case, researcher A has a sensor,
such as a 3' splice site sensor, and researcher B wants to test that
sensor. They agree on a set of sequences, researcher A runs the
sensor on these sequences and sends the resulting GFF file to
researher B, who then evaluates the result.<P>
<li> Representing experimental results. GFF feature records can also
be created for experimentally confirmed exons and other features. In
these cases there will presumably be no score. Such "confirmed" GFF
files will be useful for evaluating predictions, using the same
software as you would to compare predictions.<P>
<li> Integrated gene parsing. Several GFF files from different
researchers can be combined to provide the features used by an
integrated genefinder. As mentioned above, this has the advantage
that different combinations of sensors and dynamic programming methods
for assembling sensor scores into consistent gene parses can be easily
explored.<P>
<li> Reporting final predictions. GFF format can also be used to
communicate finished gene predictions. One simply reports final
predicted exons and other predicted gene features, either with their
original scores. or with some sort of posterior scores, rather than,
or in addition to, reporting all candidate gene features with their
scores. To show that a set of the components belong to a single
prediction, a "group" field can be added to all the accepted sites.
This is useful for comparing the outputs of several integrated
genefinders among themselves, and to "confirmed" GFF files. A
particular advantage of having the same format for both raw sensor
feature score files and final gene parse files is that one can easily
explore the possibility of combining the final gene parses from
several different genefinders, using another round of dynamic
programming, into a single integrated predicted parse.<P>
<li> Visualisation. GFF will also provide a simple standard format for
standardising input to visualisation programs, showing predicted and
experimentally determined features, gene structures etc.
</ol>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="examples"><h3> Complex Examples</h3>
<A NAME="homology_feature">
<h4> Similarities to Other Sequences </h4>
A major source of information about a sequence comes from similarities
to other sequences. For example, BLAST hits to protein sequences help
identify potential coding regions. We can represent these as a set of
"homology gene features", grouping hits to the same target as follows:
<font size="3"><pre>
seq1 BLASTX similarity 101 136 87.1 + 0 HBA_HUMAN
seq1 BLASTX similarity 107 133 72.4 + 0 HBB_HUMAN
seq1 BLASTX similarity 290 343 67.1 + 0 HBA_HUMAN
</pre></font>
If further information is needed about where in the target protein
each match occurs, it can be given after the protein name, e.g.
as the start coordinate in the target.
<P>
<b>Version 2 change</b>: In version 2 this has been formalised using
the tag Target which expects to be followed by the name of the target,
followed (optionally) by start and end point in the target as
integers, as in
<font size="3"><pre>
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
</pre></font>
We need to finalise on a tag model for gapped alignments...
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="cum_score_array"><h3> Cumulative Score Arrays </h3>
One issue that comes up with a record-based format such as the GFF
format is how to cope with large numbers of overlapping segments. For
example, in a long sequence, if one tries to include a separate record
giving the score of every candidate exon, where a candidate exon is
defined as a segment of the sequence that begins and ends at candidate
splice sites and consists of an open reading frame in between, then
one can have an infeasibly large number of records. The problem is
that there can be a huge number of highly overlapping exon
candidates. <P>
Let us assume that the score of an exon can be decomposed into three
parts: the score of the 5' splice site, the score of the 3' splice
site, and the sum of the scores of all the codons in between. In such
a case it can be much more efficient to use the GFF format to report
separate scores for the splice site sensors and for the individual
codons in all three (or six, including reverse strand) frames, and let
the program that interprets this file assemble the exon scores. The
exon scores can be calculated efficiently by first creating three
arrays, each of which contains in its [i]th position a value A[i] that
is the partial sum of the codon scores in a particular frame for the
entire sequence from position 1 up to position i. Then for any
positions i < j, the sum of the scores of all codons from i to j can
be obtained as A[j] - A[i]. Using these arrays, along with the
candidate splice site scores, a very large number of scores for
overlapping exons are implicitly defined in a data structure that
takes only linear space with respect to the number of positions in the
sequence, and such that the score for each exon can be retrieved in
constant time. <P>
When the GFF format is used to transmit scores that can be summed for
efficient retrieval as in the case of the codon scores above, we ask
that the provider of the scores indicate that these scores are
summable in this manner, and provide a recipe for calculating the
scores that are to be derived from these summable scores, such as the
exon scores described above. We place no limit on the complexity of
this recipe, nor do we provide a standard protocol for such assembly,
other than providing examples. It behooves the sensor score provider
to keep the recipe simple enough that others can easily implement it.
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="mailing_list"><h2> Mailing list </h2>
<P>
There is a <A HREF="mailto:gff-list@sanger.ac.uk"> mailing list </a>
to which you can send comments, enquiries, complaints etc. about GFF.
If you want to be added to the mailing list, please send
mail to <A HREF="mailto:Majordomo@sanger.ac.uk">Majordomo@sanger.ac.uk</A> with the
following command in the body of your email message:
<P>
<code>
subscribe gff-list
</code>
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="edit_history"><h2>Edit History</h2></A>
<P>
971028 rd: I changed the comment initiator to '#' from '//' because a
single symbol is easier for simple parsers.<P>
971028 rd: We also now allow extra text after <group>
without a comment character, because this immediately proved useful.<P>
971028 rd: I considered switching from start-end notation to
start-length notation, on the suggestion of Anders Krogh. This seems
nicer in many cases, but is a debatable point. I then switched back!<P>
971028 rd: I added the section about name space.<P>
971108 rd: added ## line proposals - moved them into main text 971113.<P>
971113 rd: added extra "source" field as discussed at Newton Institute
meeting 971029. There are two main reasons. First, to help prevent
name space clashes -- each program would have their own source
designation. Second, to help reuse feature names, so one could have
"exon" for exon predictions from each prediction program.<P>
971113 rd: added section on mailing list.<P>
980909 ihh: fixed some small things and put this page on the Sanger
GFF site.<P>
981216 rd: introduced version 2 changes.<P>
990226 rbsk: incorporated amendments to the version 2 specification as follows:<P>
<UL>
<LI>Non-printing characters (e.g. newlines, tabs) in Version 2 double quoted
"free text values" must be explicitly represented by their C (UNIX) style
backslash escaped character (i.e. '\t' for tabs, '\n' for newlines, etc.)<br>
<LI>Removed field (256) and line (32K) character size limitations for Version 2.
<LI>Removed arbitrary whitespace field delimiter permission from specification.
TAB ('\t') field delimiters now enforced again, as in Version 1.<br>
</UL>
990317 rbsk:
<UL>
<LI>End of line comments following Version 2 [group] field tag-value structures must be
tab '\t' or hash '#' delimited.
</UL>
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="authors"><h2>Authors</h2></A>
<P>
GFF Protocol Specification initially proposed by:
<A HREF="mailto:rd@sanger.ac.uk">Richard Durbin</a> and
<A HREF="mailto:haussler@cse.ucsc.edu">David Haussler</a>
<P>with amendments proposed by:
<A HREF="mailto:lstein@cshl.org">Lincoln Stein</a>, Anders Krogh and others.
<P>The GFF specification now maintained at the Sanger Centre by
<A HREF="mailto:rbsk@sanger.ac.uk">Richard Bruskiewich</a>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<!-- page content ends here -->
</TD></TR></TABLE></CENTER> <!-- close table for page content -->
<HR ALIGN="CENTER" WIDTH="90%">
<!-- open table for page footer -->
<TABLE BORDER="0" WIDTH="100%">
<TR>
<TD ALIGN=LEFT>
<I>
last modified : 25-Mar-1999, 01:59 PM
</I>
</TD>
<TD ALIGN=RIGHT>
<A HREF=/Users/rbsk/>Richard Bruskiewich</A>
<I>(<A HREF=mailto:rbsk@sanger.ac.uk>rbsk@sanger.ac.uk</A>)</I>
</TD>
</TR>
</TABLE> <!-- close table for page footer -->
</BODY>
</HTML>
|