.TH CAF 5
.SH NAME
CAF - common assembly format
.SH DESCRIPTION
.PP
CAF is an interchange format for DNA sequence assemblies.  It can store
DNA sequence data, base quality estimates and an extensive description of the
reads and contigs in the assembly.
.PP
CAF is intended to be sufficiently comprehensive that any assembly engine or
editor, such as Phrap, Consed, Gap, Acembly or cap3, can derive all the
information it needs from the CAF file without reading any other data (except
for trace information, which is still held in SCF files). The eventual aim is
to convert freely between CAF and any other format so that different assembly
programs can be combined. This aim is still some way off because of
incompatibilities between these programs. Currently it is possible to convert
to and from Phrap and cap3, and into GAP (the reverse conversion is possible
with some loss of information).
.PP
In CAF files, comments are any text preceded by //.
.PP
CAF supports three object types: Sequence, DNA and BaseQuality. The Sequence
type is the most complex. All base coordinates start at position 1 (NOT 0).
Only the important Sequence attributes are described here (others are
available in the CAF acedb model but are not currently used by any processing
module).
.SS Anatomy of a CAF file
.PP
CAF files consist of one or more objects.  Each object is separated by at
least one blank line.  The CAF format has three types of object: DNA,
Sequence and BaseQuality.
.PP
A typical object looks like this:
.EX
Object_type : "Object_name"
Data...
.EE
.PP
The format of the data depends on the type of the object.  For DNA objects,
it is the base calls for the read or contig represented by  the object.  For
BaseQuality, it is a series of quality values separated by spaces.  For
Sequence objects, it is a series of tags with optional data, one per line.
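.PP
The layout described above can be sketched as a small splitter. This is an
illustration only, not part of caftools: the names strip_comment and
split_objects are hypothetical, and the code assumes that a // inside double
quotes is not a comment and that comment-only lines do not end an object.

```python
import re

# Header line of an object: Object_type : "Object_name"
HEADER = re.compile(r'^(Sequence|DNA|BaseQuality)\s*:\s*"?([^"]+?)"?\s*$')


def strip_comment(line):
    """Drop a trailing // comment (assumption: // inside quotes is kept)."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "/" and not in_quotes and line[i:i + 2] == "//":
            return line[:i].rstrip()
    return line.rstrip()


def split_objects(text):
    """Return a list of (object_type, name, data_lines) tuples."""
    objects = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            current = None          # a blank line ends the current object
            continue
        line = strip_comment(raw).strip()
        if not line:
            continue                # comment-only line, ignored
        if current is None:
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"expected an object header, got: {line!r}")
            current = (match.group(1), match.group(2), [])
            objects.append(current)
        else:
            current[2].append(line)
    return objects
```

Each returned tuple holds the object type, the object name and the remaining
data lines with comments removed, ready for type-specific handling.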
.PP
A description of each variety of sequence tag follows.
.SS Sequence object
Sequence objects come in a variety of flavours - read, contig, group and
assembly.
.SS A read sequence object
.EX
Sequence : "Readname"           // Name of the Sequence
Is_read                         // Defines this as a read Sequence
Padded | Unpadded               // Padding state
ProcessStatus "State" "Text"    // Asp pass or failure, with reason
                                // "State" can be:
                                //   PASS,
                                //   SVEC (entirely sequencing vector),
                                //   QUAL (poor trace quality),
                                //   CONT (contaminant, e.g. E.coli)
Asped "Date"                    // When the read was processed
Dye Dye_terminator | Dye_primer // Sequencing chemistry
SCF_File "Filename"             // Trace file name
Primer Unknown_primer | Universal_primer | Custom "Oligo"
                                // Primer type
                                // Including custom primer sequence if known
Template "Template"             // Template (a.k.a. subclone) name
Insert_size x1 x2               // Predicted range [x1, x2] of insert size
Ligation_no "Text"              // Ligation (a.k.a. sequencing library) name
Strand Forward | Reverse        // Read strand
Seq_vec SVEC x1 x2 "Text"       // Mark bases x1 to x2 as sequencing vector
Clone_vec CVEC x1 x2 "Text"     // Mark bases x1 to x2 as cloning vector
Clipping "Type" x1 x2 "Text"    // Mark bases x1 to x2 as good quality
Tag "Type" x1 x2 "Text"         // General tag from base x1 to base x2
Align_to_SCF r1 r2 t1 t2        // Alignment of the read sequence to the
                                // original trace file.  Region [r1, r2]
                                // in the DNA sequence aligns to [t1, t2]
                                // in the trace.
.EE
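.PP
As a sketch of how the read tags might be consumed, the following splits a tag
line into its fields and extracts the bases inside a Clipping interval. The
function names are hypothetical, and the code assumes that [x1, x2] is 1-based
and inclusive and that quoted fields follow shell-like quoting.

```python
import shlex


def parse_tag_line(line):
    """Split one Sequence data line into (tag, list_of_arguments)."""
    tokens = shlex.split(line)      # keeps "quoted text" as a single token
    return tokens[0], tokens[1:]


def clipped_bases(dna, tag_lines):
    """Return the bases inside the first Clipping interval, else all of dna."""
    for line in tag_lines:
        tag, args = parse_tag_line(line)
        if tag == "Clipping":
            # args are: type, x1, x2 and optional text
            x1, x2 = int(args[1]), int(args[2])
            return dna[x1 - 1:x2]   # 1-based inclusive -> Python slice
    return dna
```

The subtraction of 1 from x1 only reflects the rule that CAF coordinates start
at 1 while Python slices start at 0.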
.SS A contig sequence object
.EX
Sequence : "Contig_name"        // Name of the Sequence
Is_contig                       // Defines this as a contig Sequence
Padded | Unpadded               // Padding state
Tag "Type" x1 x2 "Text"         // General tag from base x1 to base x2
Assembled_from "Read" s1 s2 r1 r2
                                // Contigs only:
                                // Alignment of Read to contig.
                                // Interval [r1, r2] in the read aligns with
                                // [s1, s2] in the contig.
                                // If s1 > s2 then align the reverse
                                // complement of [r1, r2] with [s1, s2].
.EE
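.PP
The Assembled_from coordinates can be illustrated with a small mapping
function. This is a sketch under stated assumptions: the segment is taken to
be ungapped (both intervals have the same length), r1 is taken to pair with
s1, and the function names are hypothetical.

```python
# Complement table; pads (-) are left unchanged by translate().
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_to_contig(pos, s1, s2, r1, r2):
    """Map a 1-based read position in [r1, r2] to its contig position."""
    if not r1 <= pos <= r2:
        raise ValueError("position lies outside the aligned read interval")
    if abs(s2 - s1) != r2 - r1:
        raise ValueError("read and contig intervals differ in length")
    offset = pos - r1
    # s1 > s2 means the read is placed as its reverse complement,
    # so contig positions decrease as read positions increase.
    return s1 + offset if s1 <= s2 else s1 - offset
```

For example, with Assembled_from "Read" 149 100 1 50, read base 1 lands on
contig base 149 and read base 50 on contig base 100.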
.SS An assembly sequence object
.EX
Sequence : "Name"               // Name of the assembly
Is_assembly
Group_order  Group p1           // Defines Group to be at group position p1
                                // within the assembly
.EE
.SS A contig group sequence object
.EX
Sequence : "Name"               // Name of the group
Is_group
Contig_order Contig q1          // Groups only: Defines Contig to be at
                                // relative position q1 within group
.EE
.SS DNA object
DNA objects contain the basecalls for the corresponding read or contig
Sequence object.
.EX
DNA : "Name"                        // Name of sequence
ACGTGCGG......                      // The sequence: Use ACGT, N for unknowns,
                                    // - for pads.
.EE
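.PP
As an illustration of the pad character, the sketch below checks the alphabet
of a DNA string, removes pads, and converts a padded position to an unpadded
one. The function names are hypothetical, and accepting lower-case base calls
is an assumption that the format description does not state.

```python
ALLOWED = set("ACGTN-")


def unpad(dna):
    """Validate the base calls and return the sequence without pads."""
    bad = set(dna.upper()) - ALLOWED
    if bad:
        raise ValueError(f"unexpected characters in DNA: {sorted(bad)}")
    return dna.replace("-", "")


def padded_to_unpadded(dna, pos):
    """Map a 1-based padded position to a 1-based unpadded position.

    Returns None when the padded position is itself a pad.
    """
    if dna[pos - 1] == "-":
        return None
    return pos - dna[:pos].count("-")
```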
.SS BaseQuality object
BaseQuality objects contain the quality values, one per base, for the
corresponding DNA object.
.EX
BaseQuality : "Name"                // Name of sequence
0 12 13 90 ...                      // Base qualities. These must be
                                    // integers between 0 and 99 inclusive.
                                    // If the BaseQuality object is present it
                                    // must be the same length as the DNA.
.EE
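.PP
The two constraints on BaseQuality data can be checked with a few lines. This
is a minimal sketch with a hypothetical function name; it assumes the values
may be spread over several whitespace-separated lines.

```python
def parse_base_quality(data_lines, dna_length=None):
    """Parse quality values and check the range and length rules."""
    values = [int(tok) for line in data_lines for tok in line.split()]
    for value in values:
        if not 0 <= value <= 99:
            raise ValueError(f"base quality out of range 0..99: {value}")
    if dna_length is not None and len(values) != dna_length:
        raise ValueError(
            f"{len(values)} quality values for {dna_length} bases"
        )
    return values
```

Passing the length of the matching DNA object as dna_length enables the length
check; leaving it as None checks only the value range.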