File: README.txt

package info (click to toggle)
ruby-bio 2.0.6-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 7,108 kB
sloc: ruby: 68,331; perl: 13; makefile: 11; sh: 1
file content (109 lines) | stat: -rw-r--r-- 4,951 bytes
parent folder | download | duplicates (25)
This README file describes the FASTQ example files provided as supplementary
information to the open-access publication:

P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer and P.M. Rice (2009). The Sanger
FASTQ file format for sequences with quality scores, and the Solexa/Illumina
FASTQ variants.

These files are provided freely and we encourage anyone writing a FASTQ parser
to use them as part of your test suite. Permission is granted to freely
distribute and modify the files. We request (but do not insist) that this
README file is included, or at least a reference to the above paper. Please
cite the above paper if appropriate. We also request (but do not insist) that
the example files are not modified, in order that they may serve as a common
reference.

Invalid FASTQ files
===================

The archive contains the following sample FASTQ files with names of the form
error_NAME.fastq, which all contain errors and should be rejected (if parsed
as any of the three FASTQ variants):

error_diff_ids.fastq
error_double_qual.fastq
error_double_seq.fastq
error_long_qual.fastq
error_no_qual.fastq
error_qual_del.fastq
error_qual_escape.fastq
error_qual_null.fastq
error_qual_space.fastq
error_qual_tab.fastq
error_qual_unit_sep.fastq
error_qual_vtab.fastq
error_short_qual.fastq
error_spaces.fastq
error_tabs.fastq
error_trunc_at_seq.fastq
error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_in_title.fastq
error_trunc_in_seq.fastq
error_trunc_in_plus.fastq
error_trunc_in_qual.fastq

Of these, those with names error_qual_XXX.fastq would be valid except for the
inclusion of spaces or non-printing ASCII characters outside the range allowed
in the quality string. The files named error_trunc_XXX.fastq would be valid
but for being truncated (e.g. simulating a partial copy over the network).

The special cases of FASTQ files which would be valid as one variant, but not
another, are covered below.

Valid FASTQ
===========

The archive contains the following valid sample FASTQ input files for testing:

longreads_original_sanger.fastq
wrapping_original_sanger.fastq
illumina_full_range_original_illumina.fastq
sanger_full_range_original_sanger.fastq
solexa_full_range_original_solexa.fastq
misc_dna_original_sanger.fastq
misc_rna_original_sanger.fastq

These all have the form NAME_original_FORMAT.fastq, where NAME is a prefix for
that example, and FORMAT is one of sanger, solexa or illumina indicating which
FASTQ variant that example is using. There are three matching files called
NAME_as_FORMAT.fastq showing how the original file should be converted into
each of the three FASTQ variants. These converted files are standardised not
to use line wrapping (so each record has exactly four lines), and omit the
optional repetition of the read titles on the plus line.

The file longreads_original_sanger.fastq is based on real Roche 454 reads from
the Sanger Institute for the the potato cyst nematodes Globodera pallida. Ten
of the reads have been presented as FASTQ records, wrapping the sequence and
the quality lines at 80 characters. This means some of the quality lines start
with "@" or "+" characters, which may cause problems with naive parsers. Also
note that the sequence is mixed case (with upper case denoting the trimmed
region), and furthermore the free format title lines are over 100 characters
and encode assorted read information (and are repeated on the "+" lines).

The wrapping_original_sanger.fastq is based on three real reads from the NCBI
Short Read Archive, but has been carefully edited to use line wrapping for the
quality lines (but not the sequence lines) such that the due to the occurrence
of "@" and "+" on alternating lines, the file may be misinterpreted by a
simplistic parser. While this is therefore a very artificial example, it
remains a valid FASTQ file, and is useful for testing purposes.

The sanger_full_range_original_sanger.fastq file uses PHRED scores from 0 to
93 inclusive, covering ASCII characters from 33 (!) to 126 (~). This means it
cannot be treated as a Solexa or Illumina 1.3+ FASTQ file, and attempting to
parse it as such should raise an error.

The solexa_full_range_original_solexa.fastq file uses Solexa scores from -5 to
62 inclusive, covering ASCII characters from 59 (;) to 126 (~). This means it
cannot be treated as a Illumina 1.3+ FASTQ file, and attempting to parse it as
such should raise an error. On the basis of the quality characters, the file
would also qualify as a valid Sanger FASTQ file.

The illumina_full_range_original_illumina.fastq file uses PHRED scores from 0
to 62 inclusive, covering ASCII characters from 64 (@) to 126 (~). On the
basis of the quality characters, the file would also qualify as a valid Sanger
or Solexa FASTQ file.

The misc_dna_original_sanger.fastq and misc_rna_original_sanger.fastq files
are artificial reads using the full range of IUPAC DNA or RNA letters,
including ambiguous character codes, and both cases.