File: format_ncbi.tex

package info (click to toggle)
hmmer 3.2.1%2Bdfsg-1
links: PTS, VCS
area: main
in suites: buster
size: 23,380 kB
sloc: ansic: 119,305; perl: 8,791; sh: 3,266; makefile: 1,871; python: 598
file content (271 lines) | stat: -rw-r--r-- 10,446 bytes
parent folder | download | duplicates (9)

The NCBI BLAST databases are generated by the program \emph{formatdb}.
The three files needed by Easel are index file, sequence file 
and header file.For protein databases these files end with the extension 
".pin", ".psq" and ".phr" respectively.  For DNA databases the 
extensions are ".nin", ".nsq" and ".nhr" respectively.  The index 
file contains information about the database, i.e. version number, 
database type, file offsets, etc.  The sequence file contains residues
for each of the sequences.  Finally, the header file contains the header 
information for each of the sequences.  This document describes the
structure of the NCBI version 4 database.

If these files cannot be found, an alias file, extensions ".nal" or
 ".pal", is processed.  The alias file is used to specify multiple
volumes when databases are larger than 2 GB.


\subsection{Index File (*.pin, *.nin)}

The index file contains format information about the database.   The 
layout of the version 4 index file is below:

\bigskip
\begin{center}
\begin{tabular}{|l|l|p{3.5in}|} \hline
Version &
Int32 &
Version number. \\ \hline
Database type &
Int32 &
0-DNA 1-Protein.  \\ \hline
Title length &
Int32 &
Length of the title string (\emph{T}).  \\ \hline
Title &
Char[\emph{T}] &
Title string.  \\ \hline
Timestamp length &
Int32 &
Length of the timestamp string (\emph{S}).  \\ \hline
Timestamp &Char[\emph{S}] &
Time of the database creation.  The length of the timestamp \emph{S} is 
increased to force 8 byte alignment of the remaining integer 
fields.  The timestamp is padded with NULs to achieve this alignment.   \\ \hline
Number of sequences &
Int32 &
Number of sequences in the database (\emph{N})  \\ \hline
Residue count &
Int64 &
Total number of residues in the database.  Note:  Unlike other integer 
fields, this field is stored in little endian.  \\ \hline
Maximum sequence &
Int32 &
Length of the longest sequence in the database  \\ \hline
Header offset table &
Int32[\emph{N+1}] &
Offsets into the sequence's header file (*.phr, *nhr). \\ \hline
Sequence offset table &
Int32[\emph{N+1}] &
Offsets into the sequence's residue file (*.psq, *.nsq). \\ \hline
Ambiguity offset table &
Int32[\emph{N+1}] &
Offset into the sequence's residue file (*.nsq).  Note: This table is only 
in DNA databases.  If the sequence does not have any ambiguity 
residues, then the offset points to the beginning of the next 
sequence.  \\ \hline
\end{tabular}
\end{center}
\bigskip

The integer fields 
are stored in big endian format, except for the residue count which is 
stored in little endian.  The two string fields, timestamp and title 
are preceded by a 32 bit length field.  The title string is not NUL 
terminated.  If the end of the timestamp field does not end on an offset 
that is a multiple of 8 bytes, NUL characters are padded to the end of 
the string to bring it to a multiple of 8 bytes.  This forces all the 
following integer fields to be aligned on a 4-byte boundary for 
performance reasons.  The length of the timestamp field reflects the 
NUL padding if any.  The header offset table is a list of offsets to
the beginning of each sequence's header.  These are offsets into
the header file (*.phr, *.nhr).  The size of the header can be 
calculated by subtracting the offset of the next header from the 
current header.  
The sequence offset table is a list of offsets to
the beginning of each sequence's residue data.  These are offsets into
the sequence file (*.psq, *.nsq).  The size of the sequence can be 
calculated by subtracting the offset of the next sequence from the 
current sequence.  
Since one more offset is stored than the number 
of sequences in the database, no special code is needed in calculating 
the header size or sequence size for the last entry in the database.


\subsection{Protein Sequence File (*.pin)}

The sequence file contains the sequences, one after another.  The 
sequences are in a binary format separated by a NUL byte.  Each 
residue is encoded in eight bits.

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
Amino acid & Value & Amino acid & Value & Amino acid & Value & Amino acid & Value \\ \hline
- & 0 & G &  7 & N & 13 & U & 24 \\ \hline
A & 1 & H &  8 & O & 26 & V & 19 \\ \hline
B & 2 & I &  9 & P & 14 & W & 20 \\ \hline
C & 3 & J & 27 & Q & 15 & X & 21 \\ \hline
D & 4 & K & 10 & R & 16 & Y & 22 \\ \hline
E & 5 & L & 11 & S & 17 & Z & 23 \\ \hline
F & 6 & M & 12 & T & 18 & * & 25 \\ \hline
\end{tabular}
\end{center}


\subsection{DNA Sequence File (*.nsq)}

The sequence file contains the sequences, one after another.  The 
sequences are in a binary format but unlike the protein sequence 
file, the sequences are not separated by a NUL byte.  The 
sequence is first compressed using two bits per residue then 
followed by an ambiguity correction table if 
necessary.  If the sequence does not have an ambiguity table, 
the sequence's ambiguity index points to the beginning of the 
next sequence.

\subsubsection{Two-bit encoding}

The sequence is encoded first using two bits per nucleotide.

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|} \hline
Nucleotide & Value & Binary \\ \hline
A & 0 & 00 \\ \hline
C & 1 & 01 \\ \hline
G & 2 & 10 \\ \hline
T or U & 3 & 11 \\ \hline
\end{tabular}
\end{center}
\bigskip

Any 
ambiguous residues are replaced by an 'A', 'C', 'G' or 'T' in 
the two bit encoding.  To calculate the number of residues 
in the sequence, the least significant two bits in the 
last byte of the sequence needs to be examined.  
These last two bits indicate how many residues, if any, are 
encoded in the most significant bits of the last byte.


\subsubsection{Ambiguity Table}

To correct a sequence containing any degenerate residues, an 
ambiguity table follows the two bit encoded string.  
The start of the ambiguity table is 
pointed to by the ambiguity table index in the index file, 
"*.nin".  The first four bytes contains the number of 32 bit 
words in the correction table.  If the most significant bit 
is set in the count, then two 32 bit entries will be used for 
each correction.  
The 64 bit entries are used for sequence with
more than 16 million residues.  Each correction contains three 
pieces of 
information, the actual encoded nucleotide, how many nucleotides
in the sequence are replaced by the correct nucleotide and finally 
the offset into the sequences to apply the correction. 

For 32 bit
entries, the first 4 most significant bits encodes the nucleotide.
Their bit pattern is 
true of their representation, i.e. the value of 'H' is equal 
to ('A'~or~'T'~or~'C').  

\bigskip
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
Nucleotide & Value & Nucleotide & Value & Nucleotide & Value & Nucleotide & Value \\ \hline
- & 0 & G & 4 & T & 8  & K & 12 \\ \hline
A & 1 & R & 5 & W & 9  & D & 13 \\ \hline
C & 2 & S & 6 & Y & 10 & B & 14 \\ \hline
M & 3 & V & 7 & H & 11 & N & 15 \\ \hline
\end{tabular}
\end{center}
\bigskip

The next field is the repeat count which is four bits wide. 
One is added to the count giving it the range of 1 -- 256.  
The last 24 bits is the offset into the sequence where the
replacement starts.  The first residue start at offset zero,
the second at offset one, etc.  With a 24 bit size, the offset
can only address sequences around 16 million residues long.

To address larger sequences, 64 bit
entries are used.  For 64 bit entries, the order of the entries stays the same, 
but their sizes change.  The nucleotide remains at four bits.
The repeat count is increased to 12 bits giving it the range 
of 1 -- 4096.  The offset size is increased to 48 bits.


\subsection{Header File (*.phr, *.nhr)}

The header file contains the headers for each sequence, one after another.  
The sequences are in a binary encoded ASN.1 format.  The length 
of a header can be calculated by subtracting the offset of the 
next sequence from the current sequence offset.  The ASN.1 definition 
for the headers can be found in the NCBI toolkit in the following 
files: asn.all and fastadl.asn.

The parsing of the header can be done with a simple recursive 
descent parser.  The five basic types defined in the header are:

\begin{itemize}
\item Integer -- a variable length integer value.
\item VisibleString -- a variable length string.
\item Choice -- a union of one or more alternatives.  
\item Sequence -- an ordered collection of one or more types.  
\item SequenceOf -- an ordered collection of zero or more occurrences 
of a given type.
\end{itemize} 

\subsubsection{Integer}

The first byte of an encoded integer is a hex \verb+02+.  The next byte 
is the number of bytes used to encode the integer value.  The 
remaining bytes are the actual value.  The value is encoded 
most significant byte first.

\subsubsection{VisibleString}

The first byte of a visible string is a hex \verb+1A+.  
The next byte 
starts encoding the length of the string.  If the most 
significant bit is off, then the lower seven bits encode the 
length of the string, i.e. the string has a length less than 128. 
If the most significant bit is on, then 
the lower seven bits is the number of bytes that hold the length of 
the string, then the bytes encoding the string length, most significant 
bytes first.
Following the length are the actual string characters.
The strings are not NUL terminated.  

\subsubsection{Choice}


The first byte indicates which selection of the choice.  The choices 
start with a hex value \verb+A0+ for the first item, \verb+A1+ for 
the second, etc.  
The selection is followed by a hex \verb+80+.  Two NUL bytes follow 
the choice.

\subsubsection{Sequence}

The first two bytes are a hex \verb+3080+.  The header is 
then followed by 
the encoded sequence types.  The first two bytes indicates which type 
of the sequence is encoded.  This index starts with the hex value 
\verb+A080+ 
for the first item, \verb+A180+ for the second, etc. then 
followed by the 
encoded item and finally two NUL bytes, \verb+0000+, to indicate the end 
of that type.  The next type in the sequence is then encoded.  If an 
item is optional and is not defined, then none of it is encoded 
including the index and NUL bytes.  This is repeated until the entire 
sequence has been encoded.  Two NUL bytes then mark the end of the 
sequence.

\subsubsection{SequenceOf}

The first two bytes are a hex \verb+3080+.  Then the lists of objects are 
encoded.  Two NUL bytes encode the end of the list.