<TITLE>Blocks Format</TITLE>
<H1><IMG SRC="/blocks/icons/small-blocks.xbm"> Format of a Block</H1>
<!-- PRE is required to preserve multiple spaces, etc. -->
<PRE>
ID short_identifier; BLOCK
AC block_number; distance from previous block = (<I>min,max</I>)
DE description
BL <I>xxx</I> motif; width=<I>w</I>; seqs=<I>s</I>; 99.5%=<I>n1</I>; strength=<I>n2</I>
sequence_id (offset) sequence_segment sequence_weight
.
.
.
//</PRE><HR>
ID line starts a block entry and contains a short identifier for the group
of sequences from which the block was made.
If the block was taken from InterPro, it will be the InterPro group ID.
The identifier is terminated by a semicolon, and the word "BLOCK"
indicates the entry type.<P>
AC line contains the block number, a seven-character group number for
sequences from which the block was made, followed by a letter (A-Z)
indicating the order of the block in the sequences.
If the group has only one block, the letter is omitted.
If the block was made from InterPro group IPRnnnnnn, the block number
is IPBnnnnnna.
If the block was converted from Terri Attwood's Prints Database, the
block number is PRnnnnna.
<I>min,max</I> = minimum and maximum number of amino acids between the
previous block and this block, taken over the sequences in this block.
For the first block in the group, this is the distance from
the beginning of the sequences.<P>
DE line contains a description of the group of sequences from which
the block was made. If the block was taken from InterPro, it will be
a slightly edited version of the InterPro description.<P>
BL line contains information about the block:<BR>
<I>xxx</I> = the amino acids in the spaced triplet found by MOTIF upon
which the block is based.<BR>
<I>w</I> = width of the sequence segments (columns) in the block.<BR>
<I>s</I> = number of sequence segments (rows) in the block.<BR>
<I>n1</I> = raw calibration score; 99.5th percentile score of true
negative sequences. Raw search scores are normalized by
dividing by this score and multiplying by 1000.<BR>
<I>n2</I> = median normalized score of known true positive sequences as
documented in InterPro.<P>
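The normalization described for <I>n1</I> is simple arithmetic. The following
is a minimal Python sketch of it; the function name and the sample values are
hypothetical and are not part of the Blocks software:

```python
def normalize_score(raw_score: float, calibration_score: float) -> float:
    """Scale a raw search score so that the 99.5% calibration score maps to 1000."""
    if calibration_score <= 0:
        raise ValueError("calibration score must be positive")
    return raw_score / calibration_score * 1000


# Hypothetical values: a block with 99.5%=1200 and a raw search score of 1500.
print(normalize_score(1500, 1200))  # 1250.0
```

A normalized score above 1000 therefore means the raw score exceeds the
99.5th percentile score of true negative sequences for that block.<P>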
Following the BL line is one line for each sequence with
a segment in the block. The segments may be clustered, with clusters
separated by blank lines.
Each segment line contains a sequence identifier,
the offset from the beginning of the sequence
to the block in parentheses, the sequence segment, and a weight for the
segment. The weights are normalized so that
the most distant segment has a weight of 100.<P>
// line terminates a block entry.
<P>
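The layout described above can be read line by line. The following is a
minimal Python sketch, not the parser used by the Blocks software. It assumes
that each line tag is followed by whitespace, that the BL fields are written as
name=value pairs, and that offsets are non-negative integers. The sample entry
is invented purely for illustration:

```python
import re

# One segment line: sequence_id (offset) sequence_segment sequence_weight
SEGMENT_RE = re.compile(r"^(\S+)\s+\(\s*(\d+)\)\s+(\S+)\s+(\d+(?:\.\d+)?)\s*$")
# Numeric fields on the BL line.
BL_FIELD_RE = re.compile(r"(width|seqs|99\.5%|strength)\s*=\s*(\d+)")
# The (min,max) distance pair on the AC line.
DISTANCE_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_block(text):
    """Parse one block entry into a dict; segment lines go in block['segments']."""
    block = {"segments": []}
    seen_bl = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate clusters
        if line.startswith("//"):
            break  # end of the entry
        if not seen_bl:
            tag, _, rest = line.partition(" ")
            rest = rest.strip()
            if tag == "ID":
                block["id"] = rest.split(";")[0].strip()
            elif tag == "AC":
                block["ac"] = rest.split(";")[0].strip()
                m = DISTANCE_RE.search(rest)
                if m:
                    block["distance"] = (int(m.group(1)), int(m.group(2)))
            elif tag == "DE":
                block["de"] = rest
            elif tag == "BL":
                block["motif"] = rest.split()[0]
                for key, value in BL_FIELD_RE.findall(rest):
                    block[key] = int(value)
                seen_bl = True  # everything after BL is a segment line
            continue
        m = SEGMENT_RE.match(line)
        if m:
            name, offset, segment, weight = m.groups()
            block["segments"].append((name, int(offset), segment, float(weight)))
    return block


# Invented entry for illustration only; it is not a real Blocks record.
SAMPLE = """\
ID   EXAMPLE; BLOCK
AC   IPB000001A; distance from previous block=(3,12)
DE   Hypothetical example family
BL   ACD motif; width=8; seqs=2; 99.5%=1200; strength=1500
SEQ_ONE  (   4) ACDEFGHI  100

SEQ_TWO  (  12) ACDKFGHL   57
//
"""

block = parse_block(SAMPLE)
print(block["ac"], block["width"], len(block["segments"]))
```

Tags are only interpreted before the BL line, so a sequence identifier that
happens to begin with "ID" or "DE" is still read as a segment line.<P>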
<A HREF="/blocks/help/blocks_release.html">Current Blocks Database Release</A>
<P>
<A HREF="/blocks/help/about_blocks.html">About the Blocks Database</A>
<HR>
<H2>Other Multiple Alignment Formats</H2>
<H3><A NAME="fasta">FASTA Format</A></H3>
Each sequence in the multiple alignment starts with a FASTA
title line (beginning with "&gt;") containing the sequence name,
followed by one or more lines of aligned sequence residues,
with dashes representing gaps:
<PRE>
>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM
</PRE>
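Reading an alignment in this format amounts to joining the residue lines that
follow each title line. A minimal Python sketch follows; the function name is
hypothetical, and treating the first word of the title line as the sequence
name is an assumption:

```python
def read_aligned_fasta(text):
    """Return {sequence name: aligned sequence}, keeping '-' gap characters."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # assumption: first word is the name
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


# The first two sequences from the example above.
EXAMPLE = """\
>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
"""

alignment = read_aligned_fasta(EXAMPLE)
for seq_name, residues in alignment.items():
    print(seq_name, len(residues))
```

In a valid multiple alignment every sequence has the same aligned length, so
comparing the printed lengths is a quick sanity check.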
<P><HR>
<H3><A NAME="clustal">CLUSTAL/STOCKHOLM Format</A></H3>
<A HREF="http://www2.ebi.ac.uk/clustalw">ClustalW Site</A>.<BR>
The first non-blank line must contain the word "CLUSTAL" or "STOCKHOLM".
Sequences are interleaved on separate lines with gaps represented
by dashes. Each sequence line starts with the sequence name which is
separated from the aligned sequence residues by spaces or tabs.
Each set of interleaved sequence segments is separated by one or more
blank lines. Lines containing sequence conservation symbols (CLUSTAL)
or "//" (STOCKHOLM) are ignored.
<BR>
(Please note: Some WWW sites post-process Clustal output so that
it has a different format than in this example; in this case use
FASTA format).
<PRE>
CLUSTAL W(1.60) multiple sequence alignment
JC2395 NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
KPEL_DROME MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
FASA_MOUSE NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
JC2395 -------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
KPEL_DROME -------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
FASA_MOUSE -------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
JC2395 IAEEIQAM
KPEL_DROME AMRLIKDY
FASA_MOUSE TLDKFQDM
</PRE>
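The rules above can be sketched in Python as follows. This is an illustrative
reader, not the code the Blocks server uses. It assumes that conservation
lines begin with white space, which is how ClustalW normally writes them, and
it does not handle STOCKHOLM markup lines other than "//". The function name
is hypothetical:

```python
def read_clustal(text):
    """Return {sequence name: aligned sequence} from interleaved CLUSTAL/STOCKHOLM text."""
    lines = iter(text.splitlines())
    # The first non-blank line must identify the format.
    header = next((ln for ln in lines if ln.strip()), "")
    if "CLUSTAL" not in header and "STOCKHOLM" not in header:
        raise ValueError("first non-blank line must contain CLUSTAL or STOCKHOLM")
    alignment = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped == "//":
            continue  # blank separator or STOCKHOLM terminator
        if line[0].isspace():
            continue  # assumption: conservation lines start with white space
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        # Append this segment to whatever was read for the name so far.
        alignment[name] = alignment.get(name, "") + segment
    return alignment


# The first two sequences from the example above.
EXAMPLE = """\
CLUSTAL W(1.60) multiple sequence alignment

JC2395 NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
KPEL_DROME MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----

JC2395 -------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
KPEL_DROME -------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN

JC2395 IAEEIQAM
KPEL_DROME AMRLIKDY
"""

alignment = read_clustal(EXAMPLE)
for seq_name, residues in alignment.items():
    print(seq_name, len(residues))
```

Only the first two fields of each sequence line are used, so a trailing
residue count, which some ClustalW versions append, is ignored.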
<P><HR>
<H3><A NAME="msf">MSF Format</A></H3>
Any comments at the beginning of the file are terminated with a line
starting with two slashes.
Sequences are interleaved on separate lines with gaps represented
by periods. Each sequence line starts with the sequence name which is
separated from the aligned sequence residues by white space:
<PRE>
//
1 50
JC2395 NVSDVNLNK. ..YIWRTAEK MK...ICDAK KFARQHKIPE SKIDEIEHNS
KPEL_DROME MAIRLLPLPV RAQLCAHLDA L.....DVWQ QLATAVKLYP DQVEQISSQK
FASA_MOUSE NASNLSLSK. ..YIPRIAED MT...IQEAK KFARENNIKE GKIDEIMHDS
51 100
JC2395 PQDAAE.... .......... .......... .....QKIQL LQCWYQSHGK
KPEL_DROME QRGRS..... .......... .......... .....ASNEF LNIWGGQYN.
FASA_MOUSE IQDTAE.... .......... .......... .....QKVQL LLCWYQSHGK
101
JC2395 T..GACQALI QGLRKANRCD IAEEIQAM
KPEL_DROME ...HTVQTLF ALFKKLKLHN AMRLIKDY
FASA_MOUSE S..DAYQDLI KGLKKAECRR TLDKFQDM
</PRE>
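A reader for this layout can skip everything up to the "//" line and then
join the residue groups that follow each sequence name. The Python sketch
below is illustrative only. It assumes that lines made up entirely of numbers
are position markers, and converting periods to dashes is a convenience
choice, not something the format requires. The function name is hypothetical:

```python
def read_msf(text, gap="-"):
    """Return {sequence name: aligned sequence}, with '.' gaps replaced by `gap`."""
    alignment = {}
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # Comments run until the line that starts with two slashes.
            in_body = line.startswith("//")
            continue
        fields = line.split()
        if not fields or all(f.isdigit() for f in fields):
            continue  # blank line, or assumed position-marker line such as "1 50"
        name, residues = fields[0], "".join(fields[1:])
        alignment[name] = alignment.get(name, "") + residues.replace(".", gap)
    return alignment


# The first two sequences from the example above.
EXAMPLE = """\
//
1 50
JC2395 NVSDVNLNK. ..YIWRTAEK MK...ICDAK KFARQHKIPE SKIDEIEHNS
KPEL_DROME MAIRLLPLPV RAQLCAHLDA L.....DVWQ QLATAVKLYP DQVEQISSQK
51 100
JC2395 PQDAAE.... .......... .......... .....QKIQL LQCWYQSHGK
KPEL_DROME QRGRS..... .......... .......... .....ASNEF LNIWGGQYN.
101
JC2395 T..GACQALI QGLRKANRCD IAEEIQAM
KPEL_DROME ...HTVQTLF ALFKKLKLHN AMRLIKDY
"""

alignment = read_msf(EXAMPLE)
for seq_name, residues in alignment.items():
    print(seq_name, len(residues))
```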
<HR>
<A href="/blocks">[Blocks home]</A>