## This (modified) SIFT version tested on gcc version 4.7.2
## This SIFT version tested on old legacy BLAST package (blastpgp and formatdb)
This SIFT installation is for
- submitting a single protein sequence
- submitting a protein alignment
- submitting an NCBI gi id.
This SIFT installation is NOT for:
- submitting VCF files (please use SIFT 4G)
- submitting genomic coordinates of variants (please use SIFT 4G)
- SIFT indel prediction
Please go to sift-dna.org or sift-dna.org/sift4g to download code for
the above functionalities.
1. INSTALL NCBI BLAST TOOLS
SIFT uses the psiblast and makeblastdb commands from BLAST+ (tested with version 2.4.0).
Download and install NCBI BLAST 2.4.0+ (or the latest version) from:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
Make sure both programs are in the bin directory of your BLAST
package (if they are not, install 2.4.0, which includes them):
ls <NCBI_BLAST_DIR>/bin
makeblastdb
psiblast
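The check above can also be scripted. The following is a minimal sketch, not part of the SIFT distribution: the function name check_blast_bin is hypothetical, and it assumes the two programs live in a bin subdirectory of the BLAST install directory, as in the listing above.

```sh
#!/bin/sh
# check_blast_bin DIR
# Report whether DIR/bin holds the two BLAST+ programs SIFT needs.
# Hypothetical helper; returns 0 only if both programs are executable.
check_blast_bin() {
    blast_dir=$1
    status=0
    for prog in psiblast makeblastdb; do
        if [ -x "$blast_dir/bin/$prog" ]; then
            echo "found: $blast_dir/bin/$prog"
        else
            echo "missing: $blast_dir/bin/$prog" >&2
            status=1
        fi
    done
    return $status
}

# Usage (replace the placeholder with your BLAST+ install directory):
#   check_blast_bin <NCBI_BLAST_DIR>
```

A non-zero return status means at least one of the two programs was not found or is not executable.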
2. Download and format reference protein sequence database
This step is required for SIFT sequence submission. It is optional
if you are inputting a protein alignment or an NCBI gi id.
SIFT searches a database of protein sequences to find homologous
sequences. You will need to download a database of protein sequences
and format it properly so that SIFT subroutines can read it.
2a. Download a protein database. Some options are:
UniRef:
ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniref/
SWISS-PROT/TrEMBL:
ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/
Refseq:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
(I personally prefer uniref90.gz)
2b. Format your database
SIFT cannot process names of fasta sequences that contain
delimiters such as "|" or ":". Also, it needs the first 10
characters of the sequence name to be unique.
Format the standard databases so that the sequence names are
simpler.
For UniRef, the format command is:
zcat uniref90.gz | perl -pe 's/^>UR090:UniRef90_/>/' > uniref90.fa
For Swiss-Prot or TrEMBL databases, the format command is:
zcat uniprot_swissprot.gz | perl -pe 's/^>TR:/>/; s/^>SP:/>/' > uniprotkb_swissprot.fa
For NCBI:
zcat complete.1000.protein.faa.gz |perl -pe 's/^>gi\|/>gi/' | perl -pe 's/\|/ /' > complete.1000.protein.faa
The names will be changed to the proper format for proper
parsing.
If you have your own protein sequence database and SIFT
is not properly recognizing the names, go to
src/Alignment.c and modify "fix_names"
Recompile ALL programs.
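Before building the BLAST database, it may help to confirm that the reformatted FASTA file meets the two naming constraints described in 2b. The sketch below is an illustration only: check_fasta_names is a hypothetical helper, and it assumes the sequence name is the text between ">" and the first whitespace on each header line.

```sh
#!/bin/sh
# check_fasta_names FILE
# Print every sequence name that contains "|" or ":", and every name whose
# first 10 characters repeat those of an earlier name.
# Hypothetical helper; returns non-zero if any problem was printed.
check_fasta_names() {
    awk '
        /^>/ {
            # Assumption: the name is the first whitespace-delimited word.
            name = substr($1, 2)
            if (name ~ /[|:]/) {
                print "delimiter: " name
                bad = 1
            }
            key = substr(name, 1, 10)
            if (key in seen) {
                print "duplicate prefix: " name " (same as " seen[key] ")"
                bad = 1
            } else {
                seen[key] = name
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   check_fasta_names uniref90.fa
```

If the function prints nothing, no header in the file violates either constraint under the assumption stated above.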
2c. Build the BLAST database with makeblastdb:
<BLAST_DIR>/bin/makeblastdb -in <protein database fasta> -dbtype prot
3. UNPACK SIFT
tar -zxvf sift-<version>.tar.gz
This will create a directory of the form, sift-<version>. Change to
the directory.
cd sift-<version>
You can SKIP STEPS 4 & 5 if you're using Linux. Linux executables are already in the bin directory.
4. COMPILE BLIMPS CODE:
Go to the blimps directory:
cd <SIFT_HOME>/blimps
Compile blimps by typing:
make clean
make all
Folders 'obj','lib' and 'include' will be created.
Check that the blimps library 'libblimps.a' is generated in the 'lib'
folder.
If 'libblimps.a' is present, this indicates that BLIMPS has compiled
successfully.
5. COMPILE SIFT CODE:
Compile SIFT
Set the BLIMPS path in the file cccb (<SIFT_HOME>/src/cccb)
cd <SIFT_HOME>/src
Edit the file 'cccb'
set b = <BLIMPS Home>
set CC = <Path to GCC>
Compile the SIFT executables.
cd <SIFT_HOME>/src
./compile.csh
'compile.csh' will generate the executables and move them to the <SIFT_HOME>/bin directory.
Check <SIFT_HOME>/bin directory for the following executables.
choose_seqs_via_psiblastseedmedian
clump_output_alignedseq
info_on_seqs
consensus_to_seq
psiblast_res_to_fasta_dbpairwise
seqs_from_psiblast_res
6. SET PATHS
Edit the files in <SIFT_HOME>/bin to set environmental variables
SIFT_for_submitting_fasta_seq.csh
SIFT_for_submitting_NCBI_gi_id.csh
# Set NCBI, BLIMPS and SIFT path in file 'SIFT_for_submitting_fasta_seq.csh' (<SIFT_HOME>/bin/SIFT_for_submitting_fasta_seq.csh)
setenv NCBI <BLAST directory where psiblast and makeblastdb are>
setenv SIFT_DIR <SIFT_HOME> # Location of SIFT (use SIFT absolute path eg: /home/software/sift)
setenv tmpdir <SIFT_HOME>/tmp # SIFT's temporary output directory (use the absolute path, e.g. /home/software/sift/tmp)
7. RUNNING SIFT
# Before testing or running SIFT, change the working directory to the SIFT bin directory (<SIFT_HOME>/bin) and execute the SIFT commands below (important).
# i.e., the current working directory is '<SIFT_HOME>/bin'
A. Input: Protein sequence. (SIFT chooses homologues).
Requires 3 inputs:
1) Protein sequence in fasta format.
2) Protein database to search. These sequences are
assumed to be functional
3) File of substitutions to be predicted on (optional).
See test/lacI.subst for an example of the format.
This file is optional. Alternatively, you can print
scores for the entire protein sequence.
Results will be in <tmpdir>/<seq_file>.SIFTprediction
COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
Go to the SIFT directory <SIFT_HOME>:
cd <SIFT_HOME>
Type the commandline:
bin/SIFT_for_submitting_fasta_seq.csh <seq file> <protein_database> <file of substitutions>
EXAMPLE:
bin/SIFT_for_submitting_fasta_seq.csh test/lacI.fasta <protein_database> test/lacI.subst
Results will appear in <tmpdir>/lacI.fasta.SIFTprediction.
COMMANDLINE TO PRINT ALL SIFT SCORES
bin/SIFT_for_submitting_fasta_seq.csh <seq file> <protein_database> -
A dash "-" replaces the list of substitutions.
Results will appear in <tmpdir>/<seq file>.SIFTprediction
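To locate the results file from a script, the naming rule can be written as a small function. This is a sketch under one assumption, inferred from the lacI example above: the output name uses the base name of the sequence file (test/lacI.fasta gives lacI.fasta.SIFTprediction). The function name sift_result_path is hypothetical.

```sh
#!/bin/sh
# sift_result_path TMPDIR SEQ_FILE
# Print the path where the prediction file is expected to appear.
# Hypothetical helper; assumes the directory part of SEQ_FILE is dropped,
# as in the lacI example.
sift_result_path() {
    printf '%s/%s.SIFTprediction\n' "$1" "$(basename "$2")"
}

# Usage:
#   sift_result_path /home/software/sift/tmp test/lacI.fasta
```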
B. Input: Your own protein alignment (and the path to the environmental variable BLIMPS_DIR, which was set in SIFT.csh).
COMMANDLINE FORMAT FOR A LIST OF SUBSTITUTIONS:
cd <SIFT_DIR>
export BLIMPS_DIR=<blimps_path>
bin/info_on_seqs <protein alignment> <substitution file> <output file>
EXAMPLE:
Type:
cd <SIFT_DIR>
export BLIMPS_DIR=<blimps_path>
bin/info_on_seqs test/lacI.alignedfasta test/lacI.subst test/lacI.fasta.SIFTprediction
Prediction results will appear in
test/lacI.fasta.SIFTprediction, read above for description
of output.
COMMANDLINE TO PRINT ALL SIFT SCORES:
export BLIMPS_DIR=<blimps_path>
bin/info_on_seqs <protein alignment> - <output file>
Example:
Type :
export BLIMPS_DIR=<blimps_path>
bin/info_on_seqs test/lacI.alignedfasta - test/lacI.fasta.SIFTprediction
and scores for each position will appear in the file
test/lacI.fasta.SIFTprediction
C. Input: BLink gi.
Rather than using BLAST to search for homologous sequences,
homologues are retrieved from NCBI BLink.
csh ./SIFT_for_submitting_NCBI_gi_id.csh <gi id> <subst_file> BEST
1) <gi id> : the NCBI protein gi
This protein ID is used to look up BLAST hits from NCBI
2) File of substitutions to be predicted on (optional).
See test/lacI.subst for an example of the format.
This file is optional. Alternatively, you can print
scores for the entire protein sequence by entering a "-".
3) Type of hits to retrieve from BLink (optional). The two options
are BEST or ALL. By default, ALL hits are retrieved. To get
reciprocal best hits, pass in "BEST".
Results will be stored in the $tmpdir/<gi id>.SIFTprediction
COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
If you are in SIFT_DIR, the commandline is:
csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> <file of substitutions> <BEST or ALL>
EXAMPLE:
If you have a list of substitutions, type the following:
csh bin/SIFT_for_submitting_NCBI_gi_id.csh 22209009 test/gi22209009.subst BEST
Results will appear in $tmpdir/22209009.SIFTprediction
COMMANDLINE TO PRINT ALL SIFT SCORES
csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> - BEST
A dash "-" replaces the substitution file, and BEST is optional.
Results will appear in $tmpdir/<gi id>.SIFTprediction
8. OUTPUT
A. When a list of substitutions is submitted (.subst file)
Results will appear in <tmpdir>/lacI.fasta.SIFTprediction and
look something like:
K2S TOLERATED 0.08 3.47 LOW CONFIDENCE
P3M TOLERATED 0.08 3.35 LOW CONFIDENCE
V15K INTOLERANT 0.00 2.84
According to this output, the SIFT score for K2S is 0.08
and the median information of the sequences that have an
amino acid represented at position 2 is 3.47. If this
number exceeds 3.25, the substitution is annotated as
having LOW CONFIDENCE (which means too few sequences were
represented at that position). There are enough sequences for
confidence in the V15K prediction.
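The 3.25 cutoff described above can be applied to a prediction file directly. The sketch below is an illustration, not a SIFT tool: low_confidence is a hypothetical helper, and it assumes the columns are whitespace-separated in the order shown in the sample (substitution, prediction, score, median information).

```sh
#!/bin/sh
# low_confidence FILE
# Print the substitutions whose median information (4th column) exceeds
# the 3.25 cutoff. Hypothetical helper; assumes the column layout of the
# sample output shown above.
low_confidence() {
    awk -v cutoff=3.25 '$4 + 0 > cutoff { print $1 }' "$1"
}

# Usage:
#   low_confidence <tmpdir>/lacI.fasta.SIFTprediction
```

Run on the three sample lines above, this lists K2S and P3M and omits V15K, which matches the LOW CONFIDENCE annotations in the sample.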
B. When "-" is used instead of a substitution list, predictions for
all possible amino acid changes are printed out.
Results will appear in <tmpdir>/<seq file>.SIFTprediction
Each row is a position in the sequence (row 1 is amino acid
position 1, row 2 is amino acid position 2), and the SIFT scores
for each amino acid substitution are printed in that row.