File: INSTALL

package info (click to toggle)
sift 6.2.1-2
  • links: PTS, VCS
  • area: non-free
  • in suites: sid
  • size: 4,784 kB
  • sloc: ansic: 18,272; perl: 219; csh: 164; makefile: 152
file content (283 lines) | stat: -rw-r--r-- 9,916 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
## This (modified) SIFT version tested on gcc version 4.7.2
## This SIFT version tested on old legacy BLAST package (blastpgp and formatdb)

This SIFT installation is for
 - submitting a single protein sequence
 - submitting a protein alignment
 - submitting a NCBI gi id.

This SIFT installation is NOT for:
 - submitting VCF files (please use SIFT 4G)
 - submitting genomic coordinates of variants (please use SIFT 4G)
 - SIFT indel prediction  
Please go to sift-dna.org or sift-dna.org/sift4g to download code for 
the above functionalities.
 
1. INSTALL NCBI BLAST TOOLS 
	SIFT uses PSI-BLAST and makeblastdb commands from BLAST+ (This has been tested for 2.4.0).

	Download and install NCBI BLAST 2.4.0+ (or the latest) version at:
 
	     ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/

	SIFT uses NCBI's PSI-BLAST and makeblastdb.  Make sure these programs 
	are inside your package (otherwise install 2.4.0 which has these 
	programs).  
		ls <NCBI_BLAST_DIR>/bin

		makeblastdb
		psiblast


2. Download and format reference protein sequence database 
	This step is required for SIFT sequence submission. It is optional 
	if you are inputting a protein alignment or a NCBI gi id.

	SIFT searches a database of protein sequences to find homologous
        sequences.  You will need to download a database of protein sequences
        and format it properly so that SIFT subroutines can read it.

	2a. Download a protein database. Some options are:

	    UniRef: 
		ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniref/ 

	    SWISS-PROT/TrEMBL: 
		ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/ 

	    Refseq:
		ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/


	    (I personally prefer uniref90.gz) 

	2b. Format your database
	    SIFT cannot process names of fasta sequences that contain
	    delimiters such as "|" or ":". Also, it needs the first 10
	    characters of the sequence name to be unique.  
	
	    Format the standard databases so that the sequence names are
	    simpler.

	    For UniRef, the format command is:
	    zcat uniref90.gz | perl -pe 's/^>UR090:UniRef90_/>/'  > uniref90.fa

	    For Swiss-Prot or TrEMBL databases, the format command is: 
	    zcat uniprot_swissprot.gz | perl -pe 's/^>TR:/>/; s/^>SP:/>/' > uniprotkb_swissprot.fa

	    For NCBI:
            zcat complete.1000.protein.faa.gz |perl -pe 's/^>gi\|/>gi/' | perl -pe 's/\|/ /' > complete.1000.protein.faa

            The names will be changed to the proper format for proper 
	    parsing.

            If you have your own protein sequence database and SIFT 
	    is not properly recognizing the names, go to 
            src/Alignment.c and modify "fix_names"
            Recompile ALL programs.

	2c.  Format your database:
		<BLAST_DIR>/bin/makeblastdb -in <protein database fasta> -dbtype prot
					  	
3. UNPACK SIFT
	
	tar -zxvf sift-<version>.tar.gz

	This will create a directory of the form, sift-<version>.  Change to
	the directory.

	cd sift-<version>


You can SKIP STEPS 4 & 5 if you're using Linux. Linux executables are already in the bin directory. 

4. COMPILE BLIMPS CODE: 
	Go to the blimps directory:	
		cd <SIFT_HOME>/blimps	
	
	Compile blimps by typing:
		make clean
		make all	
		
	Folders 'obj','lib' and 'include' will be created. 
	Check that the blimps library 'libblimps.a' is generated in the 'lib'
	folder. 
	If 'libblimps.a' is present, this indicates that BLIMPS has compiled
	successfully.
	
	
5. COMPILE SIFT CODE:
	Compile SIFT
	Set the BLIMPS path in the file cccb (<SIFT_HOME>/src/cccb)
	
		cd <SIFT_HOME>/src
	
	Edit the file 'cccb'
	
		set b = <BLIMPS Home>		
		set CC = <Path to GCC>	


	Compile the SIFT executables.	
		cd <SIFT_HOME>/src
		./compile.csh	

		
	'compile.csh' will generate and move executable to <SIFT_HOME>/bin dirctory.
	Check <SIFT_HOME>/bin directory for the following executables.
		choose_seqs_via_psiblastseedmedian
                clump_output_alignedseq
                info_on_seqs
                consensus_to_seq
                psiblast_res_to_fasta_dbpairwise
                seqs_from_psiblast_res

                                                                          
6. SET PATHS 

	Edit the files in <SIFT_HOME>/bin to set environmental variables
		SIFT_for_submitting_fasta_seq.csh
		SIFT_for_submitting_NCBI_gi_id.csh
		
	# Set NCBI, BLIMPS and SIFT path in file 'SIFT_for_submitting_fasta_seq.csh' (<SIFT_HOME>/bin/SIFT_for_submitting_fasta_seq.csh)
	
		setenv NCBI <BLAST directory where psiblast and makeblastdb are>
			
		setenv SIFT_DIR <SIFT_HOME>			# Location of SIFT (use SIFT absolute path eg: /home/software/sift)
			
		setenv tmpdir = <SIFT_HOME>/tmp			# SIFT's temporary output directory (use sift absolute path eg: /home/software/sift/tmp) 


7. RUNNING SIFT

	# Before test/run SIFT program, change working directory to SIFT bin (<SIFT_HOME>/bin) directory and execute below SIFT command (important).
	# ie, Current working directoy is '<SIFT_HOME>/bin'

	A.  Input: Protein sequence. (SIFT chooses homologues).
		Requires 3 inputs:
		1) Protein sequence in fasta format.
		2) Protein database to search.  These sequences are 
		   assumed to be functional
		3) File of substitutions to be predicted on (optional).  
		   See test/lacI.subst for an example of the format. 
		   This file is optional. Alternatively, you can print 
	           scores for the entire protein sequence. 

		Results will be in <tmpdir>/<seq_file>.SIFTprediction

		COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
		Go to the bin directory <SIFT_HOME>:

			cd <SIFT_HOME>

		Type the commandline: 

			bin/SIFT_for_submitting_fasta_seq.csh <seq file> <protein_database> <file of substitutions> 

		EXAMPLE:
			bin/SIFT_for_submitting_fasta_seq.csh test/lacI.fasta <protein_database> test/lacI.subst 

		Results will appear in <tmpdir>/lacI.fasta.SIFTprediction. 

		COMMANDLINE TO PRINT ALL SIFT SCORES
		bin/SIFT_for_submitting_fasta_seq.csh <seq file> <protein_database> - 
		A dash "-" replaces the list of substitutions.

		Results will appear in <tmp_dir>/<seq file>.SIFTprediction  

	B.	Input: Your own protein alignment (and the path to the environmental variable BLIMPS_DIR, which was set in SIFT.csh).

		COMMANDLINE FORMAT FOR A LIST OF SUBSTITIONS:

		cd <SIFT_DIR>
		export BLIMPS_DIR=<blimps_path>
		bin/info_on_seqs <protein alignment> <substitution file> <output file>

		EXAMPLE:
		Type:

		cd <SIFT_DIR>
                export BLIMPS_DIR=<blimps_path>
		bin/info_on_seqs test/lacI.alignedfasta test/lacI.subst test/lacI.fasta.SIFTprediction

		Prediction results will appear in 
		test/lacI.fasta.SIFTprediction, read above for description 
		of output.

		COMMANDLINE TO PRINT ALL SIFT SCORES:

		export BLIMPS_DIR=<blimps_path>
		bin/info_on_seqs <protein alignment> - <output file>

		Example:
		Type :

		export BLIMPS_DIR=<blimps_path>
		bin/info_on_seqs test/lacI.alignedfasta - test/lacI.fasta.SIFTprediction

		and scores for each position will appear in the file 
		test/lacI.fast.SIFTprediction

	C.      Input: BLink gi. 
		Rather than using BLAST to search for homologous sequences,
		homologues are retrieved from NCIB BLink. 
		
			csh ./SIFT_for_submitting_NCBI_gi_id.csh <gi id> <subst_file> BEST

                1) <gi id> : the NCBI protein gi
 
		   This protein ID is used to look up BLAST hits from NCBI
                2) File of substitutions to be predicted on (optional).
                   See test/lacI.subst for an example of the format.
                   This file is optional. Alternatively, you can print
                   scores for the entire protein sequence by entering a "-".
		3) Type of hits to retrieve from BLink (optional).  The two options
		   are BEST or ALL.  By default,ALL hits are retrived. To get
	           reciprocal best hits , pass in "BEST".  
                
		Results will be stored in the $tmpdir/<gi id>.SIFTprediction

                COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
                If you are in SIFT_DIR, the commandline is:

                	csh ./SIFT_for_submitting_NCBI_gi_id.csh <gi id> <file of substitutions> <BEST or ALL> 

                EXAMPLE:
                If you have a list of substitutions, type the following:
                csh bin/SIFT_for_submitting_NCBI_gi_id.csh 22209009 test/gi22209009.subst BEST 

                Results will appear in $tmpdir/22209009.SIFTprediction

                COMMANDLINE TO PRINT ALL SIFT SCORES
               	csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> - BEST 

		A dash "-" replaces the substitution file, and BEST is optional.
                Results will appear in <gi id>.SIFT prediction.

8.  OUTPUT
	A. When a list of substitutions is submitted (.subst file)
		Results will appear in <tmpdir>/lacI.fasta.SIFTprediction and
                look something like:

                K2S     TOLERATED 0.08 3.47 LOW CONFIDENCE
                P3M     TOLERATED 0.08  3.35 LOW CONFIDENCE
                V15K    INTOLERANT 0.00 2.84

                According to this output, the SIFT score for K2S is 0.08
                and the median information of the sequences that have an
                amino acid represented at the position 2 is 3.47.  If this
                number exceedes 3.25 the substitution is annotated as
                having LOW CONFIDENCE (which means too few sequences were
                represented at that position.)  There are enough sequences for
                confidence in the V15K prediction.


	B. When "-" is used instead of a substitution list, predictions for 
	   all possible amino acid changes are printed out.

		Results will appear in <tmp_dir>/<seq file>.SIFTprediction
                Each row is a position
                in the sequence (row 1 is amino acid position 1, row 2 is
                amino acid 2) and
                the SIFT scores for each amino acid substitution are printed for                each row.