PROVEAN v.1.1.5 (May 7, 2014)
0. QUICK INSTALLATION
$ tar zxvf provean-1.1.5.tar.gz
$ cd provean-1.1.5
$ ./configure
$ make
$ make install
I. PREREQUISITES
PROVEAN requires the following software and database.
1. NCBI BLAST 2.2.28+ (or more recent)
This is available at the NCBI ftp site.
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
2. CD-HIT 3.1.2 (or more recent, but currently v4.6 and v4.6.1 are not
recommended since those versions have a reported problem,
https://code.google.com/p/cdhit/issues/detail?id=18)
This is available at the CD-HIT website.
http://weizhong-lab.ucsd.edu/cd-hit/download.php
3. NCBI nr (non-redundant) protein database
Only BLAST pre-formatted databases are needed. The NCBI nr database
(released August 2011) used for our publication is available at the
JCVI ftp site. Download and unpack the nr.*.tar.gz files.
ftp://ftp.jcvi.org/pub/data/provean/
The current version of the nr database is available at the NCBI ftp site.
ftp://ftp.ncbi.nih.gov/blast/db/
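
Before running configure, it can help to confirm that the required tools
are visible on your PATH. The following is a minimal sketch, not part of
the PROVEAN distribution; check_tools is a hypothetical helper name, and
the binary names are the ones configure looks for (psiblast, blastdbcmd,
cd-hit).

```shell
#!/bin/sh
# Report whether each named tool is visible on PATH.
# Returns 0 if all were found, 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Binary names taken from the configure step of this README.
check_tools psiblast blastdbcmd cd-hit ||
    echo "Install the missing tools or pass their paths to ./configure."
```

This only checks that the binaries can be found. It does not check their
versions against the minimums listed above.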
II. INSTALLATION INSTRUCTIONS
1. Download and/or install prerequisites described above.
2. Unpack the distribution:
$ tar zxvf provean-1.1.5.tar.gz
3. Change to the source directory:
$ cd provean-1.1.5
4. Run configure:
Specify the location of the NCBI nr database, including the alias file
name without the file extension (e.g. nr for nr.pal):
$ ./configure BLAST_DB=/path/to/blast/database/nr
or, if you do not have the nr database yet, run configure without it
(in this case, you can set the location manually later):
$ ./configure
By default this will place the PROVEAN binary files in /usr/local/bin
and the associated files in /usr/local/share/data/provean or
/usr/local/share/doc/provean. If you want to place the executables in
a different location, or you do not have write permissions (i.e. are
not the root or superuser) to these directories, then you may specify
a different location using:
$ ./configure --prefix=/other/path/
If configuration fails because psiblast, blastdbcmd, or cd-hit cannot
be found even though it is installed, either add its directory to your
PATH variable or specify its location (including the binary file
name), e.g.:
$ ./configure PSIBLAST=/path/to/psiblast
$ ./configure CDHIT=/path/to/cd-hit
$ ./configure PSIBLAST=/path/to/psiblast BLASTDBCMD=/path/to/blastdbcmd CDHIT=/path/to/cd-hit
5. Compile:
$ make
6. Install the software and data:
$ make install
III. SETTING UP DATABASE & SOFTWARE LOCATION
If you did not specify the location of the NCBI nr database during
installation, edit the provean.sh file in /usr/local/bin or under the
directory you specified with the --prefix option, and set the BLAST_DB
variable accordingly.
You can also change the path to psiblast, blastdbcmd, or cd-hit in the
same way if you want to use a different version.
e.g.)
BLAST_DB="/path/to/blast/database/nr"
PSIBLAST="/path/to/psiblast"
BLASTDBCMD="/path/to/blastdbcmd"
CD_HIT="/path/to/cd-hit"
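
A quick sanity check of the BLAST_DB value can catch a wrong path before a
long run. The sketch below assumes a multi-volume database with an alias
file (nr.pal, as in the configure example); a single-volume database may
have no .pal file, so treat a negative result as a hint only.
has_blast_alias is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the alias file for a multi-volume BLAST protein database exists.
# $1 is the database path without extension, in the form BLAST_DB expects.
has_blast_alias() {
    [ -f "$1.pal" ]
}

# Placeholder path copied from the example above; replace with your own.
BLAST_DB="/path/to/blast/database/nr"
if has_blast_alias "$BLAST_DB"; then
    echo "found $BLAST_DB.pal"
else
    echo "no alias file at $BLAST_DB.pal"
fi
```

For a fuller check, "blastdbcmd -db /path/to/blast/database/nr -info"
asks BLAST itself to open the database and print a summary.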
IV. RUNNING PROVEAN
1. To see the PROVEAN usage instructions, run provean.sh with the -h option:
$ provean.sh -h
PROVEAN v1.1.5
USAGE:
provean.sh [Options]
Example:
# Given a query sequence in aaa.fasta file,
# compute scores for variations in bbb.var file
provean.sh -q aaa.fasta -v bbb.var
Required arguments:
-q <string>, --query <string>
Query protein sequence filename in fasta format
-v <string>, --variation <string>
Variation filename containing a list of variations:
one entry per line in HGVS notation,
e.g.: G105C, F508del, Q49dup, Q49_P50insC, Q49_R52delinsLI
Optional arguments:
--save_supporting_set <string>
Saves supporting sequence set infomation into a given filename
--supporting_set <string>
Supporting sequence set filename saved with
'--save_supporting_set' option above
(This will save time for BLAST search and clustering.)
--tmp_dir <string>
Temporary directory used to store temporary files
--num_threads <integer>
Number of threads (CPUs) to use in BLAST search
-V, --verbose
Verbosely shows the information about procedure
-h, --help
Gives this help message
2. Run PROVEAN with test examples:
Test examples are provided in the share/data/provean/examples directory.
Change to that directory and run provean.sh with the options shown
below. You should get similar results, but the scores may differ
since they depend on the protein database used.
$ cd /usr/local/share/data/provean/examples
$ provean.sh -q P04637.fasta -v P04637.var --save_supporting_set P04637.sss
## PROVEAN v1.1 output ##
# Query sequence file: P04637.fasta
# Variation file: P04637.var
# Protein database: /usr/local/projects/SIFT/ychoi/provean_genome/provean_result/nr_db/nr
[13:50:20] searching related sequences...
[14:02:56] clustering subject sequences...
[14:02:57] supporting sequence set was saved at P04637.sss
[14:02:57] supporting sequences were saved in FASTA format at P04637.sss.fasta
# Number of clusters: 30
# Number of supporting sequences used: 413
[14:02:57] computing delta alignment scores...
## PROVEAN scores ##
# VARIATION SCORE
P72R -0.461
G105C -8.119
K370del -2.201
H178_H179insPHP -10.945
L22_W23delinsQS -10.392
In the above case, the '--save_supporting_set' option stores the
information on the supporting sequence set in a file. You can pass
this file with the '--supporting_set' option so that PROVEAN skips the
BLAST search and clustering steps to save time, as shown below.
$ provean.sh -q P04637.fasta -v P04637.var --supporting_set P04637.sss
## PROVEAN v1.1 output ##
# Query sequence file: P04637.fasta
# Variation file: P04637.var
# Protein database: /usr/local/projects/SIFT/ychoi/provean_genome/provean_result/nr_db/nr
# Number of clusters: 30
# Number of supporting sequences used: 413
[14:05:40] computing delta alignment scores...
## PROVEAN scores ##
# VARIATION SCORE
P72R -0.461
G105C -8.119
K370del -2.201
H178_H179insPHP -10.945
L22_W23delinsQS -10.392
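
When several proteins are scored repeatedly, the choice between the two
options can be scripted. The sketch below is a dry run that only prints
the command lines. It assumes a hypothetical file layout of NAME.fasta,
NAME.var and NAME.sss in one directory; provean_cmd is an illustrative
helper, not part of PROVEAN. Only the options documented above are used.

```shell
#!/bin/sh
# Print the provean.sh command line for one query file.
# If NAME.sss already exists it is reused with --supporting_set;
# otherwise the command saves a new one with --save_supporting_set.
provean_cmd() {
    base=${1%.fasta}
    if [ -f "$base.sss" ]; then
        echo "provean.sh -q $base.fasta -v $base.var --supporting_set $base.sss"
    else
        echo "provean.sh -q $base.fasta -v $base.var --save_supporting_set $base.sss"
    fi
}

# Dry run: show the command for every FASTA file in the current directory.
for query in ./*.fasta; do
    [ -f "$query" ] || continue    # no matches: the glob stays literal
    provean_cmd "$query"
done
```

Printing the commands first makes it easy to review them before running
anything, since the BLAST search for a new query can take several minutes.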
3. Interpreting PROVEAN scores:
We suggest a cutoff of -2.5 for the PROVEAN score when using the NCBI nr
protein database released in August 2011. That is, consider a score
higher than -2.5 to be neutral (tolerated) and a score lower than or
equal to -2.5 to be deleterious (damaging). The PROVEAN scores and the
optimal cutoff may vary slightly with different versions of the nr
database because the scores are computed from the homologs in the
database. More detailed information on PROVEAN scores can be found at
http://provean.jcvi.org/about.php
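
The cutoff rule can be applied to PROVEAN output with a short awk filter.
This is a minimal sketch; classify_scores is a hypothetical helper, and it
assumes the score lines have the two-column "VARIATION SCORE" layout shown
in the examples above. The cutoff defaults to -2.5 and can be overridden
with the first argument.

```shell
#!/bin/sh
# Read PROVEAN output on stdin and label each variation.
# A score lower than or equal to the cutoff is "deleterious",
# a higher score is "neutral".
classify_scores() {
    awk -v cutoff="${1:--2.5}" '
        /^#/ || /^\[/ || NF < 2 { next }   # skip headers and progress lines
        {
            label = ($2 + 0 <= cutoff + 0) ? "deleterious" : "neutral"
            print $1, $2, label
        }
    '
}

# Example with the scores from the test run above.
classify_scores <<'EOF'
## PROVEAN scores ##
# VARIATION SCORE
P72R -0.461
G105C -8.119
K370del -2.201
H178_H179insPHP -10.945
L22_W23delinsQS -10.392
EOF
```

With the default cutoff, P72R and K370del are labeled neutral and the
other three variations deleterious. To use a different cutoff for another
nr release, pass it as an argument, e.g. "classify_scores -3.0".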
Yongwook Choi
ychoi@jcvi.org