  $Id: README.scripts 1258 2014-05-02 13:27:07Z wrp $
  $Revision: 1258 $

May 13, 2013

This directory contains a variety of scripts that work with the two
scripting options for the FASTA programs:

-e expand_script.pl

and

-V \!annotate_script.pl

All these scripts use a MySQL database that provides a mapping between
the sequence identifiers provided by the sequence library and the
additional sequences (-e) or annotations (-V) that are generated.
Thus, the scripts will NOT work as written because they reference
MySQL databases that are only available inside the University of
Virginia.

Scripts for sequence library expansion:

expand_uniref50.pl -- produces new sequences from the mapping of
   Uniref50 to Uniprot.

expand_links.pl -- produces new sequences from a custom-built database
  of protein links
links2sql.pl -- builds the file that maps protein accessions to linked
  accessions

exp_up_ensg.pl -- (human sequences only) use the ENSEMBL to Uniprot
  mapping to extract alternative splice isoforms.

The expansion scripts expect the name of a file that contains lines of
the form:

sp|P09488|GSTM1<tab>1e-100
...

The file is then opened, read, and the accessions extracted and used
to find the linked sequences.
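
As a rough illustration of that step, the sketch below is written in
Python rather than the Perl used by the scripts themselves. The
function name read_accessions is hypothetical, and it assumes that the
accession is the second '|'-separated field of the identifier and that
the value after the tab is an E-value; neither is taken from the
scripts.

```python
import io


def read_accessions(handle):
    """Yield (accession, evalue) pairs from lines like 'sp|P09488|GSTM1<tab>1e-100'."""
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split the identifier from the value that follows the first tab.
        seq_id, _, evalue = line.partition("\t")
        fields = seq_id.split("|")
        # Assumption: identifiers look like 'db|accession|name';
        # if there is no '|', fall back to the whole identifier.
        acc = fields[1] if len(fields) >= 2 else seq_id
        yield acc, (float(evalue) if evalue.strip() else None)


if __name__ == "__main__":
    demo = io.StringIO("sp|P09488|GSTM1\t1e-100\n")
    for acc, ev in read_accessions(demo):
        print(acc, ev)
```

The accessions returned this way would then be the keys for whatever
lookup finds the linked sequences.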

================

Scripts for library annotation:

The annotation scripts are very similar to the expansion scripts, but
have the option of either (1) taking the name of a file of sequence
annotations, e.g.

  annot_script.pl annot_file

or (2) taking an argument that is a single sequence identifier:

  annot_script.pl 'sp|P09488|GSTM1_HUMAN'  ('|' must be escaped or quoted for many shells)
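
One way a script could tell the two forms apart is sketched below, in
Python rather than Perl. The rule used here (an argument containing
'|' is a single identifier, anything else is a file name), the
one-identifier-per-line file layout, and the function name
ids_from_argument are all assumptions made for illustration; the
actual scripts may decide differently.

```python
import shlex


def ids_from_argument(arg):
    """Return the sequence identifiers named by one command-line argument."""
    if "|" in arg:
        # Assumed rule: a '|' marks a single identifier,
        # e.g. 'sp|P09488|GSTM1_HUMAN'.
        return [arg]
    # Otherwise treat the argument as a file name. Assumed layout: one
    # identifier per line, with anything after the first tab ignored.
    with open(arg) as handle:
        return [line.split("\t")[0].strip() for line in handle if line.strip()]


if __name__ == "__main__":
    seq_id = "sp|P09488|GSTM1_HUMAN"
    print(ids_from_argument(seq_id))
    # Quoting keeps the shell from treating '|' as a pipe.
    print("annot_script.pl", shlex.quote(seq_id))
```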

The following annotation scripts are available:

  ann_feats2l.pl -- get features and domains from local Uniprot database
  ann_feats2ipr.pl -- get features from Uniprot and domains from a local Uniprot/Interpro database

  ann_pfam.pl -- get domains (only) from local copy of Pfam SQL database
  ann_pfam_www.pl -- get domains (only) from Pfam web services

  ann_feats_up_www.pl -- get features/domains from Uniprot gff3 server (http://www.uniprot.org/uniprot/P09488.gff)

The Uniprot gff service does not provide information on the actual sequence changes associated with mutants and variants:
P09488	UniProtKB	Natural variant	210	210	.	.	.	ID=VAR_014497;Dbxref=dbSNP:rs449856	
P09488	UniProtKB	Mutagenesis	7	7	.	.	.	Note=Reduces catalytic activity 100-fold.	
However, the Uniprot XML service does provide this information, so a second script is available:

  ann_feats_up_www2.pl -- get features/domains from EMBL/EBI XSLT conversion of Uniprot XML:

 http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb/P09488/gff2

This resource provides substitution information, as well as links to references:

P09488	UniProtKB	natural_variant_site	210	210	.	.	.	Note "S -> T" ; Note "UniProtKB FT ID: VAR_014497" ; Note "dbSNP:rs449856" ; Link "http://www.ensembl.org/Homo_sapiens/Variation/Explore?v=rs449856" ; Link "http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?type=rs&rs=449856"
P09488	UniProtKB	mutated_variant_site	7	7	.	.	.	Note "Y -> F" ; Note "Reduces catalytic activity 100- fold" ; Link "http://www.ncbi.nlm.nih.gov/pubmed/16548513"
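
To show how the substitution can be recovered from such a line, here
is a small Python sketch (not the Perl code in ann_feats_up_www2.pl).
The attribute layout (key "value" pairs separated by ' ; ') is
inferred only from the two example lines above, and the names
parse_gff2_line, ATTR_RE and SUBST_RE are hypothetical.

```python
import re

# Matches key "value" pairs such as: Note "S -> T"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')
# Matches a residue substitution such as: S -> T
SUBST_RE = re.compile(r'^([A-Z]+) -> ([A-Z]+)$')


def parse_gff2_line(line):
    """Split one tab-delimited gff2 line into a dict of the fields used here."""
    cols = line.rstrip("\n").split("\t")
    acc, _source, feature, start, end = cols[:5]
    # The attributes are in the ninth column, when present.
    attrs = ATTR_RE.findall(cols[8]) if len(cols) > 8 else []
    subst = None
    for key, value in attrs:
        match = SUBST_RE.match(value)
        if key == "Note" and match:
            # First Note of the form 'X -> Y' is taken as the substitution.
            subst = (match.group(1), match.group(2))
            break
    return {
        "acc": acc,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "subst": subst,
        "links": [value for key, value in attrs if key == "Link"],
    }


if __name__ == "__main__":
    line = ('P09488\tUniProtKB\tmutated_variant_site\t7\t7\t.\t.\t.\t'
            'Note "Y -> F" ; Note "Reduces catalytic activity 100- fold" ; '
            'Link "http://www.ncbi.nlm.nih.gov/pubmed/16548513"')
    print(parse_gff2_line(line))
```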

All the annotation scripts offer -h and --help options.

================