22-Jan-2014
13-Apr-2016 updated
22-Feb-2019 updated
fasta36/scripts
Perl scripts for annotating sequences and expanding libraries
-- Sequence generation (January, February, 2019)
The FASTA programs can now use sequences that are downloaded from
Uniprot or NCBI/RefSeq (or otherwise provided by a program script that
produces FASTA sequences from an identifier) by specifying the name of
the script, the accession(s), and library type 9, e.g.
fasta36 \!../scripts/get_protein.py+P09488 /seqlib/swissprot.fasta
Scripts are available for downloading protein sequences from Uniprot
or RefSeq (get_protein.py) or from Uniprot alone (get_uniprot.py), and
for downloading either protein or mRNA sequences from RefSeq
(get_refseq.py).
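To illustrate what such a sequence-producing script has to do, here is
a minimal sketch in Python. It is not the code of get_protein.py: the
function names and the in-memory TOY_DB table are hypothetical
stand-ins for the download step, and the accession and sequence are
invented for the example. The only requirement taken from the text
above is that the script writes FASTA-format sequences for the
identifiers it is given.

```python
import sys

# Hypothetical stand-in for a download from Uniprot or RefSeq:
# accession -> (description, sequence).  The entry is invented.
TOY_DB = {
    "X00001": ("toy protein 1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
}

def fasta_record(acc, desc, seq, width=60):
    """Format one sequence as FASTA text, wrapping residues at `width`."""
    lines = [">%s %s" % (acc, desc)]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

def records_for(accessions, db=TOY_DB):
    """Return FASTA text for every accession found in `db`; skip the rest."""
    return "".join(fasta_record(acc, *db[acc]) for acc in accessions if acc in db)

if __name__ == "__main__":
    # Accessions come from the command line; FASTA text goes to stdout.
    sys.stdout.write(records_for(sys.argv[1:]))
```

A real script would replace the TOY_DB lookup with a request to the
sequence provider; the output side (a ">" header line followed by
wrapped residues on stdout) is the part the FASTA programs read.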
scripts/get_protein.py get Refseq or Uniprot proteins
scripts/get_refseq.py get RefSeq proteins or mRNAs
scripts/get_up_prot_iso_sql.py get a protein and its isoforms using a mysql database
scripts/get_genome_seq.py get human (hg38) or mouse (mm10, with --genome mm10) genome sequences using bedtools, e.g. "get_genome_seq.py chr1:123456-126543"
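The region argument shown for get_genome_seq.py has the form
chrom:start-end. As a hedged sketch (not the script's own code), the
Python below parses such a string and converts it to a BED interval,
the interval format that bedtools reads. It assumes the region
coordinates are 1-based and inclusive, as in genome browsers; the
README does not state this. The function names are hypothetical.

```python
import re

def parse_region(region):
    """Split 'chr1:123456-126543' into (chrom, start, end), 1-based inclusive."""
    m = re.fullmatch(r"([^:\s]+):(\d+)-(\d+)", region)
    if m is None:
        raise ValueError("expected chrom:start-end, got %r" % region)
    chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    if start < 1 or end < start:
        raise ValueError("bad coordinates in %r" % region)
    return chrom, start, end

def region_to_bed(region):
    """Return a BED line: 0-based start, end-exclusive (assumes 1-based input)."""
    chrom, start, end = parse_region(region)
    return "%s\t%d\t%d" % (chrom, start - 1, end)

print(region_to_bed("chr1:123456-126543"))
```

Only the start changes in the conversion, because a 1-based inclusive
end and a 0-based exclusive end are the same number.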
-- Sequence alignment scoring/annotation
Two program scripts -- annot_blast_btop2.pl and blastp_cmd.sh -- have
been added to support sub-alignment scoring of BLASTP alignments.
annot_blast_btop2.pl takes three inputs: (1) a query sequence file, (2)
a domain annotation script (see below), and (3) a BLAST tabular format
output file with two additional fields, "score" and "btop":
annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl blast_tab_btop_file
The blast_tab_btop_file can be produced using the blastp_cmd.sh shell
script, which uses ASN.1 output and blast_formatter to produce both a
standard alignment file and the modified blast tabular btop file.
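For readers who want to inspect such a file, the following Python
sketch parses one line of it. It assumes the twelve standard BLAST
tabular columns followed by "score" and "btop", in that order; the
column order of the two additional fields is an assumption, the
function name is hypothetical, and the sample line is invented.

```python
# Twelve standard BLAST tabular columns, then the two additional fields.
BLAST_TAB_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "score", "btop",  # assumed order of the additional fields
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend",
              "sstart", "send", "score"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}

def parse_hit(line):
    """Turn one tab-separated hit line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST_TAB_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(BLAST_TAB_FIELDS), len(values)))
    hit = dict(zip(BLAST_TAB_FIELDS, values))
    for name in INT_FIELDS:
        hit[name] = int(hit[name])
    for name in FLOAT_FIELDS:
        hit[name] = float(hit[name])
    return hit

# Invented sample line, for illustration only.
sample = "q1\ts1\t87.50\t8\t1\t0\t1\t8\t3\t10\t1e-05\t20.4\t42\t3AG4\n"
hit = parse_hit(sample)
print(hit["sseqid"], hit["score"], hit["btop"])
```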
-- Implied Multiple sequence alignment --
As part of a strategy to improve PSSM-based similarity searching, two
scripts use a BTOP-encoded alignment string from either -m 8CB or
-m 9B output files to produce a Clustal-like multiple sequence
alignment (MSA). The MSA can be used as input to psiblast to produce an
ASN.1 text file, which can be converted with datatool to ASN.1 binary
and then read by ssearch36 -P "file.asn1 2". We used the BTOP
encoding, rather than the more common CIGAR string (-m 9C) or the
older alignment encoding (-m 9c), because the BTOP encoding only
requires the query sequence to reproduce both the query and subject
aligned residues. Thus:
m8_btop_msa.pl --query gstt1_drome.aa gstt1_sp.bl_btop > gstt1_sp.ss_msa
where "gstt1_sp.bl_btop" is "-m 8CB" output, produces:
====
SSEARCHm8 multiple sequence alignment
sp|P20432|GSTT1_DROME MVDFYYLPGSSPCRSVIMTAKAVGVELNKKLLNLQAGEHLKPEFLKINPQHTIPTLVDNG
sp|P04907|GSTF3_MAIZE ---LYGMPLSPNVVRVATVLNEKGLDFEIVPVDLTTGAHKQPDFLALNPFGQIPALVDGD
sp|P12653|GSTF1_MAIZE -------------------------------INFATAEHKSPEHLVRNPFGQVPALQDGD
sp|P0ACA5|SSPA_ECO57 ------------------------------------------DLIDLNPNQSVPTLVDRE
sp|P00502|GSTA1_RAT VLHYFNARGRMECIRWLLA--AAGVEFDEKFI--QSPEDL--EKLKKDGNDQVPMVEIDG
...
====
which can be used with psiblast -in_msa gstt1_sp.ss_msa.
m9B_btop_msa.pl does the same for "-m 9B" output.
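The reason BTOP needs only the query sequence can be sketched in
Python. This is not the code of m8_btop_msa.pl or m9B_btop_msa.pl. It
assumes the BLAST BTOP convention: a number is a run of identical
residues, and a pair of characters is a query residue followed by a
subject residue, with '-' marking a gap. The function names, the toy
query, and the toy BTOP strings are invented. The msa_row function
shows one plausible way to place a subject on query coordinates, as in
a query-anchored MSA: subject residues aligned to a gap in the query
are dropped, and query positions outside the alignment become '-'.

```python
import re

def decode_btop(query_seq, q_start, btop):
    """Rebuild (aligned_query, aligned_subject) from a BTOP string.

    q_start is the 1-based query position where the alignment begins.
    """
    q_aln, s_aln = [], []
    pos = q_start - 1
    for token in re.findall(r"\d+|[A-Za-z*\-]{2}", btop):
        if token.isdigit():
            # Run of identities: the subject equals the query here.
            n = int(token)
            segment = query_seq[pos:pos + n]
            q_aln.append(segment)
            s_aln.append(segment)
            pos += n
        else:
            # Mismatch or gap: first character is query, second is subject.
            q_res, s_res = token[0], token[1]
            q_aln.append(q_res)
            s_aln.append(s_res)
            if q_res != "-":
                pos += 1
    return "".join(q_aln), "".join(s_aln)

def msa_row(query_seq, q_start, btop):
    """Project the subject onto query coordinates (one column per query residue)."""
    q_aln, s_aln = decode_btop(query_seq, q_start, btop)
    row = ["-"] * len(query_seq)
    pos = q_start - 1
    for q_res, s_res in zip(q_aln, s_aln):
        if q_res == "-":
            continue  # insertion relative to the query: no column for it
        row[pos] = s_res
        pos += 1
    return "".join(row)

query = "MKTAYIAK"                       # invented query
print(decode_btop(query, 1, "3AG2-C2"))  # aligned query and subject
print(msa_row(query, 3, "2Y-2"))         # subject row on query coordinates
```

Because identities are stored only as counts, the subject residues in
those runs are read directly from the query; the subject sequence
itself is never needed.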
-- Domain annotation --
(Nov. 2015) These domain annotation scripts allow overlapping domains,
and must be used with versions of the FASTA programs that support the
current "start - stop domain_description" format (in contrast to the
older format which put domain starts and stops on separate lines with
'[' and ']'). Until this release, the "overlapping" domain scripts had
'_e' in their name, e.g. ann_pfam28_e.pl. The "_e" scripts have been
renamed, losing the '_e', and the old non-'_e' scripts have been
removed from the distribution.
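As a small illustration of the one-line-per-domain layout, the Python
sketch below writes domains as "start - stop domain_description"
lines. The tab separator, the function names, and the two example
domains are assumptions made for the example, not taken from the
annotation scripts. Because each line carries both boundaries, two
domains that overlap can be listed without any ambiguity about which
start belongs to which stop.

```python
def domain_lines(domains):
    """Format (start, stop, description) tuples, sorted by start position."""
    out = []
    for start, stop, desc in sorted(domains):
        if stop < start:
            raise ValueError("stop before start for %r" % desc)
        # Assumed layout: start, '-', stop, description, separated by tabs.
        out.append("%d\t-\t%d\t%s" % (start, stop, desc))
    return out

def overlaps(a, b):
    """True if two (start, stop, ...) intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]

# Invented example: two overlapping domains on one protein.
example = [(60, 150, "domain_B"), (5, 80, "domain_A")]
for line in domain_lines(example):
    print(line)
print(overlaps(example[0], example[1]))
```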
All of the "ann_*.pl" scripts are used to annotate query or library
sequences using the -V option. See ../test/test2V.pl for examples.
ann_feats2ipr.pl -- generate Uniprot sites, Interpro domains, from a mySQL database
ann_feats2l.pl -- generate Uniprot sites, domains from a mySQL database
ann_feats_up_www2.pl -- generate Uniprot sites, domains from an EBI
web server that converts Uniprot DAS to gff3.
ann_feats_up_www.pl -- generate Uniprot sites, domains from a Uniprot
gff web server (less information than ann_feats_up_www2.pl)
ann_ipr_www.pl -- Interpro domains from Interpro WWW site.
ann_pdb_cath.pl -- generate CATH domains using PDB accessions from a mySQL database
ann_pdb_vast.pl -- use VAST domains, but domain names are not informative
ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database
(Pfam28, no auto_pfamA, auto_pfamseq)
ann_pfam_www.pl -- use Pfam Website, and XML::Twig, to get Pfam domain info.
ann_exons_up_www.pl -- generate exon boundaries on Uniprot proteins
using the EBI/Proteins/API/coordinate service
ann_exons_up_sql_www.pl -- generate exon boundaries on Uniprot
proteins using an SQL database (if available) or the EBI/Proteins
coordinate service. The SQL results are dramatically faster.
ann_exons_ncbi.pl -- generate exon boundaries on NCBI refseq proteins.
-- Library expansion
expand_up_isoforms.pl -- for Uniprot reference proteomes, provide
isoforms for each canonical sequence.
expand_uniref50.pl -- allows search of uniref50 to be expanded
expand_links.pl -- script to take hits from a smaller library and
expand to complete library
links2sql.pl -- create links for expand_links.pl
exp_up_ensg.pl -- expand uniprot sequences to include Ensembl splice variants
-- Plot local alignments (.lav files)
lav2plt.pl -- used to produce postscript or svg plots of "lalign36 -m 11" lav output files
color_defs.pl -- used by lav2plt.pl to produce domain colors
lavplt_ps.pl -- used by lav2plt.pl --dev ps
lavplt_svg.pl -- used by lav2plt.pl --dev svg