22-Jan-2014
13-Apr-2016 updated
22-Feb-2019 updated

fasta36/scripts

Perl scripts for annotating sequences and expanding libraries

-- Sequence generation (January, February, 2019)

The FASTA programs can now use sequences that are downloaded from
Uniprot or NCBI/RefSeq (or otherwise provided by a program script that
produces FASTA sequences from an identifier) by specifying the name of
the script, the accession(s), and library type 9, e.g.

     fasta36 \!../scripts/get_protein.py+P09488 /seqlib/swissprot.fasta

Scripts are available for downloading protein sequences from either
Uniprot or RefSeq (get_protein.py), protein sequences from Uniprot
(get_uniprot.py), and either protein or mRNA sequences from RefSeq
(get_refseq.py).

   scripts/get_protein.py    get RefSeq or Uniprot proteins
   scripts/get_refseq.py     get RefSeq proteins or mRNAs
   scripts/get_up_prot_iso_sql.py  get a protein and its isoforms using a mysql database
   scripts/get_genome_seq.py   get human (hg38) or mouse (mm10, with --genome mm10)
                               genome sequences using bedtools, e.g.
                               "get_genome_seq.py chr1:123456-126543"
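
A script "otherwise provided" for this purpose only has to write FASTA
format sequences to standard output for the identifier(s) it is given.
The sketch below illustrates that contract; it is not one of the
distributed scripts.  It assumes the accession(s) arrive as
command-line arguments, and the SEQUENCES table and its entry are
made-up stand-ins for a real Uniprot, RefSeq, or database lookup.

```python
#!/usr/bin/env python3
"""Hypothetical accession-to-FASTA script (illustration only)."""
import sys

# Made-up local table; a real script would query a server or database.
SEQUENCES = {
    "EXAMPLE1": ("made-up example protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
}


def fasta_record(acc, desc, seq, width=60):
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">{} {}\n{}\n".format(acc, desc, "\n".join(chunks))


def main(argv):
    # Assumption: each argument after the script name is one accession.
    for acc in argv[1:]:
        if acc in SEQUENCES:
            desc, seq = SEQUENCES[acc]
            sys.stdout.write(fasta_record(acc, desc, seq))
        else:
            print("{}: not found".format(acc), file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv)
```

Writing errors to stderr keeps stdout limited to FASTA records, so
nothing but sequence data reaches the program reading the output.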

-- Sequence alignment scoring/annotation

Two program scripts -- annot_blast_btop2.pl and blastp_cmd.sh -- have
been added to support sub-alignment scoring of BLASTP alignments.

annot_blast_btop2.pl takes three inputs: (1) a query sequence file; (2)
a domain annotation script (see below); and (3) a BLAST tabular format
output file with two additional fields, "score" and "btop":

annot_blast_btop2.pl --query query.file --ann_script ann_pfam_www.pl blast_tab_btop_file

The blast_tab_btop_file can be produced using the blastp_cmd.sh shell
script, which uses ASN.1 output and blast_formatter to produce both a
standard alignment file and the modified blast tabular btop file.
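
To make the layout of the modified tabular file concrete, here is a
minimal Python sketch that splits one line into named fields.  It
assumes the twelve standard BLAST tabular columns come first, followed
by "score" and "btop" in that order, and that "score" is an integer
raw score; check the column order against blastp_cmd.sh before
relying on it.  The function name is hypothetical.

```python
# Assumed column order: the 12 standard BLAST tabular fields, then score, btop.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "score", "btop",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend",
              "sstart", "send", "score"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_blast_tab_btop(line):
    """Parse one tab-separated line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected {} fields, found {}".format(len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    for name in FLOAT_FIELDS:
        record[name] = float(record[name])
    return record


# Made-up line, for illustration only.
example = "query1\tsubject1\t25.0\t31\t1\t0\t1\t31\t5\t35\t1e-05\t40.0\t92\t10AG20"
hit = parse_blast_tab_btop(example)
print(hit["sseqid"], hit["score"], hit["btop"])
```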

-- Implied Multiple sequence alignment --

As part of a strategy to improve PSSM-based similarity searching, two
scripts use a BTOP-encoded alignment string from either -m 8CB or
-m 9B output files to produce a Clustal-like multiple sequence
alignment (MSA).  The MSA can be used as input to psiblast to produce
an ASN.1 text file, which can be converted with datatool to ASN.1
binary and then read by ssearch36 -P "file.asn1 2".  We used the BTOP
encoding, rather than the more common CIGAR string (-m 9C) or the
older alignment encoding (-m 9c), because the BTOP encoding only
requires the query sequence to reproduce both the query and subject
aligned residues.  Thus:

m8_btop_msa.pl --query gstt1_drome.aa gstt1_sp.bl_btop > gstt1_sp.ss_msa

where "gstt1_sp.bl_btop" is "-m 8CB" output, produces:
  ====
  SSEARCHm8 multiple sequence alignment


  sp|P20432|GSTT1_DROME     MVDFYYLPGSSPCRSVIMTAKAVGVELNKKLLNLQAGEHLKPEFLKINPQHTIPTLVDNG
  sp|P04907|GSTF3_MAIZE     ---LYGMPLSPNVVRVATVLNEKGLDFEIVPVDLTTGAHKQPDFLALNPFGQIPALVDGD
  sp|P12653|GSTF1_MAIZE     -------------------------------INFATAEHKSPEHLVRNPFGQVPALQDGD
  sp|P0ACA5|SSPA_ECO57      ------------------------------------------DLIDLNPNQSVPTLVDRE
  sp|P00502|GSTA1_RAT       VLHYFNARGRMECIRWLLA--AAGVEFDEKFI--QSPEDL--EKLKKDGNDQVPMVEIDG


  ...
  ====

which can be used with psiblast -in_msa gstt1_sp.ss_msa.
m9B_btop_msa.pl does the same for "-m 9B" output.
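
The reason BTOP needs only the query sequence can be sketched in a few
lines of Python.  The sketch assumes the BLAST BTOP convention: a
number is a run of identical residues, and a two-character pair gives
the query residue followed by the subject residue, with '-' marking a
gap.  The function names are hypothetical, and dropping subject
residues that align to query gaps is one simple way to build a
query-anchored MSA row; the distributed scripts may handle insertions
differently.

```python
import re


def decode_btop(query, btop, q_start=1):
    """Rebuild the aligned query and subject strings from a BTOP string.

    `q_start` is the 1-based query position where the alignment begins.
    """
    q_aln, s_aln = [], []
    pos = q_start - 1  # 0-based index into the query
    # A token is either a run of digits or a two-character residue pair.
    for token in re.findall(r"\d+|[A-Za-z*\-]{2}", btop):
        if token.isdigit():
            run = query[pos:pos + int(token)]  # identical in both sequences
            q_aln.append(run)
            s_aln.append(run)
            pos += int(token)
        else:
            q_res, s_res = token[0], token[1]
            q_aln.append(q_res)
            s_aln.append(s_res)
            if q_res != "-":  # a query gap consumes no query residue
                pos += 1
    return "".join(q_aln), "".join(s_aln)


def project_onto_query(q_aln, s_aln, q_start, q_len):
    """Place subject residues in query coordinates, padding with '-'."""
    row = ["-"] * q_len
    pos = q_start - 1
    for q_res, s_res in zip(q_aln, s_aln):
        if q_res == "-":
            continue  # assumption: insertions relative to the query are dropped
        row[pos] = s_res
        pos += 1
    return "".join(row)


q, s = decode_btop("MVDFYYLPGS", "3FL2-ALS2")
print(q)
print(s)
print(project_onto_query(q, s, 1, 10))
```

Each row produced by project_onto_query has the same length as the
query, which is what allows rows from independent pairwise alignments
to be stacked into a single alignment.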

-- Domain annotation --

(Nov. 2015) These domain annotation scripts allow overlapping domains,
and must be used with versions of the FASTA programs that support the
current "start - stop domain_description" format (in contrast to the
older format which put domain starts and stops on separate lines with
'[' and ']'). Until this release, the "overlapping" domain scripts had
'_e' in their name, e.g. ann_pfam28_e.pl.  The "_e" scripts have been
renamed, losing the '_e', and the old non-'_e' scripts have been
removed from the distribution.
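
As an illustration of the "start - stop domain_description" layout,
the Python sketch below writes one line per domain, so two domains can
overlap without any bracket pairing.  The ">accession" header line and
the tab delimiters are assumptions, the function name is hypothetical,
and the domain names are made up; compare with the output of one of
the ann_*.pl scripts before relying on the details.

```python
def format_domains(acc, domains):
    """Format (start, stop, description) tuples, one domain per line.

    Assumed layout: a '>accession' header, then tab-separated
    'start - stop description' lines sorted by start position.
    """
    lines = [">{}".format(acc)]
    for start, stop, desc in sorted(domains):
        if start > stop:
            raise ValueError("start {} is after stop {}".format(start, stop))
        lines.append("{}\t-\t{}\t{}".format(start, stop, desc))
    return "\n".join(lines) + "\n"


# Made-up, overlapping domains on a made-up accession.
print(format_domains("EXAMPLE1", [(60, 150, "domB"), (5, 80, "domA")]), end="")
```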

All of the "ann_*.pl" scripts are used to annotate query or library
sequences using the -V option.  See ../test/test2V.pl for examples.


ann_feats2ipr.pl -- generate Uniprot sites, Interpro domains, from a mySQL database
ann_feats2l.pl -- generate Uniprot sites, domains from a mySQL database

ann_feats_up_www2.pl -- generate Uniprot sites, domains from an EBI
		     	web server that converts Uniprot DAS to gff3.

ann_feats_up_www.pl -- generate Uniprot sites, domains from a Uniprot
		       gff web server (less information than ann_feats_up_www2.pl)

ann_ipr_www.pl -- Interpro domains from Interpro WWW site.

ann_pdb_cath.pl -- generate CATH domains using PDB accessions from a mySQL database
ann_pdb_vast.pl -- use VAST domains, but domain names are not informative

ann_pfam28.pl -- generate Pfam domains using local Pfam mySQL database
  (Pfam28, no auto_pfamA, auto_pfamseq)

ann_pfam_www.pl -- use Pfam Website, and XML::Twig, to get Pfam domain info.

ann_exons_up_www.pl -- generate exon boundaries on Uniprot proteins
  using the EBI/Proteins/API/coordinate service

ann_exons_up_sql_www.pl -- generate exon boundaries on Uniprot
  proteins using an SQL database (if available) or the EBI/Proteins
  coordinate service.  The SQL results are dramatically faster.

ann_exons_ncbi.pl -- generate exon boundaries on NCBI refseq proteins.

-- Library expansion

expand_up_isoforms.pl -- for Uniprot reference proteomes, provide
  isoforms for each canonical sequence.

expand_uniref50.pl -- allows search of uniref50 to be expanded

expand_links.pl -- script to take hits from a smaller library and
  expand to complete library

links2sql.pl -- create links for expand_links.pl

exp_up_ensg.pl -- expand uniprot sequences to include Ensembl splice variants

-- Plot local alignments (.lav files)

lav2plt.pl    -- used to produce PostScript or SVG plots of "lalign36 -m 11" lav output files
color_defs.pl -- used by lav2plt.pl to produce domain colors
lavplt_ps.pl  -- used by lav2plt.pl --dev ps
lavplt_svg.pl -- used by lav2plt.pl --dev svg