
=head1 SYNOPSIS

  GMODTools is a collection of perl scripts and modules for use with 
  GMOD Chado genome databases, primarily at this point for loading and 
  extracting sequences and annotations.

    bulkfiles.pl is the command-line program for Bio::GMOD::Bulkfiles.
  This program generates bulk genome annotation files from a Chado genome
  database, including Fasta, GFF, DNA, Blast indices,
  a standard genome webpage, and Chado database overview tables. 
  
    gff2biomart.pl creates tables for BioMart from genome GFF annotations.
  It generates table .sql, .txt and .xml suited to BioMart. With such
  a BioMart dataset you can select genome regions with the available
  annotations, and exclude others, and download feature tables or
  sequences of the selection set. For instance, select the regions
  with Mosquito gene homologs, but no D. melanogaster gene homologs.
  Or select regions with gene predictions but no known homolog.

  Other sequence functions are in the library module
  GMOD::Chado::SeqUtils, which holds common methods for these chado seq
  scripts. The main programs are in the bin/ folder, as summarized below.
  
  ======================================================================  
  
=head1 NAME

  bulkfiles.pl -- command-line program for Bio::GMOD::Bulkfiles
  
=head1 SYNOPSIS

  This program generates bulk genome annotation files from a Chado genome
  database, including Fasta, GFF, DNA, Blast indices,
  a standard genome webpage, and Chado database overview tables. 
  
  # get bulkfiles software
  curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip
  unzip GMODTools-1.0.zip
  # -- OR from CVS repository --
  cvs -d :pserver:anonymous@cvs.sourceforge.net:/cvsroot/gmod \
    co -d GMODTools schema/GMODTools 

  # load a genome chado db into a Postgres database
  curl -O http://sgdlite.princeton.edu/download/sgdlite/sgdlite.sql.gz
  createdb sgdlite
  (gunzip -c sgdlite.sql.gz | psql -d sgdlite -f - ) >& log.load 

  # extract bulk files from database
  cd GMODTools 
  perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make 

  To create a new genome database release configuration see
  conf/bulkfiles/bulkfiles_template.xml
  
  For example output, see http://insects.eugenes.org/genome/

  ======================================================================  

=head1 NAME

  Bio::GMOD::Bulkfiles -- produce bulk sequence and feature files
    from Chado genome database for public distribution.

=head1 ABOUT Bulkfiles

  This generates the Fasta, GFF, DNA and other bulk genome annotation files at
    ftp://flybase.net/genomes/Drosophila_melanogaster/current/ ..
    (and other species).
  It works with several FlyBase chado dbs, and with the SGDLite chado db.

  Bulkfiles is mostly self-contained, but uses a few
  BioPerl parts plus XML::Simple for configuration files.  All of
  the organism/database-specific logic should be in these configuration
  files (see GMODTools/conf/bulkfiles/)
    fbbulk-r4.xml, sgdbulk1.xml .. -- organism/database/release specific options
    chadofeatsql.xml  --  chado db sql calls to dump features
    chadofeatconv.xml --  feature conversion options

=head1 UPDATES  2008-June, version 1.2b
  
  - Rationalized -strand, revcomp usage for CDS-Exons.
    Bioperl version changes (in Location/Split and Seq/LargePrimarySeq)
    caused problems here. Bulkfiles now handles -strand, multi-exon CDS
    directly, which should make it independent of Bioperl version changes.
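
  The direct handling of -strand, multi-exon CDS described above can be
  illustrated with a minimal sketch. This is not the Bulkfiles code: the
  function names revcomp and spliced_seq, and the 1-based inclusive exon
  coordinates, are assumptions made for the example. Exons are spliced in
  forward-strand order and the joined sequence is reverse-complemented once.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (A, C, G, T, N only; hypothetical helper).
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTNacgtn/TGCANtgcan/;
    return $rc;
}

# Splice exons out of a chromosome string. Exons are [start, stop] pairs,
# 1-based and inclusive, given in forward-strand coordinates (an assumption).
# For strand -1 the joined sequence is reverse-complemented once at the end.
sub spliced_seq {
    my ($chr_dna, $strand, @exons) = @_;
    my $seq = '';
    for my $ex (sort { $a->[0] <=> $b->[0] } @exons) {
        my ($start, $stop) = @$ex;
        $seq .= substr($chr_dna, $start - 1, $stop - $start + 1);
    }
    return $strand < 0 ? revcomp($seq) : $seq;
}

my $chr = 'AAACCCGGGTTTACGT';
print spliced_seq($chr,  1, [4, 6],   [10, 12]), "\n";   # prints CCCTTT
print spliced_seq($chr, -1, [10, 12], [4, 6]),   "\n";   # prints AAAGGG
```

  Sorting by start and reverse-complementing last means the exon order given
  by the caller does not matter, which is one way to avoid depending on how
  a library orders split locations.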
  
=head1 UPDATES  2008-May, version 1.2

  - adding (in progress) Genbank Submission table writer, 
    'bulkfiles -format=genbanktbl', with output suited
    to submit to NCBI as per these specifications
    http://www.ncbi.nlm.nih.gov/Genbank/eukaryotic_genome_submission.html
     
     -- adding Genbank2Chado database xml configs for testing (anogam,...)
     
  - improvements (in progress) in chado sql efficiency (mostly table joins,
    which can be a problem with large dbs).
  
=head1 UPDATES  2007-Oct, version 1.1

  - No chromosome/scaffold/golden_path files.  This change is needed to handle 
    partially assembled genomes with many (1000s to 100,000s) scaffolds.
    Set flag no_csomesplit=1 to use this (should become default).
  
  - Gene Ontology association file, formatted as per
    http://geneontology.org/GO.format.annotation.shtml
    (see go_association tags in configurations).

  - Validate main variables in chado database: ${golden_path}, ${seq_ontology}
    This step, now on by default, checks that the database contains the values
    set in the configuration for chromosome type, sequence ontology CV name, and
    any other critical variables. If the check fails, the db is inspected for
    the real values.
    
  - miscellaneous bug fixes and configuration updates.  tables/overview is
    active again.


=head1 OUTPUTS
  
  DNA files (full chromosomes) in raw and fasta formats
  GFF (v3) feature files
  FFF (v1) feature files (used in FlyBase, each complex feature one line
     using GenBank/EMBL locations)
  Fasta sequence for each selected feature set,
     with headers from feature files
  BLAST Index files (NCBI)
  Chado database overview tables. 
  A standard genome webpage for access to bulk data.
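
  The FFF entry above refers to GenBank/EMBL location strings. As a small
  illustration of that location syntax only (the other FFF columns are not
  shown, and the function name genbank_location is an assumption made for
  this example), a multi-exon feature might be formatted like this:

```perl
use strict;
use warnings;

# Build a GenBank/EMBL-style location string from [start, stop] exon pairs
# (1-based, forward-strand coordinates). Hypothetical helper for illustration.
sub genbank_location {
    my ($strand, @exons) = @_;
    my @spans;
    for my $ex (sort { $a->[0] <=> $b->[0] } @exons) {
        push @spans, "$ex->[0]..$ex->[1]";
    }
    my $loc = @spans > 1 ? 'join(' . join(',', @spans) . ')' : $spans[0];
    return $strand < 0 ? "complement($loc)" : $loc;
}

print genbank_location( 1, [100, 200]), "\n";
print genbank_location(-1, [300, 400], [100, 200]), "\n";
```

  Keeping a complex feature's location in one string is what lets each
  feature occupy a single line.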

  
=head1 USAGE
  
  use Bio::GMOD::Bulkfiles;    
  
  my $bulkfiles= Bio::GMOD::Bulkfiles->new( 
    configfile => 'fbbulk-r4',   # data-release config file, required param
    debug => 1, showconfig => 0,
    failonerror => 1,
    );
  
  $bulkfiles->dumpFeatures(); # extract feature tables from chado sql db
  $bulkfiles->sortNSplitByChromosome(); # collate and separate by chromosome
  $bulkfiles->dumpChromosomeBases();    # extract chromosome dna files
  
     # produce various output format bulk files from above
  my $result= $bulkfiles->makeFiles(
        formats => [qw(fff gff fasta blast gnomap)], 
        chromosomes => [qw(all)] );
  
    
=head1 WHY Bulkfiles? 

  (rather than using other middleware layers to chado db - chadoxml,
   chadodbi, bioperl, ...)
   
  The general logic is
  
    1. dump all chado db features using simple (and quick) sql,
       to common intermediate table files, and chromosome dna to raw files.
       The feature info is simple: type, location, name/id, and a few
       attributes (db_xrefs,..)
       
    2. postprocess these table files to create the various public use
       formats (the time-consuming and configurable part), organized
       into per-chromosome files.
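
  Step 2 above can be sketched as follows. This is a minimal illustration,
  not the Bulkfiles implementation: the tab-delimited column order
  (chromosome, start, stop, type, name) and the function name
  split_by_chromosome are assumptions, and the real intermediate tables
  carry more attributes.

```perl
use strict;
use warnings;

# Group rows of a dumped feature table by chromosome and sort each group
# by start position. Column order is assumed for illustration only.
sub split_by_chromosome {
    my (@rows) = @_;
    my %by_chr;
    for my $row (@rows) {
        chomp $row;
        next if $row =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($chr, $start, $stop, $type, $name) = split /\t/, $row;
        push @{ $by_chr{$chr} }, [$start, $stop, $type, $name];
    }
    for my $chr (keys %by_chr) {
        @{ $by_chr{$chr} } = sort { $a->[0] <=> $b->[0] } @{ $by_chr{$chr} };
    }
    return \%by_chr;
}

# Made-up rows standing in for the output of the sql dump stage.
my @dump = (
    "2L\t7529\t9484\tgene\tgeneA",
    "X\t500\t900\tgene\tgeneB",
    "2L\t100\t450\ttRNA\tgeneC",
);
my $parts = split_by_chromosome(@dump);
for my $chr (sort keys %$parts) {
    print "$chr: ", join(', ', map { $_->[3] } @{ $parts->{$chr} }), "\n";
}
```

  Because this stage reads only the dumped table, it can be rerun with
  changed settings without touching the database again.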
    
  Here are some reasons we take this approach:
  
    a. using simple sql to dump all db features to intermediate table
       allows easy checks that all features get to bulk files
       
    b. simple sql dump is fast (30 - 60 min for full fly genome), 
       reliable in getting all mapped features by keeping logic simple
       
    c. process table output in stages - better debugging of steps in
        process, and can split processing among computers
       c1. the stages are loosely coupled - one can go back, tweak
        configurations and get a new output w/o redoing the complete
        extraction process.
        
    d. convert one common feature table + dna to several output formats
       in one step, or repeatedly as needed.
    
    e. combine features from several chado dbs (flybase now has 3 chado dbs
       for d.mel genome features), and add other sources like
       flybase cytology features.
       
    f. need fairly complex and data specific configurations - moving
      that to config files keeps code reusable.
      
    g. each genome chado database has different policy and choices with
      respect to feature, vocabulary and other data.  A highly configurable
      tool, with data extraction and correction methods that are separate
      and tunable is needed to adapt to such variation in genome databases.
      
      
      
 ==========================================================================  

=head1 NAME  

gff2biomart.pl -- create tables for BioMart from genome GFF annotations

=head1 EXAMPLE BioMart : DroSpeGe

  BioMart : DroSpeGe provides a tool for mining homologies and
annotations of [twelve] Drosophila species genomes.
You can select genome regions with the available annotations, and
exclude others, and download tables or sequences of the selection
set. For instance, select the regions with Mosquito gene homologs,
but no D. melanogaster gene homologs. Or select regions with
gene predictions but no known homolog.
  DroSpeGe's BioMart is built with the GMOD Tool gff2biomart.pl.
It converts most genome GFF annotations into a BioMart datamine.
Example data sets from this tool are at
  http://insects.eugenes.org/BioMart/martview

=head1 OUTPUTS 

  The script generates table .sql, .txt and .xml suited to BioMart 
(MySQL, BioMart version 0.3 tested).  It is a simple script without
special requirements, basically a data transformer that writes new
files formatted for a BioMart database.  Components created:

1. chromosome region__main tables for biomart
   with chromosomes broken into Kb bins/regions (5Kb default)
   Features that overlap each region are tabulated.
   
2. per-feature feature__dm link tables
    store feature attributes (id,dbxref,match stats,..)
    add __main column feature_bool to indicate
    where features lie.

3. __chromosome__dm    
  with dna residues for fasta output 

4. main_biomart.xml and sequence_biomart.xml
   configurations for biomart usage.
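
The region binning in component 1 can be sketched as below. This is an
in-memory illustration under stated assumptions, not the gff2biomart.pl
code: coordinates are taken as 1-based and inclusive, bins are numbered
from 0, and the names overlapped_bins and tabulate_regions are hypothetical.

```perl
use strict;
use warnings;

# Return the 0-based indices of the fixed-size bins that a feature
# (1-based, inclusive start/stop) overlaps. Bin size defaults to 5 Kb.
sub overlapped_bins {
    my ($start, $stop, $binsize) = @_;
    $binsize ||= 5000;
    ($start, $stop) = ($stop, $start) if $start > $stop;
    return (int(($start - 1) / $binsize) .. int(($stop - 1) / $binsize));
}

# Tabulate, per chromosome bin, which feature types are present.
# Each feature is [chromosome, type, start, stop] (assumed layout).
sub tabulate_regions {
    my ($binsize, @features) = @_;
    my %region;    # "chr:bin" => { type => 1, ... }
    for my $f (@features) {
        my ($chr, $type, $start, $stop) = @$f;
        for my $bin (overlapped_bins($start, $stop, $binsize)) {
            $region{"${chr}:${bin}"}{$type} = 1;
        }
    }
    return \%region;
}

my $regions = tabulate_regions(5000,
    ['2L', 'gene',    4900,  5100],     # spans the boundary of bins 0 and 1
    ['2L', 'homolog', 12000, 12500],    # falls inside bin 2
);
for my $id (sort keys %$regions) {
    print "$id: ", join(',', sort keys %{ $regions->{$id} }), "\n";
}
```

A feature that crosses a bin boundary is counted in every bin it touches,
which matches the idea that features overlapping each region are tabulated.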
   

=head1 IN BIOMART

-- filter (include,exclude) features that exist in regions,
including joint filters (has homology to human
but not to fly or worm genes; has predicted gene but 
not homology; any such feature type comparison).

-- output 4 kinds of attributes: a feature table, per-feature
sequence, region table, per-region sequence.
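
The joint include/exclude filter described above amounts to a simple test
on per-region feature flags. The sketch below shows that logic in plain
Perl rather than through BioMart itself; the region ids, the feature type
names and the function name filter_regions are all made up for the example.

```perl
use strict;
use warnings;

# Keep regions that carry every feature type in $include and none of the
# types in $exclude. $regions maps region id => { type => 1, ... }.
sub filter_regions {
    my ($regions, $include, $exclude) = @_;
    my @hits;
    for my $id (sort keys %$regions) {
        my $has = $regions->{$id};
        next if grep { !$has->{$_} } @$include;    # a required type is missing
        next if grep {  $has->{$_} } @$exclude;    # an excluded type is present
        push @hits, $id;
    }
    return @hits;
}

my %regions = (
    r1 => { mosquito_homolog => 1 },
    r2 => { mosquito_homolog => 1, dmel_homolog => 1 },
    r3 => { gene_prediction  => 1 },
);

# Regions with Mosquito gene homologs but no D. melanogaster gene homologs.
print join(',', filter_regions(\%regions, ['mosquito_homolog'], ['dmel_homolog'])), "\n";
```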
  
Please install BioMart and get familiar with it before trying
to load the outputs of this script.

=head1 VERSION

This is alpha-level software, not yet fully tested.  

 ==========================================================================  

=head1 NAME

cgi-bin/genomepathmap.pl

=head1 SYNOPSIS

  This works with the genome directory generated by bulkfiles.pl
  to provide standard URL access to genome data.  It transforms
  /genome/species/release/{dna,protein,mrna,ncrna,feature}
  web urls to bulk data file paths.
  It supports gzip and plain data returns.
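
  The URL pattern above could be parsed along these lines. This is a sketch
  under stated assumptions, not the genomepathmap.pl code: the on-disk
  layout <root>/<species>/<release>/<kind> and the names parse_genome_url
  and bulk_dir are hypothetical, and gzip handling is left out.

```perl
use strict;
use warnings;

# Data kinds named in the URL scheme.
my $KINDS = qr/dna|protein|mrna|ncrna|feature/;

# Split /genome/<species>/<release>/<kind> into its parts. Returns undef
# (in scalar context) for anything else, including paths containing "..".
sub parse_genome_url {
    my ($url_path) = @_;
    my ($species, $release, $kind) =
        $url_path =~ m{^/genome/(\w[\w.-]*)/(\w[\w.-]*)/($KINDS)/?$}
        or return;
    return { species => $species, release => $release, kind => $kind };
}

# Map a parsed request to a directory under a hypothetical layout.
sub bulk_dir {
    my ($root, $req) = @_;
    return join '/', $root, @{$req}{qw(species release kind)};
}

for my $url ('/genome/Drosophila_melanogaster/current/dna',
             '/genome/../etc/passwd') {
    my $req = parse_genome_url($url);
    print "$url => ", ($req ? bulk_dir('/data/genomes', $req) : 'rejected'), "\n";
}
```

  Matching the whole path against a fixed pattern, rather than passing it
  through to the file system, keeps requests inside the genome directory.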


=head1 GMOD::SeqUtils

  GMOD::SeqUtils is based on GMOD release 0.001 (January 2004) and can
  be expected to change or go obsolete.  Besides contents of
  this package, requirements to run include the perl modules
  installed with GMOD release, esp. Bio::Chado::LoadDBI/AutoDBI and
  its ancestor classes from CPAN - Class::DBI, Ima::DBI, etc.
  
  GMOD::Chado::SeqUtils was written for use with the nascent Daphnia genome database,
  wFleaBase, at http://eugenes.org/daphnia/. See also further information
  at http://eugenes.org/daphnia/database
    
  ======================================================================  

  
=head1 NAME

gmod_init_db.pl

=head1 SYNOPSIS

  *** from GMOD 0.001; needs updating to be useful
  A simple script to create the Chado database and tables

=head1 NAME

gmod_load_newseq.pl

=head1 SYNOPSIS

*** from GMOD 0.001; needs updating to be useful

Load a file of miscellaneous sequences (cDNA, EST, microsatellites) into a
chado db, generating public IDs. Sequences are assumed to be non-genomic,
not located.

  ======================================================================  


=head1 NAME

gmod_dump_seq.pl - Print sequences from ChadoDB

=head1 SYNOPSIS

  *** from GMOD 0.001; needs updating to be useful

Dump a sequence file of Chado features, with feature props, synonyms, dbxrefs.
Select by organism, by 'pub' = input file, seq type, feat props.
Needed for the nascent Daphnia wFleaBase to use sequence public IDs.

  ======================================================================  


=head1 NAME

  gmod_list_db.pl

=head1 SYNOPSIS

  *** from GMOD 0.001; needs updating to be useful

Summarize feature entries in a Chado database, by publication (input file),
by Seq. Ontology type (cDNA, EST, etc.), by organism.  Add other categories
as needed.

=head1 AUTHORS

  Don Gilbert, Feb 2004