$Id: BIODATABASE-ACCESS-HOWTO.txt 2590 2003-03-17 20:46:24Z kdj $

INTRODUCTION

Importing sequences with annotations is a central part of most
bioinformatics tasks. BioJava supports importing sequences from indexed
flat-files, local relational databases and remote (internet) databases.
Previously, separate programming syntax was required for each of these
types of data access. In addition, if one wanted to change one's mode of
sequence-data acquisition (for example, by implementing a local
relational database version of Genbank when previously the data had been
stored in an indexed flat-file) one had to rewrite all of the
data-access subroutines in one's application code.

The Open Biological Database Access (OBDA) System was designed so that
one could use the same application code to access data from all three of
the database types by simply changing a few lines in a "configuration
file". This makes application code more portable and easier to maintain.
This document shows how to set up the required OBDA-registry
configuration file and how to access data from the databases referred to
in the configuration file using the BioJava API as well as from the
command line.

Note: accessing data via the OBDA system is optional in BioJava. It is
still possible to access sequence data via the API in the
org.biojava.bio.seq.db package, including direct access to the binary
flatfile indexing system used by EMBOSS (that is, you can instantiate
BioJava SequenceDB objects using EMBOSS indices).

USING THE OBDA BIODIRECTORY REGISTRY SYSTEM TO ACCESS SEQUENCE DATABASES

The OBDA BioDirectory Registry is a platform-independent system for
specifying how BioJava programs find sequence databases. It uses a
site-wide configuration file, known as the registry, which defines one
or more databases and the access methods to use to access them.

For instance, you might start out by accessing sequence data over the
web, and later decide to install a locally mirrored copy of Genbank.
By changing one line in the registry file, all
Bio{Perl,Java,Python,Ruby} applications will start using the mirrored
local copy automatically; no source code changes are necessary.

INSTALLING THE REGISTRY FILE

The registry file should be named seqdatabase.ini. By default, it should
be installed in one of the following locations:

    $HOME/.bioinformatics/seqdatabase.ini
    /etc/bioinformatics/seqdatabase.ini

The Bio{Perl,Java,Python,Ruby} registry-handling code will initialize
itself from the registry file located in the home directory first,
followed by the system-wide default in /etc. If no local registry file
can be found, the registry-handling code will take its configuration
from the file located at this URL:

    http://www.open-bio.org/registry/seqdatabase.ini

MODIFYING THE SEARCH PATH

The registry file search path can be modified by setting the environment
variable OBDA_SEARCH_PATH. This variable is a "+" delimited string of
files and URLs, for example:

    OBDA_SEARCH_PATH=/home/lstein/seqdatabase.ini+http://foo.org/seqdatabase.ini

The search order proceeds from left to right. The first file or URL that
is found ends the search.

Warning! The search path selects an entire file (seqdatabase.ini), not a
single entry (e.g. 'genbank'). You therefore have to copy any default
values you want to keep from the (old) default configuration file to
your new configuration file. For example, say you have been using
biofetch with the default configuration file
http://www.open-bio.org/registry/seqdatabase.ini for all your
sequence-data retrieval. If you now install a local copy of genbank,
your local seqdatabase.ini must not only have a "stanza" indicating that
'genbank' is local, it must also have stanzas configuring the web access
for all the other databases you use, since
http://www.open-bio.org/registry/seqdatabase.ini will no longer be found
in a registry-file search.
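
The search order described above can be sketched in Java as follows.
This is only an illustration, not the BioJava registry code: the class
ObdaSearchPathSketch and its methods are hypothetical names, and the
sketch assumes (as the warning above implies) that a non-empty
OBDA_SEARCH_PATH replaces the default locations entirely. It only lists
the candidate locations in order; checking which one actually exists is
left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the BioJava API.
class ObdaSearchPathSketch {

    // Default locations named in this document, in search order.
    public static List<String> defaultLocations(String home) {
        return Arrays.asList(
            home + "/.bioinformatics/seqdatabase.ini",
            "/etc/bioinformatics/seqdatabase.ini",
            "http://www.open-bio.org/registry/seqdatabase.ini");
    }

    // Splits a "+"-delimited OBDA_SEARCH_PATH value, keeping
    // the left-to-right order of its entries.
    public static List<String> split(String searchPath) {
        List<String> entries = new ArrayList<String>();
        for (String entry : searchPath.split("\\+")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Assumption: a non-empty OBDA_SEARCH_PATH replaces the defaults.
    public static List<String> candidates(String searchPath, String home) {
        if (searchPath == null || searchPath.trim().isEmpty()) {
            return defaultLocations(home);
        }
        return split(searchPath);
    }

    public static void main(String[] args) {
        String searchPath = System.getenv("OBDA_SEARCH_PATH");
        String home = System.getProperty("user.home");
        // Print the locations in the order they would be tried.
        for (String location : candidates(searchPath, home)) {
            System.out.println(location);
        }
    }
}
```

Running the class with OBDA_SEARCH_PATH unset prints the three default
locations; with the variable set, only the listed entries are printed,
which is why stanzas from the default file have to be copied by hand.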

============================================================================

FORMAT OF THE REGISTRY FILE

The registry file is a simple text file, as shown in the following
example:

----------------- example starts --------------
VERSION=1.00

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/cgi-bin/dbfetch
dbname=embl

[swissprot]
protocol=biofetch
location=http://www.ebi.ac.uk/cgi-bin/dbfetch
dbname=swall
------------------ example ends ---------------

The first line is the registry format version number in the format
VERSION=X.XX. The current version is 1.00.

The remainder of the file uses a simple stanza format:

    [database-name]
    tag=value
    tag=value

    [database-name]
    tag=value
    tag=value

Each stanza starts with a symbolic database service name enclosed in
square brackets. Service names are case insensitive. The remainder of
the stanza is a series of tag=value pairs that configure access to the
service.

Database-name stanzas can be repeated, in which case the client should
try each service in turn from top to bottom.

Each stanza must contain these two mandatory tag=value lines:

    protocol=
    location=

The Protocol Tag
----------------

The protocol tag specifies what access mode to use. Currently it can be
one of:

    flat
    biofetch
    biosql

"flat" is used to fetch sequences from local flat files that have been
indexed using binary search indexing.

"biofetch" is used to fetch sequences from web-based databases. Due to
restrictions on the use of these databases, this is recommended only for
lightweight applications.

"biosql" fetches sequences from BioSQL databases. To use this protocol
you will need to set up an SQL database using the API in the
org.biojava.bio.seq.db.biosql package.

The Location Tag
----------------

The location tag tells the sequence-fetching code where the database is
located. Its interpretation depends on the protocol chosen. For example,
it might be a directory on the local file system, or a remote URL.
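
The stanza rules of the registry format (a VERSION line first,
case-insensitive service names, repeatable stanzas kept in file order,
mandatory protocol and location tags) can be sketched as a small parser.
This is a minimal illustration under those stated rules, not the parser
BioJava uses; the class RegistryFormatSketch and its methods are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser for illustration; not the BioJava registry code.
class RegistryFormatSketch {

    // Returns lower-cased service name -> its stanzas in file order.
    public static Map<String, List<Map<String, String>>> parse(String text) {
        Map<String, List<Map<String, String>>> services =
            new LinkedHashMap<String, List<Map<String, String>>>();
        Map<String, String> current = null;
        boolean sawVersion = false;
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!sawVersion) {
                // The first non-blank line must be VERSION=X.XX.
                if (!line.startsWith("VERSION=")) {
                    throw new IllegalArgumentException("missing VERSION line");
                }
                sawVersion = true;
            } else if (line.startsWith("[") && line.endsWith("]")) {
                checkMandatoryTags(current);
                // Service names are case insensitive, so normalise them.
                String name = line.substring(1, line.length() - 1)
                    .trim().toLowerCase(Locale.ROOT);
                List<Map<String, String>> stanzas = services.get(name);
                if (stanzas == null) {
                    stanzas = new ArrayList<Map<String, String>>();
                    services.put(name, stanzas);
                }
                current = new LinkedHashMap<String, String>();
                stanzas.add(current); // repeated stanzas keep file order
            } else {
                int eq = line.indexOf('=');
                if (current == null || eq < 1) {
                    throw new IllegalArgumentException("unexpected line: " + line);
                }
                current.put(line.substring(0, eq).trim(),
                            line.substring(eq + 1).trim());
            }
        }
        checkMandatoryTags(current);
        return services;
    }

    // Every stanza needs both protocol= and location=.
    private static void checkMandatoryTags(Map<String, String> stanza) {
        if (stanza == null) {
            return;
        }
        if (!stanza.containsKey("protocol") || !stanza.containsKey("location")) {
            throw new IllegalArgumentException("stanza lacks protocol or location");
        }
    }

    public static void main(String[] args) {
        String ini = "VERSION=1.00\n\n[embl]\nprotocol=biofetch\n"
            + "location=http://www.ebi.ac.uk/cgi-bin/dbfetch\ndbname=embl\n";
        // Print each configured access method for the "embl" service.
        for (Map<String, String> stanza : parse(ini).get("embl")) {
            System.out.println(stanza.get("protocol") + " " + stanza.get("location"));
        }
    }
}
```

Keeping a list of stanzas per service name mirrors the rule that a
client tries repeated stanzas from top to bottom.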

Other Tags
----------

Any number of additional tag values are allowed. The number and nature
of these tags depend on the access protocol selected. Some protocols
require no additional tags, whereas others will require several.

 Protocol    Tag        Description
 --------    ---        -----------

 flat        location   Directory in which the index is stored. The
                        "config.dat" file generated during indexing
                        must be found in this location.
             dbname     Name of database.

 biofetch    location   Base URL for the web service. Currently the
                        only compatible biofetch service is
                        http://www.ebi.ac.uk/cgi-bin/dbfetch
             dbname     Name of the database. Currently recognized
                        values are "embl" (sequence and protein),
                        "swall" (SwissProt + TREMBL), and "refseq"
                        (NCBI RefSeq entries).

 biosql      location
             dbname
             driver     mysql|postgres|oracle|sybase|sqlserver|access
                        |csv|informix|odbc|rdb
             user
             passwd
             biodbname

============================================================================

INSTALLING LOCAL DATABASES

If you are using the biofetch protocol, you're all set. You can start
reading sequences immediately. For the flat and biosql protocols, you
will need to create and initialize local databases. See the following
documentation on how to do this:

    flat protocol:   See FLAT-DATABASES-HOWTO.txt
                     (in the docs/howto subdirectory)
    biosql protocol: See BIOSQL-HOWTO.txt
                     (this doc is still being developed)

============================================================================

WRITING CODE TO USE THE REGISTRY

Once you've set up the OBDA registry file, accessing sequence data from
within a Java program is simple. The following example shows how; note
that nowhere in the program do you explicitly specify whether the data
is stored in a flat file, a local relational database or a database on
the internet.

To use the registry from a Java program, use the following idiom:

    1 import org.biojava.directory.Registry;
    2 Registry registry = Registry.instance();
    3 SequenceDBLite db = registry.getDatabase("embl");
    4 Sequence seq = db.getSequence("J02231");
    5 SeqIOTools.writeFasta(System.out, seq);

In lines 1 and 2, we import the Registry class and obtain a reference to
the singleton Registry object. In line 3, we ask the registry to return
a database accessor for the symbolic data source "embl", which must be
defined in an [embl] stanza in the seqdatabase.ini registry file. The
listing is a fragment: it omits the imports for SequenceDBLite, Sequence
and SeqIOTools, and statements 2 to 5 belong inside a method.

The returned accessor is a SequenceDBLite object (see the appropriate
JavaDoc page), which has amongst its methods:

    db.getSequence(id);

This method, used in line 4, returns a Sequence object by searching for
its primary ID. In line 5, we call the static writeFasta method of the
SeqIOTools utility class to print out the DNA or protein sequence.

============================================================================

USING BIOGETSEQ TO ACCESS REGISTRY SEQUENCES FROM THE COMMAND LINE

As a convenience, the BioJava distribution includes a program
'BioGetSeq' that provides OBDA access to sequence data from the command
line. 'BioGetSeq' is located in the apps directory of the BioJava
distribution. Move or add it into your path to run it. You can get help
by running it with no arguments:

    Usage: org.biojava.app.BioGetSeq --dbname embl --format embl \
           --namespace id [ id ... ]*
        dbname defaults to embl
        format defaults to embl
        namespace defaults to 'id' ['id' being the only supported namespace]
        rest of the arguments is a list of ids in the given namespace

If you have a set of ids you want to fetch from the EMBL database, you
just give them as space-separated parameters:

    % java org.biojava.app.BioGetSeq J02231 A21530 A10516

The output is directed to standard out, so it can be redirected to a
file.

The options can be given in long "double hyphen" format or abbreviated
to one letter format:

    % java org.biojava.app.BioGetSeq -f fasta --namespace id J02231 \
        A21530 A10516 > file.seq

----------------------------------------------------------------------------

ChangeLog

$Log$
Revision 1.1  2003/03/14 14:58:30  kdj
Imported from BioPerl, appropriate modifications