==============================================================================
Biblio::Citation::Parser 1.10 Documentation
==============================================================================

Contents:

- Introduction
- Required Software
- How to Install Biblio::Citation::Parser
- Reference Parsing
- ParaCite Web Service
- Troubleshooting
- How-To Guides
- Problems, Questions and Feedback

==============================================================================
Introduction
==============================================================================

What is ParaTools?

ParaTools, short for ParaCite Toolkit, is a collection of Perl modules for reference parsing that is designed to be easily expanded and yet simple to use. The parsing modules make up the core of the package, but there are also useful modules to assist with OpenURL creation and the extraction of references from documents. The toolkit is released under the GNU General Public License, so it can be used freely as long as the source code is provided (see the COPYING file in the root directory of the distribution for more information).

The toolkit came about as a result of the ParaCite resource, a reference search engine located at http://paracite.eprints.org, which uses a template-based reference parser to extract metadata from provided references and then provides search results based on this metadata. The ParaCite parser is provided directly as the Biblio::Citation::Parser::Standard module, with a separate Templates module that can be replaced as new reference templates are located.

As well as providing examples for the provided parsing modules, ParaTools also includes examples for using the ParaCite web service. This is an alternate interface which provides access to ParaCite's search and parsing functionality for any language that supports the Web Services Description Language (WSDL).

Who should use ParaTools?

The ParaTools package has many applications, including:

* Converting reference lists into valid OpenURLs
* Converting existing metadata into valid OpenURLs
* Collecting metadata from references to carry out internal searches
* Extracting reference lists from documents
* Carrying out searches using ParaCite

The modularity of ParaTools means that it is very easy to add new techniques (and we would be very pleased to hear of new ones!).

What will it run on?

ParaTools should work on any platform that supports Perl 5.6.0 or higher, although testing was primarily carried out using Red Hat Linux 7.3 with Perl 5.6. Where possible, platform-agnostic modules have been used for file functionality, so temporary files should be placed in the correct place for the operating system.

Memory requirements for ParaTools are minimal, although the template parser and document parser will require more memory as the number of templates and sizes of documents increase.

This Documentation

This documentation is written in Perl POD format and converted into PostScript (which is 2 pages to a sheet for printing), ASCII, PDF, and HTML.

The latest version of this documentation can be obtained from http://paracite.eprints.org/files/docs/

==============================================================================
Required Software
==============================================================================

What software does Biblio::Citation::Parser need?

Perl Modules

URI
    URI is required for the OpenURL encoding functions in Biblio::Citation::Parser::Utils.

Text::Unidecode
    Used by Biblio::Citation::Parser::Citebase to allow for matching on Unicode strings.

URI::OpenURL (Optional)
    If you wish to create valid OpenURLs, URI::OpenURL provides a set of functions for this purpose. The metadata produced by Biblio::Citation::Parser can be used with this module.

SOAP::Lite (Optional)
    This module is required if you wish to use the ParaCite web services, but optional otherwise.
    SOAP::Lite itself requires several other modules, which are available in the soap subdirectory of http://paracite.eprints.org/files/perlmods/.

There are also some dependencies for the above modules, including MIME::Base64, HTML::TagSet, and Digest::MD5. The latest versions of these can be obtained from http://www.cpan.org/

Installing Perl Modules

This describes how to install a simple Perl module; some modules require a bit more effort. We will use the non-existent FOO module as an example.

Unpack the archive:

    % tar xfvz FOO-5.2.34.tar.gz

Enter the directory this creates:

    % cd FOO-5.2.34

Run the following commands:

    % perl ./Build.PL
    % ./Build
    % ./Build test
    % ./Build install

==============================================================================
How to Install Biblio::Citation::Parser
==============================================================================

Installation

First unpack the Biblio::Citation::Parser archive:

    % tar xfvz .tar.gz

Move into the unpacked folder, and then do the following:

    % perl Build.PL
    % ./Build

You can optionally run

    % ./Build test

which will carry out a few checks to ensure everything is working correctly.

Finally, become root and do:

    % ./Build install

This will install the modules and man pages into the correct locations.

Examples

The examples directory contains two categories of examples - parsing examples and web service examples. Note that the web service examples require the SOAP::Lite module (see Required Software for more information).

To try out these samples after installation, simply cd into the directory and execute the example. More information about the examples is in the README file inside the examples directory.
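
After installation, one way to confirm that Perl can find the modules is a short script along the following lines. This is only a sketch: module_available is a hypothetical helper written for this illustration and is not part of the distribution, and the module lists simply repeat the names given in the Required Software section.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the distribution): returns 1 if the
# named module can be loaded from the current @INC, and 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Convert Some::Module into the Some/Module.pm path that require expects.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Module names taken from the Required Software section.
my @required = ('Biblio::Citation::Parser::Standard', 'URI', 'Text::Unidecode');
my @optional = ('URI::OpenURL', 'SOAP::Lite');

for my $module (@required, @optional) {
    my $status = module_available($module) ? 'found' : 'MISSING';
    print "$module: $status\n";
}
```

A module reported as MISSING in the optional list only matters if you intend to create OpenURLs with URI::OpenURL or to use the web service examples.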
==============================================================================
Reference Parsing
==============================================================================

Parsing References

Biblio::Citation::Parser is designed for parsing citations, and this can be done very simply:

    use Biblio::Citation::Parser::Standard;
    my $parser = new Biblio::Citation::Parser::Standard();
    my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");

The $metadata variable is a hash containing the information extracted from the reference.

If you'd prefer to use another parser, simply substitute the 'Standard' for the appropriate module. Biblio::Citation::Parser is distributed with the Jiao module, which is a slightly modified version of a module created by Zhuoan Jiao. To use this instead of the Standard module, you would do the following:

    use Biblio::Citation::Parser::Jiao;
    my $parser = new Biblio::Citation::Parser::Jiao();
    my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");

The Standard module provides slightly richer metadata than the Jiao module, but it does rely on templates (see Biblio::Citation::Parser::Templates) so requires updating as new citation formats are found.

Creating an OpenURL

Once you have the metadata from the reference, it is easy to create an OpenURL from it:

    use Biblio::Citation::Parser::Standard;
    use Biblio::Citation::Parser::Utils;
    my $parser = new Biblio::Citation::Parser::Standard();
    my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");
    my $openurl = create_openurl($metadata);

The OpenURLs created by Biblio::Citation::Parser do not have a Base URL prefixed, so this should be carried out before they are used (the ParaCite base URL is http://paracite.eprints.org/cgi-bin/openurl.cgi).
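
Prefixing the base URL could be done with a small helper like the one sketched here. It assumes that the string returned by create_openurl is the query-string part of the OpenURL; the helper name prefix_base_url and the sample query string are hypothetical and are used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Biblio::Citation::Parser): joins a
# resolver base URL and an OpenURL query string with exactly one '?'.
sub prefix_base_url {
    my ($base_url, $query) = @_;
    $base_url .= '?' unless $base_url =~ /\?$/;   # add the separator if missing
    $query =~ s/^\?//;                            # avoid a doubled '?'
    return $base_url . $query;
}

# Stand-in for the output of create_openurl($metadata); an assumed shape.
my $openurl = 'aulast=Jewell&aufirst=M&date=2002';

my $full_url = prefix_base_url('http://paracite.eprints.org/cgi-bin/openurl.cgi', $openurl);
print "$full_url\n";
```

The helper accepts the base URL with or without a trailing '?', since both forms appear in this documentation.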
If you would like to try to extract more information from the metadata, you can use the "decompose_openurl" function:

    my ($enriched_metadata, @errors) = decompose_openurl($metadata);

This tries to extract information from SICIs, page ranges, etc., and also checks the fields for validity (the @errors array contains any mistakes).

Note that create_openurl has been superseded by URI::OpenURL, but the metadata returned by "trim_openurl" is in the correct format to be passed to this module.

Metadata Structure

Biblio::Citation::Parser supports all of the fields specified in Table 1 of the OpenURL specification (http://www.sfxit.com/openurl/openurl.html). Specific parsers can add their own fields, but these are not exported when OpenURLs are created.

Biblio::Citation::Parser::Standard provides the following extra fields:

marked
    A marked-up version of the reference. e.g. Jewell, M (2002) A title.

match
    The template matched by Biblio::Citation::Parser::Standard

ref
    The original reference

==============================================================================
ParaCite Web Service
==============================================================================

The ParaCite Web Service

The Biblio::Citation::Parser package includes several examples that demonstrate the ParaCite web service, as well as the WSDL definition file. This section explains the web service, and gives an introduction to using it.

As ParaCite is written entirely in Perl, there are obvious issues if you wish to use Java, PHP, or another language. The ParaCite web service provides an interface into the reference parsing features of ParaCite, while remaining language agnostic.

Using the Web Service from Perl

Accessing the web service from Perl requires the SOAP::Lite module (see Required Software).
Once it is installed, the following is all that is required to connect to the web service:

    my $service = SOAP::Lite -> service("http://paracite.eprints.org/paracite.wsdl");

Three functions are now available from the $service variable:

doOpenURLConstruct($reference, $baseurl)
    This returns an OpenURL, prefixed by the base URL if one is provided.

doReferenceParse($reference, $baseurl)
    This returns a hash containing the metadata in the reference, and an OpenURL formed using the metadata and the base URL.

doParaciteSearch($reference, $baseurl)
    This returns a hash containing 'resultElements' (an array of search results), and 'metadata' (a hash of metadata).

Web Service Examples

The following code parses a reference, and stores the metadata in $metadata and the OpenURL in $openurl:

    use SOAP::Lite;

    my $service = SOAP::Lite -> service("http://paracite.eprints.org/paracite.wsdl");
    my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
    my $result = $service -> doReferenceParse("Jewell, M (2002) Example", $base_url);
    my $metadata = $result->{metadata};
    my $openurl = $result->{openURL};

If you do not want the metadata, and just want a link to an OpenURL resolver, the following will do that:

    use SOAP::Lite;

    my $service = SOAP::Lite -> service("http://paracite.eprints.org/paracite.wsdl");
    my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
    my $open_url = $service -> doOpenURLConstruct("Jewell, M (2002) Example", $base_url);

Finally, this example uses the doParaciteSearch method to get the first match on a reference:

    use SOAP::Lite;

    my $service = SOAP::Lite -> service("http://paracite.eprints.org/paracite.wsdl");
    my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
    my $query = "Harnad, Stevan (1995) The PostGutenberg Galaxy.";
    my $result = $service -> doParaciteSearch($query, $base_url);
    my $first_result = $result->{resultElements}->[0];
    print "First result is: ".$first_result->{URL}."\n";

The web service automatically adds Google, Scirus, and
Vivisimo as resources to the search request, so if no resources match the publication or subject these will be used as fall-backs.

Web Service Structures

Most of the ParaCite structures have been modelled very closely on the Google web service structures to allow some degree of standardisation. Some additions have been made, and some fields are not yet used, but these may change in future versions.

ParaciteSearchResult

resultElements
    This is an array of resources, along with the search URLs associated with them. See the ResultElement description later in this section.

estimatedTotalResultsCount
    This returns the number of items in the resultElements array.

estimateIsExact
    This currently always returns 1.

searchQuery
    This contains the original reference.

openURL
    This contains the OpenURL represented by the reference metadata (prefixed by base URL if one is supplied).

metadata
    This is a Metadata object (see later in this section).

ResultElement

URL
    This is a URL that searches the current resource for the reference.

template
    This contains the template of the matching resource interface that was used to generate the search URL.

name
    The name of the resource (e.g. Google).

description
    Some more information about the resource.

tollfree
    A boolean value that is true if the results can be viewed without cost.

fulltext
    A boolean value that is true if a resulting article from this resource will have the full text available.

stratum
    An integer representing the stratum in which this resource lies. A complete list of the strata is available at http://paracite.eprints.org/cgi-bin/views/viewstrata.cgi

Metadata

All of the fields in Metadata are valid fields in OpenURL metadata. See Table 1 at http://www.sfxit.com/openurl/openurl.html for a complete list.
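
As an illustration of these structures, the following sketch filters a ParaciteSearchResult for resources that are both free to view and full text. The data is hand-written sample data using the field names described above, not real service output, and free_fulltext_elements is a hypothetical helper. It also assumes that the boolean fields arrive as Perl true/false values (1 or 0); if your SOAP client delivers them as the strings "true" and "false", the test would need adjusting.

```perl
use strict;
use warnings;

# Hand-written sample shaped like a ParaciteSearchResult; the resource
# names and URLs are invented for illustration.
my $result = {
    searchQuery                => 'Harnad, Stevan (1995) The PostGutenberg Galaxy.',
    estimatedTotalResultsCount => 2,
    resultElements             => [
        {
            name     => 'Example Archive',
            URL      => 'http://archive.example.org/search?q=galaxy',
            tollfree => 1,
            fulltext => 1,
            stratum  => 1,
        },
        {
            name     => 'Example Index',
            URL      => 'http://index.example.org/find?q=galaxy',
            tollfree => 0,
            fulltext => 0,
            stratum  => 2,
        },
    ],
};

# Hypothetical helper: returns the ResultElements that can be viewed
# without cost and that offer the full text.
sub free_fulltext_elements {
    my ($search_result) = @_;
    return grep { $_->{tollfree} && $_->{fulltext} }
        @{ $search_result->{resultElements} || [] };
}

for my $element (free_fulltext_elements($result)) {
    print "$element->{name} (stratum $element->{stratum}): $element->{URL}\n";
}
```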
==============================================================================
Troubleshooting
==============================================================================

Troubleshooting

If you cannot find a solution to your problem here, make sure you are using the latest version of the toolkit and ask on the ParaTools mailing list (see http://paracite.eprints.org/developers/).

Reference Parsing

Reference does not parse correctly

If you are using the Standard parsing module, make sure that a template for the reference exists in the package. See the HOWTO for more information on how to do this. If you are using a contributed module, please email your query to the author of the module.

==============================================================================
How-To Guides
==============================================================================

HOW TO: Modify Templates in Biblio::Citation::Parser::Standard

Adding new templates to the Standard parser is relatively easy:

* Locate where your Templates.pm file has been installed. On Linux systems this should just involve doing 'locate Templates.pm', otherwise 'find / -name Templates.pm' should work. Alternatively, you can edit the Templates.pm in the Biblio/Citation/Parser/ directory of an unpacked distribution, and install it once you have finished.

* Add the template to the list. If you are editing an already installed Templates.pm file you will probably have to be root to do this. If you are editing the Templates.pm inside an unpacked distribution, you will have to reinstall the modules once you are finished (see the Installation section).

The Templates.pm file should contain a structure similar to this:

    $Biblio::Citation::Parser::Templates::templates = [
        '_AUTHORS_, _PUBLICATION_, _YEAR_, _ISSUE_, _SPAGE_-_EPAGE_',
        ...
    ];

Each template is a string containing a set of placeholders. For example, '_AUTHORS_ (_YEAR_) _TITLE_' can match 'Jewell, M (2002) Title'.
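
To give a feel for how a placeholder template can match a reference, here is a minimal sketch that turns a template into a regular expression. It is not the code used by Biblio::Citation::Parser::Standard: the function match_template and the three patterns in %pattern_for are assumptions made for this illustration, and the real parser may match fields quite differently.

```perl
use strict;
use warnings;

# Illustrative patterns only; these are assumptions, not the patterns
# used by the real Standard parser.
my %pattern_for = (
    AUTHORS => '.+?',
    YEAR    => '\d{4}',
    TITLE   => '.+',
);

# Hypothetical function: returns a hash reference of matched fields,
# or nothing if the reference does not fit the template.
sub match_template {
    my ($template, $reference) = @_;
    my @fields;
    my $regex = '';
    # Split into placeholders and literal text, keeping the placeholders.
    for my $part (split /(_[A-Z]+_)/, $template) {
        if ($part =~ /^_([A-Z]+)_$/ && exists $pattern_for{$1}) {
            push @fields, lc $1;
            $regex .= "($pattern_for{$1})";   # one capture group per field
        }
        else {
            $regex .= quotemeta $part;        # literal text (or unknown placeholder)
        }
    }
    my @values = $reference =~ /^$regex$/ or return;
    my %metadata;
    @metadata{@fields} = @values;
    return \%metadata;
}

my $metadata = match_template('_AUTHORS_ (_YEAR_) _TITLE_', 'Jewell, M (2002) Title');
if ($metadata) {
    print "$_: $metadata->{$_}\n" for sort keys %$metadata;
}
```

The punctuation between placeholders is matched literally in this sketch, which is why the order and punctuation of a template need to follow the citation style it is meant to recognise.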
The following are valid field names:

    _ANY_             Matches anything.
    _AUFIRST_         Matches the first name of an author.
    _AULAST_          Matches the last name of an author.
    _AUTHORS_         Matches a list of authors.
    _CAPPUBLICATION_  Matches a capitalised publication title (e.g. "Journal of Lemurs").
    _CAPTITLE_        Matches a capitalised title.
    _CHAPTER_         Matches a chapter number.
    _DATE_            Matches a date in nn/nn/nn form.
    _EDITOR_          Matches an editor's name.
    _EPAGE_           Matches the last page in a page range.
    _ISBN_            Matches an ISBN number.
    _ISSN_            Matches an ISSN number.
    _ISSUE_           Matches an issue number.
    _PAGES_           Matches a page range in nn-nn form.
    _PUBLICATION_     Matches a publication name.
    _PUBLISHER_       Matches a publisher name.
    _PUBLOC_          Matches the location of a publisher.
    _SPAGE_           Matches the start page.
    _SUBTITLE_        Matches a subtitle.
    _TITLE_           Matches an article title.
    _UCPUBLICATION_   Matches a publication in entirely upper-case (e.g. JOURNAL OF LEMURS).
    _UCTITLE_         Matches a title in entirely upper-case.
    _URL_             Matches a URL.
    _VOLUME_          Matches a volume.
    _YEAR_            Matches a year (4 digits).

HOW TO: Integrate ParaTools with EPrints 2

EPrints already contains ParaCite support, but it uses a specially built version of the module from before it was part of ParaTools. To alter your cgi/paracite script to use ParaTools, you need to do the following:

First replace

    use Citation::Parser::Simple;

with

    use Biblio::Citation::Parser::Standard;

Next, replace this line:

    my $parser = new Citation::Parser::Simple();

with this line:

    my $parser = new Biblio::Citation::Parser::Standard();

This should work fine, although you can obviously integrate ParaCite more if you wish.

HOW TO: Create a New Parser

Creating a Citation Parser

All new citation parsers should be named Biblio::Citation::Parser::SomeName, where SomeName is replaced with a unique name (ideally the author's surname).
The parser should extend the Biblio::Citation::Parser module like so:

    package Biblio::Citation::Parser::SomeName;
    require Exporter;
    # Declared with 'our' so that the package also compiles under 'use strict'.
    our @ISA = ("Exporter", "Biblio::Citation::Parser");
    our @EXPORT_OK = ( 'new', 'parse' );

You should then override the 'new' and 'parse' methods, e.g.:

    # Constructor: returns an empty object blessed into the parser class.
    sub new
    {
        my($class) = @_;
        my $self = {};
        return bless($self, $class);
    }

    # Takes a reference string and returns the extracted metadata.
    sub parse
    {
        my($self, $ref) = @_;
        # extract_metadata stands in for your own parsing routine.
        my $hashout = $self->extract_metadata($ref);
        return $hashout;
    }

This makes it easy for users to swap out one reference parser for another.

==============================================================================
Problems, Questions and Feedback
==============================================================================

Bug Report Policy

There is currently no online bug tracking system. Known bugs are listed in the BUGLIST file in the distribution and a list will be kept on the http://paracite.eprints.org/developers/ site.

If you identify a bug or "issue" (issues are not bugs, but are things which could be clearer or better), and it's not already listed on the site, please let us know at paracite@ecs.soton.ac.uk - include all the information you can: what version of Biblio::Citation::Parser (see VERSION if you're not sure), what operating system etc.

Where to go with Questions and Suggestions

There is a mailing list for ParaTools (encompassing Biblio::Citation::Parser) which may be the right place to ask general questions and start discussions on broad design issues.

To subscribe, send an email to majordomo@ecs.soton.ac.uk containing the text

    subscribe paratools