File: citeparser.txt

==============================================================================
Biblio::Citation::Parser 1.10 Documentation
==============================================================================

Contents:
 - Introduction
 - Required Software
 - How to Install Biblio::Citation::Parser
 - Reference Parsing
 - ParaCite Web Service
 - Troubleshooting
 - How-To Guides
 - Problems, Questions and Feedback

==============================================================================
Introduction
==============================================================================

What is ParaTools?
    ParaTools, short for ParaCite Toolkit, is a collection of Perl modules for
    reference parsing that is designed to be easily expanded and yet simple to
    use. The parsing modules make up the core of the package, but there are
    also useful modules to assist with OpenURL creation and the extraction of
    references from documents. The toolkit is released under the GNU General
    Public License, so it can be used freely as long as the source code is
    provided (see the COPYING file in the root directory of the distribution
    for more information).

    The toolkit came about as a result of the ParaCite resource, a reference
    search engine located at http://paracite.eprints.org, which uses a
    template-based reference parser to extract metadata from provided
    references and then provides search results based on this metadata. The
    ParaCite parser is provided directly as the
    Biblio::Citation::Parser::Standard module, with a separate Templates
    module that can be replaced as new reference templates are located.

    As well as examples for the bundled parsing modules, ParaTools also
    includes examples for using the ParaCite web service. This is an
    alternative interface that provides access to ParaCite's search and
    parsing functionality for any language that supports the Web Services
    Description Language (WSDL).

Who should use ParaTools?
    The ParaTools package has many applications, including:

    *       Converting reference lists into valid OpenURLs

    *       Converting existing metadata into valid OpenURLs

    *       Collecting metadata from references to carry out internal searches

    *       Extracting reference lists from documents

    *       Carrying out searches using ParaCite

    The modularity of ParaTools means that it is very easy to add new
    techniques (and we would be very pleased to hear of new ones!).

What will it run on?
    ParaTools should work on any platform that supports Perl 5.6.0 or higher,
    although testing was primarily carried out using Red Hat Linux 7.3 with
    Perl 5.6. Where possible, platform-agnostic modules have been used for file
    functionality, so temporary files should be placed in the correct place
    for the operating system. Memory requirements for ParaTools are minimal,
    although the template parser and document parser will require more memory
    as the number of templates and sizes of documents increase.

This Documentation
    This documentation is written in Perl POD format and converted into
    PostScript (which is 2 pages to a sheet for printing), ASCII, PDF, and
    HTML.

    The latest version of this documentation can be obtained from
    http://paracite.eprints.org/files/docs/


==============================================================================
Required Software
==============================================================================

What software does Biblio::Citation::Parser need?
  Perl Modules
    URI     URI is required for the OpenURL encoding functions in
            Biblio::Citation::Parser::Utils.

    Text::Unidecode
            Used by Biblio::Citation::Parser::Citebase to allow for matching
            on unicode strings.

    URI::OpenURL (Optional)
            If you wish to create valid OpenURLs, URI::OpenURL provides a set
            of functions for this purpose. The metadata produced by
            Biblio::Citation::Parser can be used with this module.

    SOAP::Lite (Optional)
            This module is required if you wish to use the ParaCite web
            services, but optional otherwise. This requires several other
            modules, which are available in the soap subdirectory of
            http://paracite.eprints.org/files/perlmods/.

    There are also some dependencies for the above modules, including
    MIME::Base64, HTML::TagSet, and Digest::MD5. The latest versions of these
    can be obtained from http://www.cpan.org/
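
    Before installing, it can help to see which of these modules are already
    present. The following is a minimal sketch using only core Perl; the
    helper name module_available is our own and is not part of the
    distribution, and the required/optional labels simply repeat the list
    above.

```perl
use strict;
use warnings;

# Modules named in the list above, labelled as that list labels them.
my %modules = (
    'URI'             => 'required',
    'Text::Unidecode' => 'required',
    'URI::OpenURL'    => 'optional',
    'SOAP::Lite'      => 'optional',
);

# Hypothetical helper: return 1 if the named module can be loaded, else 0.
sub module_available {
    my ($name) = @_;
    (my $file = "$name.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $name (sort keys %modules) {
    my $status = module_available($name) ? 'found' : 'MISSING';
    printf "%-16s %-9s %s\n", $name, $modules{$name}, $status;
}
```

    Any module reported as MISSING can then be installed as described in the
    next section.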

  Installing Perl Modules
    This describes how to install a simple Perl module; some modules require
    a bit more effort. We will use the non-existent FOO module as an example.

    Unpack the archive:
             % tar xfvz FOO-5.2.34.tar.gz

    Enter the directory this creates:
             % cd FOO-5.2.34

    Run the following commands:
             % perl ./Build.PL
             % ./Build
             % ./Build test
             % ./Build install


==============================================================================
How to Install Biblio::Citation::Parser
==============================================================================

Installation
    First unpack the Biblio::Citation::Parser archive:

     % tar xfvz <packagename>.tar.gz

    Move into the unpacked folder, and then do the following:

     % perl Build.PL
     % ./Build

    You can optionally run

     % ./Build test

    which will carry out a few checks to ensure everything is working
    correctly.

    Finally, become root and do:

     % ./Build install

    This will install the modules and man pages into the correct locations.

Examples
    The examples directory contains two categories of examples - parsing
    examples and web service examples. Note that the web service examples
    require the SOAP::Lite module (see Required Software for more
    information). To try out these samples after installation, simply cd into
    the directory and execute the example. More information about the examples
    is in the README file inside the examples directory.


==============================================================================
Reference Parsing
==============================================================================

Parsing References
    Biblio::Citation::Parser is designed for parsing citations, and this can
    be done very simply:

     use Biblio::Citation::Parser::Standard;
     my $parser = new Biblio::Citation::Parser::Standard();
     my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");

    The $metadata variable holds a reference to a hash containing the
    information extracted from the reference.
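
    As a rough illustration of working with the result, the sketch below
    walks a metadata hash reference and prints each non-empty field. The
    hash is a hand-written stand-in so that the snippet runs without the
    module installed; its keys and values are assumptions, not captured
    parser output, and metadata_lines is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-in for the value returned by $parser->parse(...). The keys and
# values are illustrative assumptions, not real parser output.
my $metadata = {
    aulast => 'Jewell',
    year   => '2002',
    title  => 'Parsing Examples',
};

# Hypothetical helper: one "key: value" line per non-empty field,
# sorted by field name.
sub metadata_lines {
    my ($meta) = @_;
    my @lines;
    for my $key (sort keys %$meta) {
        my $value = $meta->{$key};
        next unless defined $value && $value ne '';
        push @lines, "$key: $value";
    }
    return @lines;
}

print "$_\n" for metadata_lines($metadata);
```

    With the module installed, the stand-in hash would be replaced by the
    value returned from parse.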

    If you'd prefer to use another parser, simply replace 'Standard' with the
    name of the appropriate module. Biblio::Citation::Parser is distributed
    with the Jiao module, which is a slightly modified version of a module
    created by Zhuoan Jiao. To use this instead of the Standard module, you
    would do the following:

     use Biblio::Citation::Parser::Jiao;
     my $parser = new Biblio::Citation::Parser::Jiao();
     my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");

    The Standard module provides slightly richer metadata than the Jiao
    module, but it does rely on templates (see
    Biblio::Citation::Parser::Templates) so requires updating as new citation
    formats are found.

Creating an OpenURL
    Once you have the metadata from the reference, it is easy to create an
    OpenURL from it:

     use Biblio::Citation::Parser::Standard;
     use Biblio::Citation::Parser::Utils;
     my $parser = new Biblio::Citation::Parser::Standard();
     my $metadata = $parser->parse("Jewell, M (2002) Parsing Examples");
     my $openurl = create_openurl($metadata);

    The OpenURLs created by Biblio::Citation::Parser do not have a Base URL
    prefixed, so this should be carried out before they are used (the ParaCite
    base URL is http://paracite.eprints.org/cgi-bin/openurl.cgi).
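
    A minimal sketch of adding the base URL follows. It assumes that the
    value from create_openurl is a bare query string; a placeholder string
    is used here so that the snippet runs on its own. The helper name
    with_base_url is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: join a resolver base URL and an OpenURL query
# string with exactly one '?' between them.
sub with_base_url {
    my ($base_url, $openurl) = @_;
    $base_url =~ s/\?+\z//;    # tolerate a base URL that already ends in '?'
    $openurl  =~ s/\A\?+//;    # tolerate a query string that starts with '?'
    return $base_url . '?' . $openurl;
}

# Placeholder standing in for the value of create_openurl($metadata).
my $openurl = 'aulast=Jewell&date=2002';
my $base    = 'http://paracite.eprints.org/cgi-bin/openurl.cgi';

print with_base_url($base, $openurl), "\n";
```

    Stripping a stray '?' on either side means the helper accepts the base
    URL with or without a trailing '?'.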

    If you would like to try to extract more information from the metadata,
    you can use the "decompose_openurl" function:

     my ($enriched_metadata, @errors) = decompose_openurl($metadata);
 
    This tries to extract information from SICIs, page ranges, etc, and also
    checks the fields for validity (the @errors array contains any mistakes).

    Note that create_openurl has been superseded by URI::OpenURL, but the
    metadata returned by "trim_openurl" is in the correct format to be passed
    to that module.

Metadata Structure
    Biblio::Citation::Parser supports all of the fields specified in Table 1
    of the OpenURL specification (http://www.sfxit.com/openurl/openurl.html).
    Specific parsers can add their own fields, but these are not exported when
    OpenURLs are created. Biblio::Citation::Parser::Standard provides the
    following extra fields:

    marked  A marked-up version of the reference. e.g. <author>Jewell,
            M</author> (<year>2002</year>) <title>A title</title>.

    match   The template matched by Biblio::Citation::Parser::Standard

    ref     The original reference
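
    The 'marked' field can be picked apart again with a short regular
    expression. The sketch below is one possible approach, assuming the tags
    are never nested and each tag name occurs at most once; the helper
    fields_from_marked is our own and not part of the distribution.

```perl
use strict;
use warnings;

# Hypothetical helper: collect <tag>value</tag> pairs from a marked-up
# reference into a hash reference. Assumes tags are not nested and that
# each tag name appears at most once.
sub fields_from_marked {
    my ($marked) = @_;
    my %fields;
    while ($marked =~ m{<(\w+)>(.*?)</\1>}g) {
        $fields{$1} = $2;
    }
    return \%fields;
}

my $marked = '<author>Jewell, M</author> (<year>2002</year>) '
           . '<title>A title</title>.';
my $fields = fields_from_marked($marked);

print "$_ = $fields->{$_}\n" for sort keys %$fields;
```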


==============================================================================
ParaCite Web Service
==============================================================================

The ParaCite Web Service
    The Biblio::Citation::Parser package includes several examples that
    demonstrate the ParaCite web service, as well as the WSDL definition file.
    This section explains the web service, and gives an introduction to using
    it.

    As ParaCite is written entirely in Perl, there are obvious issues if you
    wish to use Java, PHP, or another language. The ParaCite web service
    provides an interface to the reference parsing features of ParaCite,
    while remaining language agnostic.

Using the Web Service from Perl
    Accessing the web service from Perl requires the SOAP::Lite module (see
    Required Software). Once this is installed, the following is all that is
    needed to connect to the web service:

     my $service = SOAP::Lite
            -> service("http://paracite.eprints.org/paracite.wsdl");

    Three functions are now available from the $service variable:

    doOpenURLConstruct($reference, $baseurl)
            This returns an OpenURL, prefixed by the base URL if one is
            provided.

    doReferenceParse($reference, $baseurl)
            This returns a hash containing the metadata in the reference, and
            an OpenURL formed using the metadata and the base URL.

    doParaciteSearch($reference, $baseurl)
            This returns a hash containing 'resultElements' (an array of
            search results), and 'metadata' (a hash of metadata).

Web Service Examples
    The following code parses a reference, and stores the metadata in
    $metadata and the OpenURL in $openurl:

     use SOAP::Lite;
     my $service = SOAP::Lite
            -> service("http://paracite.eprints.org/paracite.wsdl");
     my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
     my $result = $service 
            -> doReferenceParse("Jewell, M (2002) Example", $base_url);
     my $metadata = $result->{metadata};
     my $openurl = $result->{openURL};

    If you do not want the metadata, and just want a link to an OpenURL
    resolver, the following will do that:

     use SOAP::Lite;
     my $service = SOAP::Lite
            -> service("http://paracite.eprints.org/paracite.wsdl");
     my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
     my $open_url = $service
            -> doOpenURLConstruct("Jewell, M (2002) Example", $base_url);
 
    Finally, this example uses the doParaciteSearch method to get the first
    match on a reference:

     use SOAP::Lite;
     my $service = SOAP::Lite
            -> service("http://paracite.eprints.org/paracite.wsdl");
     my $base_url = "http://paracite.eprints.org/cgi-bin/openurl.cgi?";
     my $query = "Harnad, Stevan (1995) The PostGutenberg Galaxy.";
     my $result = $service 
            -> doParaciteSearch($query, $base_url);
     my $first_result = $result->{resultElements}->[0];
     print "First result is: ".$first_result->{URL}."\n";

    The web service automatically adds Google, Scirus, and Vivisimo as
    resources to the search request, so if no resources match the publication
    or subject these will be used as fall-backs.

Web Service Structures
    Most of the ParaCite structures have been modelled very closely on the
    Google web service structures to allow some degree of standardisation.
    Some additions have been made, and some fields are not yet used, but these
    may change in future versions.

  ParaciteSearchResult
    resultElements
            This is an array of resources, along with the search URLs
            associated with them. See the ResultElement description later in
            this section.

    estimatedTotalResultsCount
            This returns the number of items in the resultElements array.

    estimateIsExact
            This currently always returns 1.

    searchQuery
            This contains the original reference.

    openURL This contains the OpenURL represented by the reference metadata
            (prefixed by base URL if one is supplied).

    metadata
            This is a Metadata object (see later in this section).

  ResultElement
    URL     This is a URL that searches the current resource for the
            reference.

    template
            This contains the template of the matching resource interface that
            was used to generate the search URL.

    name    The name of the resource (e.g. Google).

    description
            Some more information about the resource.

    tollfree
            A boolean value that is true if the results can be viewed without
            cost.

    fulltext
            A boolean value that is true if a resulting article from this
            resource will have the full text available.

    stratum An integer representing the stratum in which this resource lies. A
            complete list of the strata is available at
            http://paracite.eprints.org/cgi-bin/views/viewstrata.cgi
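
    As an illustration of using these fields, the sketch below keeps only
    the resources that are free to view and offer full text. The result
    structure is built by hand so that the snippet runs without contacting
    the service; the resource names, URLs and flag values are invented.
    How the boolean fields arrive from SOAP is an assumption, so the
    hypothetical is_true helper accepts both Perl truth values and the
    strings 'true' and 'false'.

```perl
use strict;
use warnings;

# Hand-built stand-in for the value returned by doParaciteSearch; the
# names, URLs and flag values are invented for illustration.
my $result = {
    resultElements => [
        { name => 'Resource A', URL => 'http://example.org/a?q=ref',
          tollfree => 1, fulltext => 1, stratum => 1 },
        { name => 'Resource B', URL => 'http://example.org/b?q=ref',
          tollfree => 0, fulltext => 1, stratum => 2 },
        { name => 'Resource C', URL => 'http://example.org/c?q=ref',
          tollfree => 'true', fulltext => 'false', stratum => 3 },
    ],
};

# Hypothetical helper: treat undef, '', 0 and the string 'false' as
# false, and anything else as true.
sub is_true {
    my ($value) = @_;
    return 0 if !defined $value || $value eq '' || $value eq '0';
    return lc($value) eq 'false' ? 0 : 1;
}

# Keep only resources that are both free to view and full text.
sub free_fulltext {
    my ($elements) = @_;
    return grep { is_true($_->{tollfree}) && is_true($_->{fulltext}) }
           @$elements;
}

for my $element (free_fulltext($result->{resultElements})) {
    print "$element->{name}: $element->{URL}\n";
}
```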

  Metadata
    All of the fields in Metadata are valid fields in OpenURL metadata. See
    Table 1 at http://www.sfxit.com/openurl/openurl.html for a complete list.


==============================================================================
Troubleshooting
==============================================================================

Troubleshooting
    If you cannot find a solution to your problem here, make sure you are
    using the latest version of the toolkit and ask on the ParaTools mailing
    list (see http://paracite.eprints.org/developers/).

Reference Parsing
  Reference does not parse correctly
    If you are using the Standard parsing module, make sure that a template
    for the reference exists in the package. See "HOW TO: Modify Templates in
    Biblio::Citation::Parser::Standard" in the How-To Guides section for
    information on adding one. If you are using a contributed module, please
    email your query to the author of the module.


==============================================================================
How-To Guides
==============================================================================

HOW TO: Modify Templates in Biblio::Citation::Parser::Standard
    Adding new templates to the Standard parser is relatively easy:

    *       Locate where your Templates.pm file has been installed.

            On Linux systems this should just involve doing 'locate
            Templates.pm', otherwise 'find / -name Templates.pm' should work.
            Alternatively, you can edit the Templates.pm in the
            Biblio/Citation/Parser/ directory of an unpacked distribution, and
            install it once you have finished.

    *       Add the template to the list.

            If you are editing an already installed Templates.pm file you will
            probably have to be root to do this. If you are editing the
            Templates.pm inside an unpacked distribution, you will have to
            reinstall the modules once you are finished (see the Installation
            section).

    The Templates.pm file should contain a structure similar to this:

     $Biblio::Citation::Parser::Templates::templates = [
            '_AUTHORS_, _PUBLICATION_, _YEAR_, _ISSUE_, _SPAGE_-_EPAGE_',
     ...
            ];

    Each template is a string containing a set of placeholders. For example,
    '_AUTHORS_ (_YEAR_) _TITLE_' can match 'Jewell, M (2002) Title'. The
    following are valid field names:

    _ANY_   Matches anything.

    _AUFIRST_
            Matches the first name of an author.

    _AULAST_
            Matches the last name of an author.

    _AUTHORS_
            Matches a list of authors.

    _CAPPUBLICATION_
            Matches a capitalised publication title (e.g. "Journal of
            Lemurs").

    _CAPTITLE_
            Matches a capitalised title.

    _CHAPTER_
            Matches a chapter number.

    _DATE_  Matches a date in nn/nn/nn form.

    _EDITOR_
            Matches an editor's name.

    _EPAGE_ Matches the last page in a page range.

    _ISBN_  Matches an ISBN number.

    _ISSN_  Matches an ISSN number.

    _ISSUE_ Matches an issue number.

    _PAGES_ Matches a page range in nn-nn form.

    _PUBLICATION_
            Matches a publication name.

    _PUBLISHER_
            Matches a publisher name.

    _PUBLOC_
            Matches the location of a publisher.

    _SPAGE_ Matches the start page.

    _SUBTITLE_
            Matches a subtitle.

    _TITLE_ Matches an article title.

    _UCPUBLICATION_
            Matches a publication in entirely upper-case (e.g. JOURNAL OF
            LEMURS).

    _UCTITLE_
            Matches a title in entirely upper-case.

    _URL_   Matches a URL.

    _VOLUME_
            Matches a volume.

    _YEAR_  Matches a year (4 digits).
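
    To make the idea of placeholder matching concrete, the sketch below
    turns a template into a regular expression and applies it to a
    reference. It is an illustration only and not how the Standard parser
    is implemented: the three patterns in %pattern_for are assumptions made
    for this example, and compile_template and match_template are
    hypothetical helpers.

```perl
use strict;
use warnings;

# Illustrative patterns for three placeholders. These are assumptions
# made for this sketch, not the patterns used by the Standard parser.
my %pattern_for = (
    AUTHORS => '.+?',
    YEAR    => '\d{4}',
    TITLE   => '.+',
);

# Hypothetical helper: convert a template into a compiled regex plus the
# list of field names, in the order their capture groups appear.
sub compile_template {
    my ($template) = @_;
    my ($regex, @names) = ('');
    # Split into placeholders and the literal text between them.
    for my $part (split /(_[A-Z]+_)/, $template) {
        next if $part eq '';
        if ($part =~ /\A_([A-Z]+)_\z/) {
            my $name = $1;
            die "Unknown placeholder: $part\n"
                unless exists $pattern_for{$name};
            $regex .= "($pattern_for{$name})";
            push @names, lc $name;
        }
        else {
            $regex .= quotemeta $part;    # literal punctuation and spaces
        }
    }
    return (qr/\A$regex\z/, \@names);
}

# Hypothetical helper: return a hash reference of matched fields, or
# undef if the reference does not fit the template.
sub match_template {
    my ($template, $reference) = @_;
    my ($regex, $names) = compile_template($template);
    my @values = $reference =~ $regex or return;
    my %fields;
    @fields{@$names} = @values;
    return \%fields;
}

my $fields = match_template('_AUTHORS_ (_YEAR_) _TITLE_',
                            'Jewell, M (2002) Title');
print "$_: $fields->{$_}\n" for sort keys %$fields;
```

    Literal text between placeholders is escaped with quotemeta, so
    punctuation such as the brackets around the year must appear in the
    reference exactly as it does in the template.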

HOW TO: Integrate ParaTools with EPrints 2
    EPrints already contains ParaCite support, but it uses a specially built
    version of the module that predates ParaTools. To alter your
    cgi/paracite script to use ParaTools, you need to do the following:

    First replace

     use Citation::Parser::Simple;

    with

     use Biblio::Citation::Parser::Standard;

    Next, replace this line:

     my $parser = new Citation::Parser::Simple();

    with this line:

     my $parser = new Biblio::Citation::Parser::Standard();

    This should work fine, although you can obviously integrate ParaCite more
    if you wish.

HOW TO: Create a New Parser
  Creating a Citation Parser
    All new citation parsers should be named
    Biblio::Citation::Parser::SomeName, where SomeName is replaced with a
    unique name (ideally the author's surname). The parser should extend the
    Biblio::Citation::Parser module like so:

     package Biblio::Citation::Parser::SomeName;
     require Exporter;
     @ISA = ("Exporter", "Biblio::Citation::Parser");
     our @EXPORT_OK = ( 'new', 'parse' );

    You should then override the 'new' and 'parse' methods:

    e.g.

     sub new
     {
             my($class) = @_;
             my $self = {};
             return bless($self, $class);
     }

     sub parse
     {
             my($self, $ref) = @_;
             my $hashout = $self->extract_metadata($ref);
             return $hashout;
     }

    This makes it easy for users to swap out one reference parser for another.
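
    Putting these pieces together, a complete if very limited parser might
    look like the sketch below. The extract_metadata method shown here is a
    hypothetical helper that recognises only the "Authors (YYYY) Title"
    layout, and the field names it fills in are assumptions. The base class
    is loaded only if it is installed, so the sketch also runs on its own.

```perl
use strict;
use warnings;

package Biblio::Citation::Parser::SomeName;

# Inherit from the real base class when it is installed. The sketch
# still runs without it because both methods are defined here.
our @ISA;
BEGIN {
    push @ISA, 'Biblio::Citation::Parser'
        if eval { require Biblio::Citation::Parser; 1 };
}

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# Hypothetical helper: recognises only "Authors (YYYY) Title". The
# field names used here are assumptions made for this sketch.
sub extract_metadata {
    my ($self, $ref) = @_;
    my %meta = (ref => $ref);
    if ($ref =~ /\A(.+?)\s*\((\d{4})\)\s*(.+)\z/) {
        @meta{qw(authors year title)} = ($1, $2, $3);
    }
    return \%meta;
}

sub parse {
    my ($self, $ref) = @_;
    return $self->extract_metadata($ref);
}

package main;

my $parser   = Biblio::Citation::Parser::SomeName->new();
my $metadata = $parser->parse('Jewell, M (2002) Parsing Examples');
print "$_: $metadata->{$_}\n" for sort keys %$metadata;
```

    Because the calling code uses only new and parse, it stays the same
    when a different parser class is substituted.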


==============================================================================
Problems, Questions and Feedback
==============================================================================

Bug Report Policy
    There is currently no online bug tracking system. Known bugs are listed in
    the BUGLIST file in the distribution and a list will be kept on the
    http://paracite.eprints.org/developers/ site.

    If you identify a bug or "issue" (issues are not bugs, but are things
    which could be clearer or better), and it's not already listed on the
    site, please let us know at paracite@ecs.soton.ac.uk. Include all the
    information you can, such as the version of Biblio::Citation::Parser (see
    VERSION if you're not sure) and your operating system.

Where to go with Questions and Suggestions
    There is a mailing list for ParaTools (encompassing
    Biblio::Citation::Parser) which may be the right place to ask general
    questions and start discussions on broad design issues.

    To subscribe, send an email to majordomo@ecs.soton.ac.uk containing the
    text

     subscribe paratools