File: Tutorial.pod

package info (click to toggle)
libcatmandu-marc-perl 1.320-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 1,116 kB
sloc: perl: 4,948; xml: 554; makefile: 2; sh: 1
file content (602 lines) | stat: -rw-r--r-- 19,669 bytes
parent folder | download | duplicates (2)
=encoding utf8

=head1 NAME

Catmandu::MARC::Tutorial - A documentation-only module for new users of Catmandu::MARC

=head1 SYNOPSIS

  perldoc Catmandu::MARC::Tutorial

=head1 UTF-8

=head2 MARC8 and UTF-8

The current Catmandu MARC tools are targetted for processing UTF-8 encoded files.
When you have MARC8 encoded data tools like MarcEdit <https://marcedit.reeset.net/>
or C<yaz-marcdump> <https://software.indexdata.com/yaz/doc/yaz-marcdump.html> can
be used to create a UTF-8 encoded file:

   $ yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw > marc21.utf8.raw

=head2 Unicode errors

If you process UTF-8 encoded files which contain faulty characters, you will get a fatal error message like:

  utf8 "\xD8" does not map to Unicode at ...

Use the iconv (libc6-dev Linux package) tool, to preprocess the data and discard faulty characters:

  $ iconv -c -f UTF-8 -t UTF-8 marc21.utf8.raw | catmandu convert MARC --type RAW to JSON

=head2 Convert a decomposed UTF-8 file to a combined UTF-8 file and vice versa

For example, the character ä can be represented as

"ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
"ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

The uconv tool (from the libicu-dev Linux package) can be used to convert these types of
files:

    $ uconv -x any-nfc < decomposed.txt > combined.txt
    $ uconv -x any-nfd < combined.txt > decomposed.txt

=head1 READING

=head2 Convert MARC21 records into JSON

The command below converts file data.mrc into JSON:

   $ catmandu convert MARC to JSON < data.mrc

=head2 Convert MARC21 records into MARC-XML

   $ catmandu convert MARC to MARC --type XML < data.mrc

=head2 Convert UNIMARC records into JSON, XML, ...

To read UNIMARC records use the RAW parser to get the correct character
encoding.

   $ catmandu convert MARC --type RAW to JSON < data.mrc
   $ catmandu convert MARC --type RAW to MARC --type XML < data.mrc

=head2 Create a CSV file containing all the titles

To extract data from a MARC record on needs a Fix routine. This
is a small language to manipulate data. In the example below
we extract all 245 fields from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245,title); retain(title)' < data.mrc

The Fix C<marc_map> puts the MARC 245 field in the C<title> field.
The Fix C<retain> makes sure only the title field ends up in the
CSV file.

=head2 Create a CSV file containing only the 245$a and 245$c subfields

The C<marc_map> Fix can get one or more subfields to extract from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245ac,title); retain(title)' < data.mrc

=head2 Create a CSV file which contains a repeated field

In the example below the 650a field can be repeated in some MARC records.
We will join all the repetitions in a comma delimited list for each record.

   $ catmandu convert MARC to CSV --fix 'marc_map(650a,subject,join:","); retain(subject)' < data.mrc

=head2 Create a list of all ISBN numbers in the data

In the previous example we saw how all subjects can be printed using a few Fix commands.
When a subject is repeated in a record, it will be written on one line joined by a comma:

    subject1
    subject2,subject3
    subject4

In this example, record 1 contained 'subject1', record 2 'subject2' and 'subject3' and
record 3 'subject4'. What should we use when we want a list of all values in a single long list?

In the example below we'll print all ISBN numbers in a batch of MARC records in one long list
using the Text exporter:

  $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.$append); retain(isbn)' < data.mrc

The first new thing is C<$append> in the marc_map. This will create in C<isbn> a
list of all ISBN numbers found in the C<020a> field. 
The C<Text> exporter with the C<field_sep>
option will use all list values in the C<isbn> field and writ them using new line as separator.

=head2 Create a list of all unique ISBN numbers in the data

Given the result of the previous command, it is now easy to create a unique list of ISBN numbers
with the UNIX C<uniq> command:

 $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.$append); retain(isbn)' < data.mrc | sort | uniq

=head2 Create a list of the number of subjects per record

We will create a list of subjects (650a) and count the number of items
in this list for each record. The CSV file will contain the C<_id> (record
identifier) and C<subject> the number of 650a fields.

Writing all Fixes on the command line can become tedious. In Catmandu it is possible
to create a Fix script that contains all the Fix commands.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(650a,subject.$append)
    count(subject)
    retain(_id, subject)

And execute the command:

   $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head2 Create a list of all ISBN numbers for records with type 920a == book

In the example we need an extra condition for match the content of the
920a field against the string C<book>.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(020a,isbn.$append)
    marc_map(920a,type)

    select all_match(type,"book") # select only the books
    select exists(isbn)           # select only the records with ISBN numbers

    retain(isbn)                  # only keep this field

Text after the C<#> sign are inline code comments.

And run the command:

    $ catmandu convert MARC to Text --field_sep "\n" --fix myfix.fix < data.mrc

=head2 Show which MARC records don't contain a 900a field matching some list of values

First we need to create a list of keys that need to be matched against our MARC records.
In the example below we create a CSV file with a C<key> , C<value>
header and all the keys that are OK:

    $ cat mylist.txt
    key,value
    book,OK
    article,OK
    journal,OK

Next we create a Fix script that maps the MARC 900a field to a field called
C<type>. This C<type> field we lookup in the C<mylist.txt> file. If a match
is found, then the C<type> field will contain the value in the list (OK). When
no match is found then the C<type> will contain the original value. We reject
all records that have OK as C<type> and keep only the ones that weren't matched
in the file.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(900a,type)

    lookup(type,'/tmp/mylist.txt')

    reject all_match(type,OK)

    retain(_id,type)

And now run the command:

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head1 Create a CSV file of all ISSN numbers found at any MARC field

To process this information we need to create a Fix script like the
one below (line numbers are added here to explain the working of this script
but should not be included in the script):

    01: marc_map('***',text.$append)
    02:
    03: filter(text,'(\b\d{4}-?\d{3}[\dxX]\b)')
    04: replace_all(text.*,'.*(\b\d{4}-?\d{3}[\dxX]\b).*',$1)
    05:
    06: do list(path:text)
    07:   unless is_valid_issn(.)
    08:     reject()
    09:   end
    10: end
    11:
    12: vacuum()
    13:
    14: select exists(text)
    15:
    16: join_field(text,' ; ')
    17:
    18: retain(_id,text)

On line 01 all the text in the MARC record is mapped into a C<text> array.
On line 03 we filter out this array all the lines that contain an ISSN string
using a regular expression.
On line 04 the C<replace_all> is used to delete everything in the C<text>
array that isn't an ISSN number.
On line 06-10 we go over every ISSN string and check if it has a valid checksum
and erase it when not.
On line 12 we use the C<vacuum> function to remove any remaining empty fields
On line 14 we select only the records that contain a valid ISSN number
On line 16 the ISSN get joined by a semicolon ';' into a long string
On line 18 we keep only the record id and the ISSNs in for the report.

Run this Fix script (without the line number) using this command

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head2 Create a MARC validator

For this example we need a Fix script that contains validation rules we need to
check. For instance, we require to have a 245 field and at least a 008 control
field with a date filled in. This can be coded as in:

    # Check if a 245 field is present
    unless marc_has('245')
      log("no 245 field",level:ERROR)
    end

    # Check if there is more than one 245 field
    if marc_has_many('245')
      log("more than one 245 field?",level:ERROR)
    end

    # Check if in 008 position 7 to 10 contains a 4 digit number ('\d' means digit)
    unless marc_match('008/07-10','\d{4}')
      log("no 4-digit year in 008 position 7 -> 10",level:ERROR)
    end

Put this Fix script in a file C<myfix.fix> and execute the Catmandu command
with the "-D" option for logging and the Null exporter to discard the normal
output

    $ catmandu -D convert MARC to Null --fix myfix.fix < data.mrc

=head1 TRANSFORMING

=head2 Add a new MARC field

In the example bellow we add new 856 field to the record with a $u subfield containing
the Google homepage:

   marc_add(856,u,"http://www.google.com")

A control field can be added by using the '_' subfield

   marc_add(009,_,0123456789)

Maybe you want to copy the data from one subfield to another. Use the marc_map to
store the data first in a temporary field and add it later to the new field:

   # copy a subfield
   marc_map(001,tmp)

   # maybe process the data a bit
   append(tmp,"-mytest")

   # add the contents of the tmp field to the new 009 field
   marc_add(009,_,$.tmp)

=head2 Set a MARC subfield

Set the $h subfield to a new value (or create it when it doesn't exist yet):

   marc_set(100h, test123)

Only set the 100 field if the first indicator is 3

   marc_set(100[3]h, test123)

=head2 Remove a MARC (sub)field

Remove all fields 500 , 501 , 5** :

   marc_remove(5**)

Remove all 245h fields:

   marc_remove(245h)

=head2 Append text to a MARC field

Append a period to the 500 field is there isn't already there:

  do marc_each()
    unless marc_match(500, "\.$")    # Only if the current field 500 doesn't end with a period
      marc_append(500,".")           # Add to the current 500 field a period
    end
  end

Use the L<Catmandu::Fix::Bind::marc_each> Bind to loop over all MARC fields. In the
context of the C<do -- end> only one MARC field at a time is visible for the C<marc_*> fixes.

=head2 The marc_each binder

All C<marc_*> fixes will operate on all MARC fields matching a MARC path. For example,

   marc_remove(856)

will remove all 856 MARC fields. In some cases you may want to change only some of the fields
in a record. You could write:

  if marc_match(856u,"google")
     marc_remove(856)
  end

in the hope it would remove the 856 fields that contain the text "google" in the $u subfield.
Alas, this is not what will happen. The C<if> condition will match when the record contains one or
more 856u fields containing "google". The C<marc_remove> Fix will delete B<all> 856 fields. To
correctly remove only the 856 fields in the context of the C<if> statement the C<marc_each> binder
is required:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(856)
    end
  end

The C<marc_each> will loop over all MARC fields one at a time. The if statement will only match when
the current MARC field is 856 and the $u field contains "google". The C<marc_remove(856)> will only
delete the current 856 field.

In C<marc_each> binder, it seems for all Fixes as if there is only one field at a time visible in the record.
This Fix will not work:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(900)           # <-- there is only a 856 field in the current context
    end
  end

=head2 marc_copy, marc_cut and marc_paste

The L<Catmandu::Fix::marc_copy>, L<Catmandu::Fix::marc_cut>, L<Catmandu::Fix::marc_paste> Fixes
are needed when complicated edits are needed in MARC record.

The C<marc_copy> fill copy parts of a MARC record matching a MARC_PATH to a temporary variable.
This tempoarary variable will contain an ARRAY of HASHes containing the content of the MARC field.

For instance,

  marc_copy(650, tmp)

The C<tmp> will contain something like:

  tmp:[
      {
          "subfields" : [
              {
                  "a" : "Perl (Computer program language)"
              }
          ],
          "ind1" : " ",
          "ind2" : "0",
          "tag" : "650"
    },
    {
          "ind1" : " ",
          "subfields" : [
              {
                  "a" : "Web servers."
              }
          ],
          "tag" : "650",
          "ind2" : "0"
    }
  ]

This structure can be edited with all the Catmandu fixes. For instance you can set the first
indicator to '1':

  set_field(tmp.*.ind1 , 1)

The JSON path C<tmp.*.ind1> will match all the first indicators. The JSON path
C<tmp.*.tag> will match all the MARC tags. The JSON path C<tmp.*.subfields.*.a> will
match all the $a subfields. For instance, to change all 'Perl' into 'Python' in the $a subfield
use this Fix:

  replace_all(tmp.*.subfields.*.a,"Perl","Python")

When the fields need to be places back into the record the C<marc_paste> command can be used:

   marc_paste(subjects)

This will add all 650 fields in the C<tmp> temporary variable at the B<end> of the record. You can
change the MARC fields in place using the C<march_each> binder:

  do marc_each()
     # Select only the 650 fields
     if marc_has(650)
        # Create a working copy
        marc_copy(650,tmp)

        # Change some fields
        set_field(tmp.*.ind1 , 1)

        # Paste the result back
        marc_paste(tmp)
     end
  end

The C<marc_cut> Fix works like C<marc_copy> but will delete the matching MARC field from the record.

=head2 Rename MARC subfields

In the example below we rename each $1 subfield in the MARC record to $0 using
the L<Catmandu::Fix::marc_cut>, L<Catmandu::Fix::marc_paste> and L<Catmandu::Fix::rename>
fixes:

    # For each marc field...
    do marc_each()
       # Cut the field into tmp..
       marc_cut(***,tmp)

       # Rename every 1 subfield to 0
       rename(tmp.*.subfields.*,1,0)

       # And paste it back
       marc_paste(tmp)
    end

The C<marc_each> bind will loop over all the MARC fields. With C<marc_cut> we
store any field (C<***> matches every field) into a C<tmp> field. The C<marc_cut>
creates an array structure in C<tmp> which is easy to process using the Fix
language. Using the C<rename> function we search for all the subfields, and replace
the field matching the regular expression C<1> with C<0>. At the end, we paste
back the C<tmp> field into the record.

=head2 Setting and remove MARC indicators

In the example below we set every indicator1 of the 500 field to the value "0".
We will use the L<Catmandu::Fix::Bind::marc_each> bind with a loop variable:

    # For each marc field...
    do marc_each(var:this)
       # If the marc field is a 500 field
       if marc_has(500)
          # Set the indicator1 to value "0"
          set_field(this.ind1,0)
          # Store the result back into the MARC record
          marc_remove(500)
          marc_paste(this)
       end
     end

Using the same method indicators can also be deleted by setting their value to
a space " ".

=head2 Adding a new MARC subfield

In the example below we append a new MARC subfield $z to the 500 field with value test.
We will use the L<Catmandu::Fix::Bind::marc_each> bind with a loop variable:

    # For each marc field...
    do marc_each(var:this)
       # If the marc field is a 500 field
       if marc_has(500)
          # add a new subfield z
          add_field(this.subfields.$append.z,Test)
          # Store the result back into the MARC record
          marc_remove(500)
          marc_paste(this)
       end
    end

=head2 Remove all non-numeric fields from the MARC record

    # For each marc field...
    do marc_each(var:this)
       # If we have a non-numeric fields
       unless all_match(this.tag,"\d{3}")
          # Remove this tag
          marc_remove(***)
       end
    end

=head1 WRITING

=head2 Convert a MARC record into a MARC record (do nothing)

    $ catmandu convert MARC to MARC < data.mrc > output.mrc

=head2 Add a 920a field with value 'checked' to all records

    $ catmandu convert MARC to MARC --fix 'marc_add("900",a,"checked")' < data.mrc > output.mrc

=head2 Delete the 024 fields from all MARC records

    $ catmandu convert MARC to MARC --fix 'marc_remove("024")' < data.mrc > output.mrc

=head2 Set the 650p field to 'test' for all records

    $ catmandu convert MARC to MARC --fix 'marc_add("650p","test")' < data.mrc > output.mrc

=head2 Select only the records with 900a == book

    $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,book)' < data.mrc > output.mrc

The C<all_match> also allows a regular expressions:

    $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,"[Bb]ook")' < data.mrc > output.mrc

=head2 Select only the records with 900a values in a given CSV file

Create a CSV file with name,value pairs (need two columns):

    $ cat values.csv
    name,values
    book,1
    journal,1
    movie,1

    $ catmandu convert MARC to MARC --fix myfixes.txt < data.mrc > output.mrc

    with myfixes.txt like:

    do marc_each()
       marc_map(900a,test)
       lookup(test,values.csv,default:0)
       select all_match(test,1)
       remove_field(test)
    end

We use a "do marc_each() ... end" loop because 900a fields can be repeated. If a
MARC tag isn't repeatable this loop not isn't needed. With marc_map we copy
first the value of a marc subfield to a 'test' field. This test we lookup against
the CSV file. Then, we select only the records that are found in the CSV file
(and return the correct value).

=head1 DEDUPLICATION

=head2 Check for duplicate ISBN numbers in a MARC file

In this example we extract from a MARC file all the ISBN numbers from
the 020 and do a little bit of data cleaning using the L<Catmandu::Identifier>
project. To install this package, we run this command:

    $ cpanm Catmandu::Identifier

To extract all the ISBN numbers we use this Fix script 'dedup.fix':

    marc_map(020a, identifier.$append)
    replace_all(identifier.*,"\s+.*","")
    do list(path:identifier)
      isbn13(.)
    end
    do hashmap(exporter:YAML)
      copy_field(identifier,key)
      copy_field(_id,value)
    end

The first C<marc_map> fix maps every 020 field to an identifier array.
The C<replace_all> cleans the data a bit and deletes some unwanted text.
The C<do list> will transform all the ISBN numbers to ISBN13.
The C<do hashmap> will create an internal mapping table of identifier,_id key
value pairs. For very identifier, one or more _id can be stored. At the end
of all MARC processing this mapping table is dumped from memory as a YAML document.

Run this fix as:

    $ catmandu convert MARC to Null --fix dedup.fix < marc.mrc > output.yml

The output YAML file will contain the ISBN to document ID mapping. We
only need the ISBN numbers with more than one hit. We need a little bit
of cleanup on this YAML file to reach our final result. Use the following
'cleanup.fix' script:

    select exists(value.1)
    join_field(value,",")

This first C<select> fix selects only the records with more than one hit.
The C<join_field> will turn the array of results into a string. Execute
this Fix like:

    $ catmandu convert YAML to TSV --fix cleanup.fix < output.yml > result.csv

This will provide a tab delimited file of double isbn numbers in the MARC
input file.