File: Tutorial.pod

package info (click to toggle)
libcatmandu-marc-perl 1.320-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 1,116 kB
  • sloc: perl: 4,948; xml: 554; makefile: 2; sh: 1
file content (602 lines) | stat: -rw-r--r-- 19,669 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
=encoding utf8

=head1 NAME

Catmandu::MARC::Tutorial - A documentation-only module for new users of Catmandu::MARC

=head1 SYNOPSIS

  perldoc Catmandu::MARC::Tutorial

=head1 UTF-8

=head2 MARC8 and UTF-8

The current Catmandu MARC tools are targetted for processing UTF-8 encoded files.
When you have MARC8 encoded data tools like MarcEdit <https://marcedit.reeset.net/>
or C<yaz-marcdump> <https://software.indexdata.com/yaz/doc/yaz-marcdump.html> can
be used to create a UTF-8 encoded file:

   $ yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw > marc21.utf8.raw

=head2 Unicode errors

If you process UTF-8 encoded files which contain faulty characters, you will get a fatal error message like:

  utf8 "\xD8" does not map to Unicode at ...

Use the iconv (libc6-dev Linux package) tool, to preprocess the data and discard faulty characters:

  $ iconv -c -f UTF-8 -t UTF-8 marc21.utf8.raw | catmandu convert MARC --type RAW to JSON

=head2 Convert a decomposed UTF-8 file to a combined UTF-8 file and vice versa

For example, the character ä can be represented as

"ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
"ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

The uconv tool (from the libicu-dev Linux package) can be used to convert these types of
files:

    $ uconv -x any-nfc < decomposed.txt > combined.txt
    $ uconv -x any-nfd < combined.txt > decomposed.txt

=head1 READING

=head2 Convert MARC21 records into JSON

The command below converts file data.mrc into JSON:

   $ catmandu convert MARC to JSON < data.mrc

=head2 Convert MARC21 records into MARC-XML

   $ catmandu convert MARC to MARC --type XML < data.mrc

=head2 Convert UNIMARC records into JSON, XML, ...

To read UNIMARC records use the RAW parser to get the correct character
encoding.

   $ catmandu convert MARC --type RAW to JSON < data.mrc
   $ catmandu convert MARC --type RAW to MARC --type XML < data.mrc

=head2 Create a CSV file containing all the titles

To extract data from a MARC record on needs a Fix routine. This
is a small language to manipulate data. In the example below
we extract all 245 fields from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245,title); retain(title)' < data.mrc

The Fix C<marc_map> puts the MARC 245 field in the C<title> field.
The Fix C<retain> makes sure only the title field ends up in the
CSV file.

=head2 Create a CSV file containing only the 245$a and 245$c subfields

The C<marc_map> Fix can get one or more subfields to extract from MARC:

   $ catmandu convert MARC to CSV --fix 'marc_map(245ac,title); retain(title)' < data.mrc

=head2 Create a CSV file which contains a repeated field

In the example below the 650a field can be repeated in some MARC records.
We will join all the repetitions in a comma delimited list for each record.

   $ catmandu convert MARC to CSV --fix 'marc_map(650a,subject,join:","); retain(subject)' < data.mrc

=head2 Create a list of all ISBN numbers in the data

In the previous example we saw how all subjects can be printed using a few Fix commands.
When a subject is repeated in a record, it will be written on one line joined by a comma:

    subject1
    subject2,subject3
    subject4

In this example, record 1 contained 'subject1', record 2 'subject2' and 'subject3' and
record 3 'subject4'. What should we use when we want a list of all values in a single long list?

In the example below we'll print all ISBN numbers in a batch of MARC records in one long list
using the Text exporter:

  $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.$append); retain(isbn)' < data.mrc

The first new thing is C<$append> in the marc_map. This will create in C<isbn> a
list of all ISBN numbers found in the C<020a> field. 
The C<Text> exporter with the C<field_sep>
option will use all list values in the C<isbn> field and writ them using new line as separator.

=head2 Create a list of all unique ISBN numbers in the data

Given the result of the previous command, it is now easy to create a unique list of ISBN numbers
with the UNIX C<uniq> command:

 $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.$append); retain(isbn)' < data.mrc | sort | uniq

=head2 Create a list of the number of subjects per record

We will create a list of subjects (650a) and count the number of items
in this list for each record. The CSV file will contain the C<_id> (record
identifier) and C<subject> the number of 650a fields.

Writing all Fixes on the command line can become tedious. In Catmandu it is possible
to create a Fix script that contains all the Fix commands.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(650a,subject.$append)
    count(subject)
    retain(_id, subject)

And execute the command:

   $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head2 Create a list of all ISBN numbers for records with type 920a == book

In the example we need an extra condition for match the content of the
920a field against the string C<book>.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(020a,isbn.$append)
    marc_map(920a,type)

    select all_match(type,"book") # select only the books
    select exists(isbn)           # select only the records with ISBN numbers

    retain(isbn)                  # only keep this field

Text after the C<#> sign are inline code comments.

And run the command:

    $ catmandu convert MARC to Text --field_sep "\n" --fix myfix.fix < data.mrc

=head2 Show which MARC records don't contain a 900a field matching some list of values

First we need to create a list of keys that need to be matched against our MARC records.
In the example below we create a CSV file with a C<key> , C<value>
header and all the keys that are OK:

    $ cat mylist.txt
    key,value
    book,OK
    article,OK
    journal,OK

Next we create a Fix script that maps the MARC 900a field to a field called
C<type>. This C<type> field we lookup in the C<mylist.txt> file. If a match
is found, then the C<type> field will contain the value in the list (OK). When
no match is found then the C<type> will contain the original value. We reject
all records that have OK as C<type> and keep only the ones that weren't matched
in the file.

Open a text editor and create the C<myfix.fix> file with content:

    marc_map(900a,type)

    lookup(type,'/tmp/mylist.txt')

    reject all_match(type,OK)

    retain(_id,type)

And now run the command:

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head1 Create a CSV file of all ISSN numbers found at any MARC field

To process this information we need to create a Fix script like the
one below (line numbers are added here to explain the working of this script
but should not be included in the script):

    01: marc_map('***',text.$append)
    02:
    03: filter(text,'(\b\d{4}-?\d{3}[\dxX]\b)')
    04: replace_all(text.*,'.*(\b\d{4}-?\d{3}[\dxX]\b).*',$1)
    05:
    06: do list(path:text)
    07:   unless is_valid_issn(.)
    08:     reject()
    09:   end
    10: end
    11:
    12: vacuum()
    13:
    14: select exists(text)
    15:
    16: join_field(text,' ; ')
    17:
    18: retain(_id,text)

On line 01 all the text in the MARC record is mapped into a C<text> array.
On line 03 we filter out this array all the lines that contain an ISSN string
using a regular expression.
On line 04 the C<replace_all> is used to delete everything in the C<text>
array that isn't an ISSN number.
On line 06-10 we go over every ISSN string and check if it has a valid checksum
and erase it when not.
On line 12 we use the C<vacuum> function to remove any remaining empty fields
On line 14 we select only the records that contain a valid ISSN number
On line 16 the ISSN get joined by a semicolon ';' into a long string
On line 18 we keep only the record id and the ISSNs in for the report.

Run this Fix script (without the line number) using this command

    $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

=head2 Create a MARC validator

For this example we need a Fix script that contains validation rules we need to
check. For instance, we require to have a 245 field and at least a 008 control
field with a date filled in. This can be coded as in:

    # Check if a 245 field is present
    unless marc_has('245')
      log("no 245 field",level:ERROR)
    end

    # Check if there is more than one 245 field
    if marc_has_many('245')
      log("more than one 245 field?",level:ERROR)
    end

    # Check if in 008 position 7 to 10 contains a 4 digit number ('\d' means digit)
    unless marc_match('008/07-10','\d{4}')
      log("no 4-digit year in 008 position 7 -> 10",level:ERROR)
    end

Put this Fix script in a file C<myfix.fix> and execute the Catmandu command
with the "-D" option for logging and the Null exporter to discard the normal
output

    $ catmandu -D convert MARC to Null --fix myfix.fix < data.mrc

=head1 TRANSFORMING

=head2 Add a new MARC field

In the example bellow we add new 856 field to the record with a $u subfield containing
the Google homepage:

   marc_add(856,u,"http://www.google.com")

A control field can be added by using the '_' subfield

   marc_add(009,_,0123456789)

Maybe you want to copy the data from one subfield to another. Use the marc_map to
store the data first in a temporary field and add it later to the new field:

   # copy a subfield
   marc_map(001,tmp)

   # maybe process the data a bit
   append(tmp,"-mytest")

   # add the contents of the tmp field to the new 009 field
   marc_add(009,_,$.tmp)

=head2 Set a MARC subfield

Set the $h subfield to a new value (or create it when it doesn't exist yet):

   marc_set(100h, test123)

Only set the 100 field if the first indicator is 3

   marc_set(100[3]h, test123)

=head2 Remove a MARC (sub)field

Remove all fields 500 , 501 , 5** :

   marc_remove(5**)

Remove all 245h fields:

   marc_remove(245h)

=head2 Append text to a MARC field

Append a period to the 500 field is there isn't already there:

  do marc_each()
    unless marc_match(500, "\.$")    # Only if the current field 500 doesn't end with a period
      marc_append(500,".")           # Add to the current 500 field a period
    end
  end

Use the L<Catmandu::Fix::Bind::marc_each> Bind to loop over all MARC fields. In the
context of the C<do -- end> only one MARC field at a time is visible for the C<marc_*> fixes.

=head2 The marc_each binder

All C<marc_*> fixes will operate on all MARC fields matching a MARC path. For example,

   marc_remove(856)

will remove all 856 MARC fields. In some cases you may want to change only some of the fields
in a record. You could write:

  if marc_match(856u,"google")
     marc_remove(856)
  end

in the hope it would remove the 856 fields that contain the text "google" in the $u subfield.
Alas, this is not what will happen. The C<if> condition will match when the record contains one or
more 856u fields containing "google". The C<marc_remove> Fix will delete B<all> 856 fields. To
correctly remove only the 856 fields in the context of the C<if> statement the C<marc_each> binder
is required:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(856)
    end
  end

The C<marc_each> will loop over all MARC fields one at a time. The if statement will only match when
the current MARC field is 856 and the $u field contains "google". The C<marc_remove(856)> will only
delete the current 856 field.

In C<marc_each> binder, it seems for all Fixes as if there is only one field at a time visible in the record.
This Fix will not work:

  do marc_each()
    if marc_match(856u,"google")
       marc_remove(900)           # <-- there is only a 856 field in the current context
    end
  end

=head2 marc_copy, marc_cut and marc_paste

The L<Catmandu::Fix::marc_copy>, L<Catmandu::Fix::marc_cut>, L<Catmandu::Fix::marc_paste> Fixes
are needed when complicated edits are needed in MARC record.

The C<marc_copy> fill copy parts of a MARC record matching a MARC_PATH to a temporary variable.
This tempoarary variable will contain an ARRAY of HASHes containing the content of the MARC field.

For instance,

  marc_copy(650, tmp)

The C<tmp> will contain something like:

  tmp:[
      {
          "subfields" : [
              {
                  "a" : "Perl (Computer program language)"
              }
          ],
          "ind1" : " ",
          "ind2" : "0",
          "tag" : "650"
    },
    {
          "ind1" : " ",
          "subfields" : [
              {
                  "a" : "Web servers."
              }
          ],
          "tag" : "650",
          "ind2" : "0"
    }
  ]

This structure can be edited with all the Catmandu fixes. For instance you can set the first
indicator to '1':

  set_field(tmp.*.ind1 , 1)

The JSON path C<tmp.*.ind1> will match all the first indicators. The JSON path
C<tmp.*.tag> will match all the MARC tags. The JSON path C<tmp.*.subfields.*.a> will
match all the $a subfields. For instance, to change all 'Perl' into 'Python' in the $a subfield
use this Fix:

  replace_all(tmp.*.subfields.*.a,"Perl","Python")

When the fields need to be places back into the record the C<marc_paste> command can be used:

   marc_paste(subjects)

This will add all 650 fields in the C<tmp> temporary variable at the B<end> of the record. You can
change the MARC fields in place using the C<march_each> binder:

  do marc_each()
     # Select only the 650 fields
     if marc_has(650)
        # Create a working copy
        marc_copy(650,tmp)

        # Change some fields
        set_field(tmp.*.ind1 , 1)

        # Paste the result back
        marc_paste(tmp)
     end
  end

The C<marc_cut> Fix works like C<marc_copy> but will delete the matching MARC field from the record.

=head2 Rename MARC subfields

In the example below we rename each $1 subfield in the MARC record to $0 using
the L<Catmandu::Fix::marc_cut>, L<Catmandu::Fix::marc_paste> and L<Catmandu::Fix::rename>
fixes:

    # For each marc field...
    do marc_each()
       # Cut the field into tmp..
       marc_cut(***,tmp)

       # Rename every 1 subfield to 0
       rename(tmp.*.subfields.*,1,0)

       # And paste it back
       marc_paste(tmp)
    end

The C<marc_each> bind will loop over all the MARC fields. With C<marc_cut> we
store any field (C<***> matches every field) into a C<tmp> field. The C<marc_cut>
creates an array structure in C<tmp> which is easy to process using the Fix
language. Using the C<rename> function we search for all the subfields, and replace
the field matching the regular expression C<1> with C<0>. At the end, we paste
back the C<tmp> field into the record.

=head2 Setting and remove MARC indicators

In the example below we set every indicator1 of the 500 field to the value "0".
We will use the L<Catmandu::Fix::Bind::marc_each> bind with a loop variable:

    # For each marc field...
    do marc_each(var:this)
       # If the marc field is a 500 field
       if marc_has(500)
          # Set the indicator1 to value "0"
          set_field(this.ind1,0)
          # Store the result back into the MARC record
          marc_remove(500)
          marc_paste(this)
       end
     end

Using the same method indicators can also be deleted by setting their value to
a space " ".

=head2 Adding a new MARC subfield

In the example below we append a new MARC subfield $z to the 500 field with value test.
We will use the L<Catmandu::Fix::Bind::marc_each> bind with a loop variable:

    # For each marc field...
    do marc_each(var:this)
       # If the marc field is a 500 field
       if marc_has(500)
          # add a new subfield z
          add_field(this.subfields.$append.z,Test)
          # Store the result back into the MARC record
          marc_remove(500)
          marc_paste(this)
       end
    end

=head2 Remove all non-numeric fields from the MARC record

    # For each marc field...
    do marc_each(var:this)
       # If we have a non-numeric fields
       unless all_match(this.tag,"\d{3}")
          # Remove this tag
          marc_remove(***)
       end
    end

=head1 WRITING

=head2 Convert a MARC record into a MARC record (do nothing)

    $ catmandu convert MARC to MARC < data.mrc > output.mrc

=head2 Add a 920a field with value 'checked' to all records

    $ catmandu convert MARC to MARC --fix 'marc_add("900",a,"checked")' < data.mrc > output.mrc

=head2 Delete the 024 fields from all MARC records

    $ catmandu convert MARC to MARC --fix 'marc_remove("024")' < data.mrc > output.mrc

=head2 Set the 650p field to 'test' for all records

    $ catmandu convert MARC to MARC --fix 'marc_add("650p","test")' < data.mrc > output.mrc

=head2 Select only the records with 900a == book

    $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,book)' < data.mrc > output.mrc

The C<all_match> also allows a regular expressions:

    $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,"[Bb]ook")' < data.mrc > output.mrc

=head2 Select only the records with 900a values in a given CSV file

Create a CSV file with name,value pairs (need two columns):

    $ cat values.csv
    name,values
    book,1
    journal,1
    movie,1

    $ catmandu convert MARC to MARC --fix myfixes.txt < data.mrc > output.mrc

    with myfixes.txt like:

    do marc_each()
       marc_map(900a,test)
       lookup(test,values.csv,default:0)
       select all_match(test,1)
       remove_field(test)
    end

We use a "do marc_each() ... end" loop because 900a fields can be repeated. If a
MARC tag isn't repeatable this loop not isn't needed. With marc_map we copy
first the value of a marc subfield to a 'test' field. This test we lookup against
the CSV file. Then, we select only the records that are found in the CSV file
(and return the correct value).

=head1 DEDUPLICATION

=head2 Check for duplicate ISBN numbers in a MARC file

In this example we extract from a MARC file all the ISBN numbers from
the 020 and do a little bit of data cleaning using the L<Catmandu::Identifier>
project. To install this package, we run this command:

    $ cpanm Catmandu::Identifier

To extract all the ISBN numbers we use this Fix script 'dedup.fix':

    marc_map(020a, identifier.$append)
    replace_all(identifier.*,"\s+.*","")
    do list(path:identifier)
      isbn13(.)
    end
    do hashmap(exporter:YAML)
      copy_field(identifier,key)
      copy_field(_id,value)
    end

The first C<marc_map> fix maps every 020 field to an identifier array.
The C<replace_all> cleans the data a bit and deletes some unwanted text.
The C<do list> will transform all the ISBN numbers to ISBN13.
The C<do hashmap> will create an internal mapping table of identifier,_id key
value pairs. For very identifier, one or more _id can be stored. At the end
of all MARC processing this mapping table is dumped from memory as a YAML document.

Run this fix as:

    $ catmandu convert MARC to Null --fix dedup.fix < marc.mrc > output.yml

The output YAML file will contain the ISBN to document ID mapping. We
only need the ISBN numbers with more than one hit. We need a little bit
of cleanup on this YAML file to reach our final result. Use the following
'cleanup.fix' script:

    select exists(value.1)
    join_field(value,",")

This first C<select> fix selects only the records with more than one hit.
The C<join_field> will turn the array of results into a string. Execute
this Fix like:

    $ catmandu convert YAML to TSV --fix cleanup.fix < output.yml > result.csv

This will provide a tab delimited file of double isbn numbers in the MARC
input file.