.. _jq:

Using jq with ia
================

`jq <https://stedolan.github.io/jq/>`_ is a lightweight and flexible command-line JSON processor.
It's a great tool for processing the JSON output of ``ia``.
This document will go over how to install or download ``jq`` and how to use it with ``ia``.

If you have a tip you'd like to add to this page, please email `jake@archive.org <mailto:jake@archive.org>`_ or send a pull request.
If you're unable to figure out a ``jq`` command to do what you need and don't see it on this page, please email `jake@archive.org <mailto:jake@archive.org>`_ for help.

Installation
------------

Downloading a binary
^^^^^^^^^^^^^^^^^^^^

The easiest way to get started with ``jq`` is to download a binary.
Binaries for Linux, OS X, and Windows are available at `https://stedolan.github.io/jq/download/ <https://stedolan.github.io/jq/download/>`_.
Once you find the binary for your OS, right-click its link and copy the URL.
Then download it from your terminal like so:

.. code:: bash

    $ curl -Ls https://github.com/stedolan/jq/releases/download/jq-1.5/jq-osx-amd64 > jq
    $ chmod +x jq  # make it executable

To confirm it's working, simply run the following.
You should see the help page.

.. code:: bash

    $ ./jq
    jq - commandline JSON processor [version 1.5]
    Usage: ./jq [options] <jq filter> [file...]

        jq is a tool for processing JSON inputs, applying the
        given filter to its JSON text inputs and producing the
        filter's results as JSON on standard output.
        The simplest filter is ., which is the identity filter,
        copying jq's input to its output unmodified (except for
        formatting).
        For more advanced filters see the jq(1) manpage ("man jq")
        and/or https://stedolan.github.io/jq

        Some of the options include:
         -c        compact instead of pretty-printed output;
         -n        use `null` as the single input value;
         -e        set the exit status code based on the output;
         -s        read (slurp) all inputs into an array; apply filter to it;
         -r        output raw strings, not JSON texts;
         -R        read raw strings, not JSON texts;
         -C        colorize JSON;
         -M        monochrome (don't colorize JSON);
         -S        sort keys of objects on output;
         --tab    use tabs for indentation;
         --arg a v    set variable $a to value <v>;
         --argjson a v    set variable $a to JSON value <v>;
         --slurpfile a f    set variable $a to an array of JSON texts read from <f>;
        See the manpage for more options.

As with the ``ia`` binary, downloading the ``jq`` binary does not install it on your system.
It's simply an executable file.
To run it, use either a relative or an absolute path. For example:

.. code:: bash

    $ ~/jq --help
    $ ./jq --help
    $ /Users/jake/jq --help

Installing with a package manager
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``jq`` can also be installed with most popular package managers:

.. code:: bash

    # Linux (Debian, Ubuntu)
    $ sudo apt-get install jq

    # OS X
    $ brew install jq

    # FreeBSD
    $ pkg install jq

    # Solaris
    $ pkgutil -i jq

    # Windows
    $ choco install jq

Please refer to `https://stedolan.github.io/jq/download/ <https://stedolan.github.io/jq/download/>`_ for more details.



Getting started
---------------

``jq`` can seem a bit overwhelming at first, so let's get started with some basic examples.
A good way to make sense of how you can access a specific metadata field is to use ``jq 'keys'``.
This will show you the top-level keys that exist in the JSON document.

.. code:: bash

    $ ia metadata nasa | jq 'keys'
    [
      "created",
      "d1",
      "d2",
      "dir",
      "files",
      "files_count",
      "is_collection",
      "item_size",
      "metadata",
      "reviews",
      "server",
      "uniq",
      "workable_servers"
    ]

To access the value of a given key, you can simply do:

.. code:: bash

    $ ia metadata nasa | jq '.files_count'
    8

As you can see, the command above returns the value for the ``files_count`` key.
There are 8 files in the item.

When working with ``ia metadata``, the ``metadata`` and ``files`` keys are likely the ones you'll want to access most.
Let's take a look at ``metadata``:

.. code:: bash

    $ ia metadata nasa | jq '.metadata | keys'
    [
      "addeddate",
      "backup_location",
      "collection",
      "description",
      "hidden",
      "homepage",
      "identifier",
      "mediatype",
      "num_recent_reviews",
      "num_subcollections",
      "num_top_dl",
      "publicdate",
      "related_collection",
      "rights",
      "show_browse_by_date",
      "show_hidden_subcollections",
      "show_search_by_year",
      "spotlight_identifier",
      "title",
      "updatedate",
      "updater",
      "uploader"
    ]

As you might notice, this is all of the item-level metadata (i.e. the JSON equivalent of an item's ``<identifier>_meta.xml`` file).
We can descend deeper into the JSON document like so:

.. code:: bash

    $ ia metadata nasa | jq '.metadata.title'
    "NASA Images"

``jq`` outputs JSON by default, which in this case is a quoted string.
To get the raw value instead, use the ``-r`` option:

.. code:: bash

    $ ia metadata nasa | jq -r '.metadata.title'
    NASA Images
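
The same path syntax reaches into arrays such as ``files``.
The sketch below uses a small hand-written document in place of real ``ia metadata`` output (the identifier, title, and file names are invented for illustration), so it can be run without network access:

.. code:: bash

    # Hypothetical, trimmed-down stand-in for `ia metadata <identifier>` output.
    item='{"metadata": {"identifier": "example-item", "title": "Example Item"},
           "files": [{"name": "a.jpg"}, {"name": "example-item_meta.xml"}]}'

    # Number of entries in the files array; prints 2.
    printf '%s\n' "$item" | jq '.files | length'

    # Name of the first file only; prints a.jpg.
    printf '%s\n' "$item" | jq -r '.files[0].name'

    # Name of every file, one per line.
    printf '%s\n' "$item" | jq -r '.files[].name'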

Search
------

``ia search`` outputs JSONL.
JSONL is a series of JSON documents separated by newlines.
In this case, one JSON document is returned per search result.
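
Because each line is its own document, ``jq`` applies the filter once per line.
The following is a minimal sketch of that behavior, using three invented identifiers in place of real ``ia search`` output.
The ``-s`` (slurp) option from the help text above collects all lines into a single array first, which is useful for counting or aggregating:

.. code:: bash

    # Hypothetical stand-in for `ia search` output: one JSON document per line.
    results='{"identifier": "item-a"}
    {"identifier": "item-b"}
    {"identifier": "item-c"}'

    # The filter runs once per line and prints each identifier.
    printf '%s\n' "$results" | jq -r '.identifier'

    # With -s the filter runs once over an array of all documents; prints 3.
    printf '%s\n' "$results" | jq -s 'length'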


Converting search results to CSV and other formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``jq`` can be used to parse the JSON returned by ``ia search`` into CSV or TSV files:

.. code:: bash

    $ ia search 'identifier:nasa OR identifier:stairs' --field title,date,subject | jq -r '[.identifier, .title, .date, .subject] | @csv'
    "nasa","NASA Images",,
    "stairs","stairs where i worked","2004-01-01T00:00:00Z","test"

If you'd prefer a tab-separated spreadsheet, you can replace ``@csv`` with ``@tsv`` in the command above.
More options can be found in the *Format strings and escaping* section in the `jq manual <https://stedolan.github.io/jq/manual/>`_.
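
One thing to watch for: ``@csv`` and ``@tsv`` reject rows that contain arrays, and a field holding more than one value may come back as a JSON array.
The sketch below is one possible workaround, assuming that a semicolon is an acceptable separator within a cell; the sample records are invented for illustration:

.. code:: bash

    # Hypothetical search results; "subject" is an array in one and a string in the other.
    results='{"identifier": "item-a", "title": "First", "subject": ["maps", "space"]}
    {"identifier": "item-b", "title": "Second", "subject": "test"}'

    # Join array values into a single string before building the CSV row.
    printf '%s\n' "$results" \
        | jq -r '[.identifier, .title, (.subject | if type == "array" then join(";") else . end)] | @csv'
    # "item-a","First","maps;space"
    # "item-b","Second","test"

A missing ``subject`` passes through the ``else`` branch as ``null``, which ``@csv`` writes as an empty cell.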

Catalog
-------

Get info on all of your IA-S3 tasks:

.. code:: bash

    $ ia tasks --json | jq 'select(.args.comment == "s3-put")'

Or, output a link to the tasklog for each S3 task you currently have queued or running:

.. code:: bash

    $ ia tasks nasa --json \
        | jq -r 'select(.args.comment == "s3-put") | "https://archive.org/log/\(.task_id)"'
    https://archive.org/log/469558161
    https://archive.org/log/400818482

Get the identifiers for all of your red rows:

.. code:: bash

    $ ia tasks --json | jq -r 'select(.row_type == "red").identifier'

TODO
____

Recipes to document, work in progress...


Select files of a specific format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    $ ia metadata nasa | jq '.files[] | select(.format == "JPEG")'
    {
      "name": "globe_west_540.jpg",
      "source": "original",
      "size": "66065",
      "format": "JPEG",
      "mtime": "1245274910",
      "md5": "9366a4b09386bf673c447e33d806d904",
      "crc32": "2283b5fd",
      "sha1": "3e20a009994405f535cdf07cdc2974cef2fce8f2",
      "rotation": "0"
    }

Select a file by name
^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    $ ia metadata nasa | jq '.files[] | select(.name == "nasa_meta.xml")'
    {
      "name": "nasa_meta.xml",
      "source": "metadata",
      "size": "7968",
      "format": "Metadata",
      "mtime": "1530756295",
      "md5": "06cd95343d60df0f10fb8518b349a795",
      "crc32": "6b9c6e24",
      "sha1": "c0dc994eeba245671ef53e2f6c52612722bf51d3"
    }
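
If the file name comes from a shell variable, the ``--arg`` option listed in the help text above passes it into the filter without any quoting tricks.
A minimal sketch, with an invented document standing in for ``ia metadata`` output and a hypothetical ``wanted`` variable:

.. code:: bash

    # Hypothetical stand-in for `ia metadata <identifier>` output.
    item='{"files": [{"name": "a.jpg", "format": "JPEG"}, {"name": "b.xml", "format": "Metadata"}]}'
    wanted="b.xml"

    # --arg binds the shell value to the jq variable $name; -c prints compact JSON.
    printf '%s\n' "$item" | jq -c --arg name "$wanted" '.files[] | select(.name == $name)'
    # {"name":"b.xml","format":"Metadata"}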


Get the size of a collection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    $ ia search 'collection:georgeblood' -f item_size | jq '.item_size' | paste -sd+ - | bc
    51677834206186
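
The sum can also be computed inside ``jq`` itself by slurping the results with ``-s``.
This is only a sketch on invented sample data; it assumes that a result without an ``item_size`` should count as zero:

.. code:: bash

    # Hypothetical search results; the last one has no item_size.
    results='{"identifier": "item-a", "item_size": 100}
    {"identifier": "item-b", "item_size": 250}
    {"identifier": "item-c"}'

    # Slurp into one array, default missing sizes to 0, and add them up; prints 350.
    printf '%s\n' "$results" | jq -s 'map(.item_size // 0 | tonumber) | add'

Note that ``jq`` 1.5 handles numbers as double-precision floats, so ``bc`` remains the safer choice for totals beyond 2^53.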

Getting checksums for all files in an item
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    $ ia metadata nasa | jq -r '.metadata.identifier as $id | .files[] | [$id, .name, .md5] | @tsv'
    nasa    NASAarchiveLogo.jpg    64dcc1092df36142eb4aab7cc255a4a6
    nasa    __ia_thumb.jpg    c354f821954f80516d163c23135e7dd7
    nasa    globe_west_540.jpg    9366a4b09386bf673c447e33d806d904
    nasa    globe_west_540_thumb.jpg    d3dab682c56058c8af0df5a2073b1dd1
    nasa    nasa_archive.torrent    70a7b2b44c318bac381c25febca3b2ca
    nasa    nasa_files.xml    5b8a61ea930ce04d093deebe260fd5f8
    nasa    nasa_meta.xml    06cd95343d60df0f10fb8518b349a795
    nasa    nasa_reviews.xml    711ba65d49383a25657640716c45e840

Creating histograms
^^^^^^^^^^^^^^^^^^^

This example creates a histogram of publishers weighted by ``item_size``, listing the ten largest:

.. code:: bash

    $ ia search 'collection:georgeblood' -f publisher,item_size \
        | jq -r '"\(.publisher) \(.item_size)"' \
        | awk '{arr[$1]+=$2} END {for (i in arr) {print i,arr[i]}}' \
        | sort -rn -k2 \
        | head
    Decca 9518737758200
    Victor 8067854677756
    Columbia 7221975357654
    Capitol 1944338651172
    Brunswick 1574280922547
    Bluebird 1058465142211
    Mercury 1003001910967
    MGM 898067089555
    Okeh 808308437878
    Vocalion 608766709327
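
The ``awk`` step above splits each line on whitespace, so a publisher name that contains a space is keyed on its first word only.
As a possible alternative, the grouping can be done in ``jq`` with ``group_by``.
The sketch below runs on invented sample data and assumes ``publisher`` is a single string in every result:

.. code:: bash

    # Hypothetical search results, including a publisher name with a space.
    results='{"publisher": "RCA Victor", "item_size": 300}
    {"publisher": "Decca", "item_size": 500}
    {"publisher": "RCA Victor", "item_size": 400}
    {"publisher": "Decca", "item_size": 100}'

    # Group by publisher, total each group, sort by total (largest first), print as TSV.
    printf '%s\n' "$results" \
        | jq -rs 'group_by(.publisher)
                  | map({publisher: .[0].publisher, total: (map(.item_size) | add)})
                  | sort_by(-.total)
                  | .[]
                  | [.publisher, .total]
                  | @tsv'
    # RCA Victor    700
    # Decca    600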

Get total imagecount of a collection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    $ ia search 'scanningcenter:uoft AND shiptracking:ace54704' -f imagecount | jq '.imagecount' | paste -sd+ - | bc
    8172

Selecting files based on filesize
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Get the name of every PDF file in ``goodytwoshoes00newyiala`` that is larger than 3000 bytes:

.. code:: bash

    $ ia metadata goodytwoshoes00newyiala \
        | jq -r '.files[] | select(.name | endswith(".pdf")) | select((.size | tonumber) > 3000) | .name'
    goodytwoshoes00newyiala.pdf
    goodytwoshoes00newyiala_bw.pdf

You can also include the identifier in the output like so:

.. code:: bash

    $ ia metadata goodytwoshoes00newyiala \
        | jq -r '.metadata.identifier as $i | .files[] | select(.name | endswith(".pdf")) | select((.size | tonumber) > 3000) | "\($i)/\(.name)"'
    goodytwoshoes00newyiala/goodytwoshoes00newyiala.pdf
    goodytwoshoes00newyiala/goodytwoshoes00newyiala_bw.pdf
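
The ``tonumber`` call matters because file sizes are stored as strings, as the ``"size"`` values in the file listings above show.
``jq`` orders every string after every number, so comparing a string directly with ``3000`` is always true.
A minimal sketch of the difference, using two invented file entries:

.. code:: bash

    # Hypothetical file entries with string sizes, as in `ia metadata` output.
    files='{"name": "small.txt", "size": "5"}
    {"name": "large.pdf", "size": "4096"}'

    # String-versus-number comparison: both names are printed, which is not what we want.
    printf '%s\n' "$files" | jq -r 'select(.size > 3000) | .name'

    # Numeric comparison after tonumber: only large.pdf is printed.
    printf '%s\n' "$files" | jq -r 'select((.size | tonumber) > 3000) | .name'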