File: README.rst

package info (click to toggle)
python-ijson 3.4.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 668 kB
  • sloc: python: 2,687; ansic: 1,776; sh: 4; makefile: 3
file content (652 lines) | stat: -rw-r--r-- 20,728 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
.. image:: https://github.com/ICRAR/ijson/actions/workflows/deploy-to-pypi.yml/badge.svg
    :target: https://github.com/ICRAR/ijson/actions/workflows/deploy-to-pypi.yml

.. image:: https://github.com/ICRAR/ijson/actions/workflows/fast-built-and-test.yml/badge.svg
    :target: https://github.com/ICRAR/ijson/actions/workflows/deploy-to-pypi.yml

.. image:: https://coveralls.io/repos/github/ICRAR/ijson/badge.svg?branch=master
    :target: https://coveralls.io/github/ICRAR/ijson?branch=master

.. image:: https://badge.fury.io/py/ijson.svg
    :target: https://badge.fury.io/py/ijson

.. image:: https://img.shields.io/pypi/pyversions/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dd/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dw/ijson.svg
    :target: https://pypi.python.org/pypi/ijson

.. image:: https://img.shields.io/pypi/dm/ijson.svg
    :target: https://pypi.python.org/pypi/ijson


=====
ijson
=====

Ijson is an iterative JSON parser with standard Python iterator interfaces.

.. contents::
   :local:


Installation
============

Ijson is hosted in PyPI, so you should be able to install it via ``pip``::

  pip install ijson

Binary wheels are provided
for major platforms
and python versions.
These are built and published automatically
using `cibuildwheel <https://cibuildwheel.readthedocs.io/en/stable/>`_
via GitHub Actions.


Usage
=====

All usage example will be using a JSON document describing geographical
objects:

.. code-block:: json

    {
      "earth": {
        "europe": [
          {"name": "Paris", "type": "city", "info": { ... }},
          {"name": "Thames", "type": "river", "info": { ... }},
          // ...
        ],
        "america": [
          {"name": "Texas", "type": "state", "info": { ... }},
          // ...
        ]
      }
    }


High-level interfaces
---------------------

Most common usage is having ijson yield native Python objects out of a JSON
stream located under a prefix.
This is done using the ``items`` function.
Here's how to process all European cities:

.. code-block::  python

    import ijson

    f = urlopen('http://.../')
    objects = ijson.items(f, 'earth.europe.item')
    cities = (o for o in objects if o['type'] == 'city')
    for city in cities:
        do_something_with(city)

For how to build a prefix see the prefix_ section below.

Other times it might be useful to iterate over object members
rather than objects themselves (e.g., when objects are too big).
In that case one can use the ``kvitems`` function instead:

.. code-block::  python

    import ijson

    f = urlopen('http://.../')
    european_places = ijson.kvitems(f, 'earth.europe.item')
    names = (v for k, v in european_places if k == 'name')
    for name in names:
        do_something_with(name)


Lower-level interfaces
----------------------

Sometimes when dealing with a particularly large JSON payload it may worth to
not even construct individual Python objects and react on individual events
immediately producing some result.
This is achieved using the ``parse`` function:

.. code-block::  python

    import ijson

    parser = ijson.parse(urlopen('http://.../'))
    stream.write('<geo>')
    for prefix, event, value in parser:
        if (prefix, event) == ('earth', 'map_key'):
            stream.write('<%s>' % value)
            continent = value
        elif prefix.endswith('.name'):
            stream.write('<object name="%s"/>' % value)
        elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
            stream.write('</%s>' % continent)
    stream.write('</geo>')

Even more bare-bones is the ability to react on individual events
without even calculating a prefix
using the ``basic_parse`` function:

.. code-block:: python

    import ijson

    events = ijson.basic_parse(urlopen('http://.../'))
    num_names = sum(1 for event, value in events
                    if event == 'map_key' and value == 'name')


.. _command_line:

Command line
------------

A command line utility is included with ijson
to help visualise the output of each of the routines above.
It reads JSON from the standard input,
and it prints the results of the parsing method chosen by the user
to the standard output.

The tool is available by running the ``ijson.dump`` module.
For example::

 $> echo '{"A": 0, "B": [1, 2, 3, 4]}' | python -m ijson.dump -m parse
 #: path, name, value
 --------------------
 0: , start_map, None
 1: , map_key, A
 2: A, number, 0
 3: , map_key, B
 4: B, start_array, None
 5: B.item, number, 1
 6: B.item, number, 2
 7: B.item, number, 3
 8: B.item, number, 4
 9: B, end_array, None
 10: , end_map, None

Using ``-h/--help`` will show all available options.


.. _benchmarking:

Benchmarking
------------

A command line utility is included with ijson
to help benchmarking the different methods offered by the package.
It offers some built-in example inputs
that try to mimic different scenarios,
but more importantly it also supports user-provided inputs.
You can also specify which backends to time,
number of iterations,
and more.

The tool is available by running the ``ijson.benchmark`` module.
For example::

 $> python -m ijson.benchmark my/json/file.json -m items -p values.item

Using ``-h/--help`` will show all available options.


``bytes``/``str`` support
-------------------------

Although not usually how they are meant to be run,
all the functions above also accept
``bytes`` and ``str`` objects
directly as inputs.
These are then internally wrapped into a file object,
and further processed.
This is useful for testing and prototyping,
but probably not extremely useful in real-life scenarios.


``asyncio`` support
-------------------

All of the methods above
work also on file-like asynchronous objects,
so they can be iterated asynchronously.
In other words, something like this:

.. code-block:: python

   import asyncio
   import ijson

   async def run():
      f = await async_urlopen('http://..../')
      async for object in ijson.items(f, 'earth.europe.item'):
         if object['type'] == 'city':
            do_something_with(city)
   asyncio.run(run())

An explicit set of ``*_async`` functions also exists
offering the same functionality,
except they will fail if anything other
than a file-like asynchronous object is given to them.
(so the example above can also be written using ``ijson.items_async``).
In fact in ijson version 3.0
this was the only way to access
the ``asyncio`` support.


Intercepting events
-------------------

The four routines shown above
internally chain against each other:
tuples generated by ``basic_parse``
are the input for ``parse``,
whose results are the input to ``kvitems`` and ``items``.

Normally users don't see this interaction,
as they only care about the final output
of the function they invoked,
but there are occasions when tapping
into this invocation chain this could be handy.
This is supported
by passing the output of one function
(i.e., an iterable of events, usually a generator)
as the input of another,
opening the door for user event filtering or injection.

For instance if one wants to skip some content
before full item parsing:

.. code-block:: python

  import io
  import ijson

  parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
  while True:
      prefix, event, value = next(parse_events)
      if value == "skip":
          break
  for obj in ijson.items(parse_events, 'item'):
      print(obj)


Note that this interception
only makes sense for the ``basic_parse -> parse``,
``parse -> items`` and ``parse -> kvitems`` interactions.

Note also that event interception
is currently not supported
by the ``async`` functions.


Push interfaces
---------------

All examples above use a file-like object as the data input
(both the normal case, and for ``asyncio`` support),
and hence are "pull" interfaces,
with the library reading data as necessary.
If for whatever reason it's not possible to use such method,
you can still **push** data
through yet a different interface: `coroutines <https://www.python.org/dev/peps/pep-0342/>`_
(via generators, not ``asyncio`` coroutines).
Coroutines effectively allow users
to send data to them at any point in time,
with a final *target* coroutine-like object
receiving the results.

In the following example
the user is doing the reading
instead of letting the library do it:

.. code-block:: python

   import ijson

   @ijson.coroutine
   def print_cities():
      while True:
         obj = (yield)
         if obj['type'] != 'city':
            continue
         print(obj)

   coro = ijson.items_coro(print_cities(), 'earth.europe.item')
   f = urlopen('http://.../')
   for chunk in iter(functools.partial(f.read, buf_size)):
      coro.send(chunk)
   coro.close()

All four ijson iterators
have a ``*_coro`` counterpart
that work by pushing data into them.
Instead of receiving a file-like object
and option buffer size as arguments,
they receive a single ``target`` argument,
which should be a coroutine-like object
(anything implementing a ``send`` method)
through which results will be published.

An alternative to providing a coroutine
is to use ``ijson.sendable_list`` to accumulate results,
providing the list is cleared after each parsing iteration,
like this:

.. code-block:: python

   import ijson

   events = ijson.sendable_list()
   coro = ijson.items_coro(events, 'earth.europe.item')
   f = urlopen('http://.../')
   for chunk in iter(functools.partial(f.read, buf_size)):
      coro.send(chunk)
      process_accumulated_events(events)
      del events[:]
   coro.close()
   process_accumulated_events(events)


.. _options:

Options
=======

Additional options are supported by **all** ijson functions
to give users more fine-grained control over certain operations:

- The ``use_float`` option (defaults to ``False``)
  controls how non-integer values are returned to the user.
  If set to ``True`` users receive ``float()`` values;
  otherwise ``Decimal`` values are constructed.
  Note that building ``float`` values is usually faster,
  but on the other hand there might be loss of precision
  (which most applications will not care about)
  and will raise an exception when overflow occurs
  (e.g., if ``1e400`` is encountered).
  This option also has the side-effect
  that integer numbers bigger than ``2^64``
  (but *sometimes* ``2^32``, see capabilities_)
  will also raise an overflow error,
  due to similar reasons.
  Future versions of ijson
  might change the default value of this option
  to ``True``.
- The ``multiple_values`` option (defaults to ``False``)
  controls whether multiple top-level values are supported.
  JSON content should contain a single top-level value
  (see `the JSON Grammar <https://tools.ietf.org/html/rfc7159#section-2>`_).
  However there are plenty of JSON files out in the wild
  that contain multiple top-level values,
  often separated by newlines.
  By default ijson will fail to process these
  with a ``parse error: trailing garbage`` error
  unless ``multiple_values=True`` is specified.
- Similarly the ``allow_comments`` option (defaults to ``False``)
  controls whether C-style comments (e.g., ``/* a comment */``),
  which are not supported by the JSON standard,
  are allowed in the content or not.
- For functions taking a file-like object,
  an additional ``buf_size`` option (defaults to ``65536`` or 64KB)
  specifies the amount of bytes the library
  should attempt to read each time.
- The ``items`` and ``kvitems`` functions, and all their variants,
  have an optional ``map_type`` argument (defaults to ``dict``)
  used to construct objects from the JSON stream.
  This should be a dict-like type supporting item assignment.


Events
======

When using the lower-level ``ijson.parse`` function,
three-element tuples are generated
containing a prefix, an event name, and a value.
Events will be one of the following:

- ``start_map`` and ``end_map`` indicate
  the beginning and end of a JSON object, respectively.
  They carry a ``None`` as their value.
- ``start_array`` and ``end_array`` indicate
  the beginning and end of a JSON array, respectively.
  They also carry a ``None`` as their value.
- ``map_key`` indicates the name of a field in a JSON object.
  Its associated value is the name itself.
- ``null``, ``boolean``, ``integer``, ``double``, ``number`` and ``string``
  all indicate actual content, which is stored in the associated value.


.. _prefix:

Prefix
======

A prefix represents the context within a JSON document
where an event originates at.
It works as follows:

- It starts as an empty string.
- A ``<name>`` part is appended when the parser starts parsing the contents
  of a JSON object member called ``name``,
  and removed once the content finishes.
- A literal ``item`` part is appended when the parser is parsing
  elements of a JSON array,
  and removed when the array ends.
- Parts are separated by ``.``.

When using the ``ijson.items`` function,
the prefix works as the selection
for which objects should be automatically built and returned by ijson.


.. _backends:

Backends
========

Ijson provides several implementations of the actual parsing in the form of
backends located in ijson/backends:

- ``yajl2_c``: a C extension using `YAJL <http://lloyd.github.io/yajl/>`__ 2.x.
  This is the fastest, but *might* require a compiler and the YAJL development files
  to be present when installing this package.
  Binary wheel distributions exist for major platforms/architectures to spare users
  from having to compile the package.
- ``yajl2_cffi``: wrapper around `YAJL <http://lloyd.github.io/yajl/>`__ 2.x
  using CFFI.
- ``yajl2``: wrapper around YAJL 2.x using ctypes, for when you can't use CFFI
  for some reason.
- ``yajl``: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
- ``python``: pure Python parser, good to use with PyPy

This list of backend names is available under the ``ijson.ALL_BACKENDS`` constant.

You can import a specific backend and use it in the same way as the top level
library:

.. code-block::  python

    import ijson.backends.yajl2_cffi as ijson

    for item in ijson.items(...):
        # ...

Importing the top level library as ``import ijson``
uses the first available backend in the same order of the list above,
and its name is recorded under ``ijson.backend``.
If the ``IJSON_BACKEND`` environment variable is set
its value takes precedence and is used to select the default backend.

You can also use the ``ijson.get_backend`` function
to get a specific backend based on a name:

.. code-block:: python

    backend = ijson.get_backend('yajl2_c')
    for item in backend.items(...):
        # ...


.. _capabilities:

Capabilities
------------

Apart from their performance,
all backends are designed to support the same capabilities.
There are however some small known differences,
all of which can be queried by inspecting
the ``capabilities`` module constant.
It contains the following members:

* ``c_comments``: C-style comments are supported.
* ``multiple_values``: multiple top-level JSON values are supported.
* ``detects_invalid_leading_zeros``: numbers with leading zeroes
  are reported as invalid (as they should, as pert the JSON standard),
  raising a ``ValueError``.
* ``detects_incomplete_json_tokens``: detects incomplete JSON tokens
  at the end of an incomplete document (e.g., ``{"a": fals``),
  raising an ``IncompleteJSONError``.
* ``int64``: when using ``use_float=True``,
    values greater than or equal to ``2^32`` are correctly returned.

These capabilities are supported by all backends,
with the following exceptions:

* The ``yajl`` backend doesn't support ``multiple_values``,
  ``detects_invalid_leading_zeros`` and ``detects_incomplete_json_tokens``.
  It also doesn't support ``int64``
  in platforms with a 32-bit C ``long`` type.

* The ``python`` backend doesn't support ``c_comments``.


Performance tips
================

In more-or-less decreasing order,
these are the most common actions you can take
to ensure you get most of the performance
out of ijson:

- Make sure you use the fastest backend available.
  See backends_ for details.
- If you know your JSON data
  contains only numbers that are "well behaved"
  consider turning on the ``use_float`` option.
  See options_ for details.
- Make sure you feed ijson with binary data
  instead of text data.
  See faq_ #1 for details.
- Play with the ``buf_size`` option,
  as depending on your data source and your system
  a value different from the default
  might show better performance.
  See options_ for details.

The benchmarking_ tool should help
with trying some of these options
and observing their effect on your input files.


.. _faq:

FAQ
===

#. **Q**: Does ijson work with ``bytes`` or ``str`` values?

   **A**: In short: both are accepted as input, outputs are only ``str``.

   All ijson functions expecting a file-like object
   should ideally be given one
   that is opened in binary mode
   (i.e., its ``read`` function returns ``bytes`` objects, not ``str``).
   However if a text-mode file object is given
   then the library will automatically
   encode the strings into UTF-8 bytes.
   A warning is currently issued (but not visible by default)
   alerting users about this automatic conversion.

   On the other hand ijson always returns text data
   (JSON string values, object member names, event names, etc)
   as ``str`` objects.
   This mimics the behavior of the system ``json`` module.

#. **Q**: How are numbers dealt with?

   **A**: ijson returns ``int`` values for integers
   and ``decimal.Decimal`` values for floating-point numbers.
   This is mostly because of historical reasons.
   Since 3.1 a new ``use_float`` option (defaults to ``False``)
   is available to return ``float`` values instead.
   See the options_ section for details.

#. **Q**: I'm getting an ``UnicodeDecodeError``, or an ``IncompleteJSONError`` with no message

   **A**: This error is caused by byte sequences that are not valid in UTF-8.
   In other words, the data given to ijson is not *really* UTF-8 encoded,
   or at least not properly.

   Depending on where the data comes from you have different options:

   * If you have control over the source of the data, fix it.

   * If you have a way to intercept the data flow,
     do so and pass it through a "byte corrector".
     For instance, if you have a shell pipeline
     feeding data through ``stdin`` into your process
     you can add something like ``... | iconv -f utf8 -t utf8 -c | ...``
     in between to correct invalid byte sequences.

   * If you are working purely in python,
     you can create a UTF-8 decoder
     using codecs' `incrementaldecoder <https://docs.python.org/3/library/codecs.html#codecs.getincrementaldecoder>`_
     to leniently decode your bytes into strings,
     and feed those strings (using a file-like class) into ijson
     (see our `string_reader_async internal class <https://github.com/ICRAR/ijson/blob/0157f3c65a7986970030d3faa75979ee205d3806/ijson/utils35.py#L19>`_
     for some inspiration).

   In the future ijson might offer something out of the box
   to deal with invalid UTF-8 byte sequences.

#. **Q**: I'm getting ``parse error: trailing garbage`` or ``Additional data found`` errors

   **A**: This error signals that the input
   contains more data than the top-level JSON value it's meant to contain.
   This is *usually* caused by JSON data sources
   containing multiple values, and is *usually* solved
   by passing the ``multiple_values=True`` to the ijson function in use.
   See the options_ section for details.


Acknowledgements
================

ijson was originally developed and actively maintained until 2016
by `Ivan Sagalaev <http://softwaremaniacs.org/>`_.
In 2019 he
`handed over <https://github.com/isagalaev/ijson/pull/58#issuecomment-500596815>`_
the maintenance of the project and the PyPI ownership.

Python parser in ijson is relatively simple thanks to `Douglas Crockford
<http://www.crockford.com/>`_ who invented a strict, easy to parse syntax.

The `YAJL <https://lloyd.github.io/yajl>`__ library by `Lloyd Hilaiel
<http://lloyd.io/>`_ is the most popular and efficient way to parse JSON in an
iterative fashion.
When building the library ourselves,
we use `our own fork <https://github.com/rtobar/yajl>`__
that contains fixes for all known CVEs.

Ijson was inspired by `yajl-py <http://pykler.github.com/yajl-py/>`_ wrapper by
`Hatem Nassrat <http://www.nassrat.ca/>`_. Though ijson borrows almost nothing
from the actual yajl-py code it was used as an example of integration with yajl
using ctypes.