ijson design notes
##################
Version 3.0 will come with a complete re-design
of the underlying mechanism used to move data
through the different library layers.
This has been done to address some limitations of the current design,
and to allow more advanced use cases out of the box.
This document first explains the design used in ijson 2.x,
then presents the new design coming with ijson 3.x,
and shows how it offers both backward compatibility
(with no performance cost)
and new features.
2.x design (pull)
=================
ijson is a library for iterating over big (or small) JSON documents.
The main goal of the library
is to avoid loading the entire document into memory;
instead the document is iterated over in small chunks,
results are given back to the user as they are found,
and hence memory consumption stays low
throughout the whole parsing process.
This is achieved using python generators,
which naturally lend themselves to this kind of problem:
just write a ``for`` loop and ``yield`` values out of it,
and the caller can iterate over them in their own ``for`` loop.
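
As a quick, toy illustration
(this is not ijson code, and ``data.txt`` is a made-up file name),
two generators chained in this fashion could look like this:

.. code-block:: python

   def lines(f):
       # lowest level: read the file lazily, one line at a time
       for line in f:
           yield line.strip()

   def lengths(f):
       # built on top of ``lines``: transform each value as it flows by
       for line in lines(f):
           yield len(line)

   # the caller drives everything with its own ``for`` loop
   with open('data.txt') as f:
       for n in lengths(f):
           print(n)
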
ijson offers four modes of iterating over a JSON document
(until not long ago there were only three; ``kvitems`` is the most recent addition),
each returning different types of results
(a short usage sketch for all four follows the list):
* ``basic_parse`` is the most basic one,
and returns ``(event, value)`` tuples
each time an element is found in the JSON stream
(e.g., the beginning of an array, an object member name, etc.).
* ``parse`` returns ``(prefix, event, value)`` tuples
containing not only the information returned by ``basic_parse``,
but also the *prefix* or *location* in the document hierarchy
under which the event occurred.
* ``items`` returns fully-built python objects
constructed from the contents found at a given ``prefix``.
* Finally, ``kvitems`` is similar to ``items``,
but instead of returning fully-built objects
it returns ``(name, value)`` tuples,
one for each of their *members*,
which is useful when the individual objects in the JSON document
are too big.
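
For illustration, this is roughly how the four modes are used
(the file name and the ``'earth.item'`` prefix are made up for the example):

.. code-block:: python

   import ijson

   with open('data.json', 'rb') as f:
       # lowest level: raw (event, value) tuples
       for event, value in ijson.basic_parse(f):
           print(event, value)

   with open('data.json', 'rb') as f:
       # (prefix, event, value) tuples, including the location in the document
       for prefix, event, value in ijson.parse(f):
           print(prefix, event, value)

   with open('data.json', 'rb') as f:
       # fully-built python objects found under the given prefix
       for obj in ijson.items(f, 'earth.item'):
           print(obj)

   with open('data.json', 'rb') as f:
       # (name, value) pairs of the members of the objects under the prefix
       for name, value in ijson.kvitems(f, 'earth.item'):
           print(name, value)
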
As a side note,
in ijson there are also different *backends*,
all of which offer the same four iteration modes.
Most of them share the bulk of the code,
but, most interestingly, the ``yajl2_c`` backend doesn't:
it re-implements all the logic in C.
These four iteration modes
are all implemented as python generators.
They also have the nice property
that they are built on top of one another.
For example, ``parse`` basically works like this:

.. code-block:: python

   def parse(...):
       for event, value in basic_parse(...):
           prefix = calculate_prefix(event, value)
           yield (prefix, event, value)

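Similarly, ``items`` can be thought of as sitting on top of ``parse``.
The following is only a rough sketch (not ijson's actual code),
assuming a helper along the lines of ijson's ``ObjectBuilder``
that accumulates events into a fully-built python object:

.. code-block:: python

   def items(f, prefix):
       builder = None
       for path, event, value in parse(f):
           if builder is not None:
               # an object under ``prefix`` is being built: keep feeding it
               builder.event(event, value)
               if path == prefix and event in ('end_map', 'end_array'):
                   yield builder.value
                   builder = None
           elif path == prefix:
               if event in ('start_map', 'start_array'):
                   # a new object starts right under ``prefix``
                   builder = ObjectBuilder()
                   builder.event(event, value)
               else:
                   # a plain scalar value under ``prefix``
                   yield value
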
All in all, the different generators are combined like this
to form a *pipeline*::

                                       +---------+
                                   +---| kvitems |
   +-------------+     +-------+   |   +---------+
   | basic_parse |-----| parse |---+
   +-------------+     +-------+   |   +---------+
                                   +---| items   |
                                       +---------+

Now, what the diagram above doesn't show
is *who* calls *whom*.
Python generators yielding values
work in a "pull" fashion:
the caller drives the execution by calling ``next(generator)``,
the generator runs
until the next ``yield`` statement is executed
and the yielded value is returned to the caller
(or until the generator exits,
in which case ``StopIteration`` is raised).
So, adding the direction of how calls are made
(including internal and external call points),
the diagram above looks like this::

                                                 +---------+  next()
                                             +---| kvitems |<-------
   +-------------+ next()  +-------+ next()  |   +---------+
   | basic_parse |<--------| parse |<--------+
   +-------------+          +-------+          |   +---------+  next()
          ^                     ^              +---| items   |<-------
          | next()              | next()           +---------+

The only missing bit in this diagram now is: where does data reading happen?
Because this is a pull model, the only possible solution is to do the reading
at the lowest level possible: ``basic_parse`` reads a chunk out of the JSON
stream when it is invoked for new tuples (and none are held in memory),
calculates new tuples with the parsing results,
yields them,
and eventually, when there is no more content in the stream, it exits,
causing the whole pipeline to exit.
In other words, ``basic_parse`` exhausts the JSON stream,
but users drive the process
by iterating over the generator of choice until it returns.
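
In skeleton form (again not actual code,
with ``parse_chunk`` standing in for the real lexing and parsing logic),
the reading loop at the bottom of the pipeline looks roughly like this:

.. code-block:: python

   def basic_parse(f, buf_size=64 * 1024):
       while True:
           chunk = f.read(buf_size)
           # parse whatever arrived and yield the resulting tuples
           for event, value in parse_chunk(chunk):
               yield event, value
           # an empty read means the stream is exhausted
           if not chunk:
               break
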
All in all, the library works like this::

                                                                  +---------+  next()
                                                              +---| kvitems |<-------
   +------+  read() +-------------+ next()  +-------+ next()  |   +---------+
   | file |<--------| basic_parse |<--------| parse |<--------+
   +------+         +-------------+          +-------+          |   +---------+  next()
                           ^                     ^              +---| items   |<-------
                           | next()              | next()           +---------+

Limitations
===========
This is all well and good if you live in a blocking I/O world.
All you need to do is pass down
your blocking, file-like object
when you call ``items``, ``kvitems``, ``parse`` or ``basic_parse``,
and ijson will do the reading for you
as you iterate over the results.
However, this is also a fundamental limitation,
as it prevents users from easily using ijson in other scenarios.
One such scenario comes up when using ``asyncio``.
In an ``asyncio`` world users don't perform blocking reads anymore,
and are instead handed chunks of data to operate on.
The only way to use ijson in these situations
is to emulate a blocking, file-like object,
which is error-prone and not a great solution overall.
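
To give an idea, the kind of shim users have to write in that situation
looks more or less like this (illustrative only, not part of ijson):

.. code-block:: python

   class FakeFile(object):
       """Accumulates chunks received elsewhere and replays them via read()."""

       def __init__(self):
           self.buffer = b''

       def feed(self, chunk):
           self.buffer += chunk

       def read(self, n):
           data, self.buffer = self.buffer[:n], self.buffer[n:]
           # returning b'' when the buffer is merely empty (rather than
           # finished) makes ijson believe the stream is over, which is
           # exactly the error-prone part
           return data
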
Another example in which ijson doesn't work well
would be the following:
consider a multi-stage, blocking socket-based conversation
between ``A`` and ``B``.
``A`` first receives some JSON content from ``B``,
and after parsing it, it needs to reply to ``B``,
which will send more content
through the same channel of communication,
and so on.
In such a case, one would need to wrap the socket
in a new object that emulates exhaustion of the stream when required,
even though the underlying socket is not exhausted yet.
In both cases what we want to do
is let users *push* data into ijson rather than
having ijson *pull* data.
This is the basis for the new design.
3.x design (push)
=================
In ijson 3.x we completely redesigned the parsing pipeline
to work in a push model rather than a pull model:
chunks of data are now pushed by the user into the pipeline.
This effectively decouples I/O
from the core parsing logic.
In other words, at the low level instead of working like this::

                                                                  +---------+  next()
                                                              +---| kvitems |<-------
   +------+  read() +-------------+ next()  +-------+ next()  |   +---------+
   | file |<--------| basic_parse |<--------| parse |<--------+
   +------+         +-------------+          +-------+          |   +---------+  next()
                           ^                     ^              +---| items   |<-------
                           | next()              | next()           +---------+

the library now works like this::

                                                   +---------+
                                               +-->| kvitems |
              +-------------+     +-------+    |   +---------+
   chunk ---->| basic_parse |---->| parse |----+
              +-------------+     +-------+    |   +---------+
                                               +-->| items   |
                                                   +---------+

Here, ``chunk`` is a piece of data held by the user,
who sends it into ``basic_parse``,
which, upon calculating a new tuple,
sends it to ``parse`` and so on,
depending on which iteration mode was requested.
Now the user is in full control of feeding data into the pipeline
and reacting to its results,
without the library being in charge anymore.
This is implemented using... python generators!
Generators are usually used to yield values,
but since python 2.5 they can also *receive* values from their callers.
This effectively turns them into "coroutines",
very much like the ``asyncio`` coroutines found in python 3.5+
that use the ``async`` and ``await`` syntax.
On that note: why didn't we use *those*?
There are at least two drawbacks to using them directly:
* Using ``asyncio`` coroutines would require users of the library
to work within an ``asyncio`` event loop
when interacting with ijson routines.
As of now, that represents 0% of the current user base,
since so far ijson hasn't offered ``asyncio`` support
out of the box.
Using generator coroutines
doesn't impose any execution context,
giving users the freedom to use them wherever they want.
* Python 2.7 support would be removed.
While python 2.7 is no longer maintained by the Python Software Foundation,
there might still be programs out there using ijson with python 2.7,
and we don't want to break them (yet).
Using generator coroutines we maintain python 2.7 support
in the library core.
For the rest of the text,
"coroutine" then means "plain, generator-based coroutine"
and not "python 3 ``asyncio``-based coroutine",
unless explicitly mentioned.
What do these new coroutines look like?
Firstly, they have new names
to avoid clashing with those
of the current set of generators,
and hence they all end up with a ``_basecoro`` suffix (more on this later).
Apart from this they look fairly similar
to the original generators.
For example, let's see ``basic_parse_basecoro`` and ``parse_basecoro`` in action
(this is not actual code, just a skeleton):

.. code-block:: python

   def basic_parse_basecoro(target, ...):
       while True:
           chunk = (yield)
           event, value = do_some_parsing(chunk)
           target.send((event, value))

   def parse_basecoro(target, ...):
       while True:
           event, value = (yield)
           prefix = calculate_prefix(event, value)
           target.send((prefix, event, value))

The key components are the ``(yield)`` expressions,
which allow coroutines to receive data,
and the ``target.send`` calls,
which are how data is sent into a coroutine.
Moreover, we can chain them again to form a pipeline,
with data being pushed in on one side,
and the appropriate events making it out on the other.
With these changes
the core pipeline now looks like this
(again including internal and external calls)::

                                                                                +------------------+  send()
                                                                            +-->| kvitems_basecoro |------->
       send()  +-----------------------+  send()  +----------------+ send() |   +------------------+
    chunk ---->| basic_parse_basecoro |--------->| parse_basecoro |--------+
               +-----------------------+          +----------------+        |   +------------------+  send()
                   |                                  |                     +-->| items_basecoro   |------->
                   +--> send()                        +--> send()               +------------------+

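
As a self-contained illustration of this send-based protocol
(a toy example, not ijson's actual coroutines),
note in particular the priming ``next()`` calls
needed before a coroutine can receive values:

.. code-block:: python

   def printer():
       # final target: just print whatever it receives
       while True:
           item = (yield)
           print(item)

   def upper(target):
       # intermediate stage: transform the data and push it further down
       while True:
           text = (yield)
           target.send(text.upper())

   sink = printer()
   next(sink)             # advance the coroutine up to its first (yield)
   pipeline = upper(sink)
   next(pipeline)

   for chunk in ['spam', 'eggs']:
       pipeline.send(chunk)   # prints SPAM, then EGGS
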
Backwards-compatibility
-----------------------
Implementing the original generators
on top of this coroutine-based pipeline
can be easily done,
thus retaining backwards-compatibility
for all users.
This is basically how ``parse`` works now:

.. code-block:: python

   class sendable_list(list):
       send = list.append

   def parse(f, ...):
       results = sendable_list()
       coro = parse_basecoro(results)
       coro = basic_parse_basecoro(coro)
       while True:
           chunk = f.read()
           coro.send(chunk)
           for result in results:
               yield result
           del results[:]
           if not chunk:
               break

Or, in other words::

                                                          parse
                    +--------------------------------------------------------------------------------+
   +------+  read() |           +-----------------------+ send() +----------------+ send() +------+  |  next()
   | file |<--------| chunk --->| basic_parse_basecoro |------->| parse_basecoro |------->| list |  |<-------
   +------+         |           +-----------------------+          +----------------+          +------+  |
                    +--------------------------------------------------------------------------------+

The other generators work similarly,
with the corresponding coroutine pipeline constructed
inside each generator.
Support for asyncio
-------------------
Using this new framework,
adding support for ``asyncio`` generators
(i.e., generators that can be used in an ``async for`` loop)
is also trivial.
Now, when running under python 3.5+,
all generators have ``async`` counterparts
ending with an ``_async`` suffix,
which are roughly implemented like this:

.. code-block:: python

   class async_iterable(object):

       def __init__(self, f):
           self.f = f
           self.events = sendable_deque()

       async def __anext__(self):
           data = await self.f.read()
           try:
               self.coro.send(data)
           except StopIteration:
               raise StopAsyncIteration
           return self.events.pop()

   def basic_parse_async(f, ...):
       iterable = async_iterable(f)
       iterable.coro = basic_parse_basecoro(iterable.events)
       return iterable

Or, in other words::

                                                                parse_async
                          +-------------------------------------------------------------------------------------+
   +------+ await read()  |        send() +-----------------------+ send() +----------------+ send() +-------+  |  __anext__()
   | file |<--------------| chunk ------->| basic_parse_basecoro |------->| parse_basecoro |------->| deque |  |<------------
   +------+               |               +-----------------------+          +----------------+          +-------+  |
                          +-------------------------------------------------------------------------------------+

Again, the other async generators work similarly,
with the full corresponding coroutine pipeline
constructed inside the async generator.
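
Usage-wise, these ``*_async`` variants are meant to be consumed
with ``async for``.
A rough sketch, assuming ``basic_parse_async`` is reachable
from the top-level ``ijson`` module
(otherwise it can be taken from the backend in use),
and using a minimal, made-up async wrapper over a regular file:

.. code-block:: python

   import asyncio
   import ijson

   class AsyncReader(object):
       # minimal async wrapper over a regular file, just for this example
       def __init__(self, f):
           self._f = f

       async def read(self, n=-1):
           return self._f.read(n)

   async def main():
       with open('data.json', 'rb') as f:
           async for event, value in ijson.basic_parse_async(AsyncReader(f)):
               print(event, value)

   asyncio.get_event_loop().run_until_complete(main())
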
User-facing coroutines
----------------------
Finally,
it would also make sense to offer users
access to the underlying coroutines,
with users pushing data into them,
and registering a target to receive the results.
The ``*_basecoro`` coroutines, though,
are designed to each work on their own,
and users would have to create them individually
and then chain them together manually,
which can be error-prone.
Instead we also offer
"pre-chained" coroutines for each of the iterators,
which receive chunks of data on one side,
and send the relevant events out to the user-provided coroutine.
These are called ``*_coro``
(which is why the *core* ones are called ``*_basecoro`` instead).
They are roughly implemented like this:

.. code-block:: python

   def parse_coro(target, ...):
       return basic_parse_basecoro(parse_basecoro(target), ...)

Or, in other words::

                                        parse_coro
                 +------------------------------------------------------+
         send()  |  +-----------------------+ send() +----------------+ |  send()
   chunk --------|->| basic_parse_basecoro |------->| parse_basecoro |-|------->
                 |  +-----------------------+          +----------------+ |
                 +------------------------------------------------------+

The other user-facing coroutines
are constructed similarly.
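
A usage sketch, assuming ``parse_coro`` is reachable
from the top-level ``ijson`` module
and that the returned coroutine comes ready to accept data
(the user-provided target below is a plain generator
and is primed manually with ``next``):

.. code-block:: python

   import ijson

   def collect(results):
       # user-provided target coroutine: stash every (prefix, event, value)
       while True:
           results.append((yield))

   results = []
   target = collect(results)
   next(target)                     # prime the plain-generator target

   coro = ijson.parse_coro(target)
   for chunk in [b'{"a": 1, ', b'"b": [2, 3]}']:
       coro.send(chunk)
   coro.close()                     # signal that there is no more data
   print(results)
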
Performance
===========
This is the best part:
performance is still on par with the previous implementation,
and has even been improved as part of this exercise.
The plot below shows a comparison of processing speeds
of each generator
over three different backends
(python, yajl2 and yajl2_c)
for different synthetic test cases.
The other backends should have similar results.
For each generator/backend/test combination,
the old implementation (pure generator)
is compared to the new (coroutine-based generator).
Values are processing speeds in [MB/s],
so higher is better.
All these measurements have been made
using the ``benchmark.py`` tool
found in the git repository of ijson,
so you can give it a try as well.
.. image:: performance_comparison.png
These measurements were run
on an Intel(R) Core(TM) i7-5600U CPU
using python 3.7
(full results: `old.csv <./old.csv>`_, `new.csv <./new.csv>`_).
It can be seen that results vary depending on the backend,
but overall speeds are comparable
to the original implementation.
Special attention was given to the C backend,
as it was, and remains, the fastest of all,
usually by a factor of 10x.
During the port to the new push model,
we made some modifications to its inner workings
to avoid unnecessary data copies
and tuple packing/unpacking where possible,
leading to a noticeable performance improvement
(~25% as the median).
Again, your mileage might vary
depending on the document you are parsing,
but overall these are very good results.
No proper performance comparison has been made yet
on the ``asyncio`` generators offered by ijson,
but early results suggest
there is still work to be done
to fully catch up with the generators' speed.
For instance,
implementing them as native ``async`` generators
(which would require python 3.6+)
instead of classes with ``__aiter__`` and ``__anext__``
would apparently give a boost in speed.
Other strategies could also be investigated
for storing the temporary results
rather than keeping them in a deque.
Finally,
the C backend could see its own implementation
of the ``async`` iterables,
which would probably not be too hard.