iterate_many
==========
When serializing large databases, it is often better to write out many independent JSON
documents, instead of one large monolithic document containing many records. The simdjson
library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
```JSON
{"text":"a"}
{"text":"b"}
{"text":"c"}
"..."
```
... you want to read the entries (individual JSON documents) as quickly and as conveniently as possible. Importantly, the input might span several gigabytes, but you want to use a small (fixed) amount of memory. Ideally, you'd also like to parallelize the processing (using more than one core) to speed things up.
Contents
--------
- [Motivation](#motivation)
- [How it works](#how-it-works)
- [Context](#context)
- [Design](#design)
- [Threads](#threads)
- [Support](#support)
- [API](#api)
- [Use cases](#use-cases)
- [Tracking your position](#tracking-your-position)
- [Incomplete streams](#incomplete-streams)
- [Comma-separated documents](#comma-separated-documents)
- [C++20 features](#c20-features)
Motivation
-----------
The main motivation for this piece of software is to achieve maximum speed and offer a
better quality of life in parsing files containing multiple small JSON documents.
The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a handy
serialization format. However, when serializing a large sequence of
values as an array, or a possibly indeterminate-length or never-
ending sequence of values, JSON may be inconvenient.
Consider a sequence of one million values, each possibly one kilobyte
when encoded -- roughly one gigabyte. It is often desirable to process such a dataset incrementally
without having to first read all of it before beginning to produce results.
How it works
------------
### Context
Before parsing anything, simdjson first preprocesses the JSON text by identifying all structural indexes
(i.e., the starting position of every JSON value, as well as important operators such as `,`, `:`, `]` or
`}`) and validating the UTF-8. This step is referred to as stage 1. During this process, however, simdjson has
no knowledge of whether it has seen a single valid document, multiple documents, or even whether the document is complete.
Then, to iterate through the JSON text during parsing, we use what we call a JSON iterator, which navigates
through the text using these structural indexes. This JSON iterator is not visible to the user, but it is the
key component that makes parsing work.
Prior to iterate_many, most people who had to parse a multiline JSON file would proceed by reading the
file line by line, using a utility function like `std::getline` (or equivalent), and would then call
`parse` on each of those lines. From a performance point of view, this process is highly
inefficient: it requires a lot of unnecessary memory allocation and relies on the
`getline` function, which is fundamentally slow, slower than the act of parsing with simdjson
[(more on this here)](https://lemire.me/blog/2019/06/18/how-fast-is-getline-in-c/).
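For illustration, here is a minimal sketch of that line-by-line pattern using the DOM API. The file name `data.ndjson` is a hypothetical placeholder; each iteration pays for `std::getline`, a string copy with padding, and a fresh parse:
```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include "simdjson.h"

int main() {
  std::ifstream file("data.ndjson"); // hypothetical input: one JSON document per line
  simdjson::dom::parser parser;
  std::string line;
  while (std::getline(file, line)) {            // slow: scans for newlines, copies bytes
    simdjson::dom::element doc;
    auto error = parser.parse(line).get(doc);   // copies the line again into a padded buffer, then parses
    if (error) { std::cerr << error << std::endl; return EXIT_FAILURE; }
    // ... process doc ...
  }
  return EXIT_SUCCESS;
}
```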
Unlike the popular parser RapidJSON, our DOM does not require the buffer once the parsing job is
completed: the DOM and the buffer are completely independent. The drawback of this architecture is
that we need to allocate some additional memory to store our ParsedJson data for every document
inside a given file. Memory allocation can be slow and become a bottleneck; therefore, we want to
minimize it as much as possible.
### Design
To keep the number of allocations to a minimum, we opted for a design where we create only one
parser object, allocate its memory once, and then recycle it for every document in a
given file. But, knowing that documents often vary widely in size, we need to make sure that we
allocate enough memory for all the documents in a batch to fit. This value is what we call the batch size.
As of right now, you need to specify this batch size manually: it has to be at least as
big as the biggest document in your file, but not so big that it overflows the CPU cache.
The bigger the batch size, the fewer allocations we need to make. We found that 1 MB is
a sweet spot (see the sketch after the steps below).
1. When the user calls `iterate_many`, we return a `document_stream` which the user can iterate over
to receive parsed documents.
2. We call stage 1 on the first batch_size bytes of JSON in the buffer, detecting structural
indexes for all documents in that batch.
3. The `document_stream` owns a `document` instance that keeps track of the current document position
in the stream using a JSON iterator. To obtain a valid document, the `document_stream` returns a
**reference** to its document instance.
4. Each time the user calls `++` to read the next document, the JSON iterator moves to the start of the
next document.
5. When we reach the end of the batch, we call stage 1 on the next batch, starting from the end of
the last document, and go to step 3.
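To make the batch size concrete, here is a minimal sketch (the input string and the one-megabyte figure are illustrative assumptions): a single parser is allocated once, an explicit batch size is passed to `iterate_many`, and the same parser is reused for every document.
```cpp
#include <cstdlib>
#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  // Illustrative input; in practice this would be a large NDJSON buffer or file.
  auto json = R"( {"id":1} {"id":2} {"id":3} )"_padded;
  ondemand::parser parser;           // allocated once, recycled for every batch
  ondemand::document_stream stream;
  // Batch size of 1 MB: it must be at least as large as the largest document.
  auto error = parser.iterate_many(json, 1000000).get(stream);
  if (error) { std::cerr << error << std::endl; return EXIT_FAILURE; }
  for (auto doc : stream) {
    int64_t id;
    if (!doc["id"].get(id)) { std::cout << id << std::endl; }
  }
  return EXIT_SUCCESS;
}
```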
### Threads
But how can we make use of threads if they are available? We found a pretty cool algorithm that allows
us to quickly identify the position of the last JSON document in a given batch. Knowing exactly where
the end of the last document in the batch is, we can safely parse through the last document without any
worries that it might be incomplete. Therefore, we can run stage 1 on the next batch concurrently while
parsing the documents in the current batch. Running stage 1 in a different thread can, in the best case,
remove almost all of its cost, replacing it with the overhead of a thread, which is orders of magnitude
cheaper. Ain't that awesome!
Thread support is only active if thread support is detected, in which case the macro
`SIMDJSON_THREADS_ENABLED` is set. You can also manually pass the `SIMDJSON_THREADS_ENABLED=1` flag
to the library. Otherwise, the library runs in single-thread mode.
You should be consistent. If you link against a simdjson library built for multithreading
(i.e., with `SIMDJSON_THREADS_ENABLED`), then you should also build your application with multithreading
support (setting `SIMDJSON_THREADS_ENABLED=1` and linking against a thread library).
A `document_stream` instance uses at most two threads: there is a main thread and a worker thread.
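As a quick sanity check, you can verify at compile time whether the thread-enabled code path is in effect (a minimal sketch; the messages are placeholders):
```cpp
#include <iostream>
#include "simdjson.h"

int main() {
#ifdef SIMDJSON_THREADS_ENABLED
  std::cout << "document_stream may use a worker thread for stage 1" << std::endl;
#else
  std::cout << "document_stream runs in single-thread mode" << std::endl;
#endif
  return 0;
}
```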
Support
-------
Since we want to offer flexibility and not restrict ourselves to a specific file
format, we support any file that contains any number of valid JSON documents, **separated by one
or more characters that are considered whitespace** by the JSON spec. Anything that is
not whitespace will be parsed as a JSON document and could lead to failure.
Whitespace Characters:
- **Space**
- **Linefeed**
- **Carriage return**
- **Horizontal tab**
If your documents are all objects or arrays, then you may even have nothing between them.
E.g., `[1,2]{"32":1}` is recognized as two documents.
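As a sketch of this behavior (using the same `ondemand` API as the examples below), the following parses that input as two documents:
```cpp
#include <cstdlib>
#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  // Two back-to-back documents: an array immediately followed by an object.
  auto json = R"([1,2]{"32":1})"_padded;
  ondemand::parser parser;
  ondemand::document_stream stream;
  auto error = parser.iterate_many(json).get(stream);
  if (error) { std::cerr << error << std::endl; return EXIT_FAILURE; }
  for (auto doc : stream) {
    std::cout << doc.type() << std::endl; // prints: array, then object
  }
  return EXIT_SUCCESS;
}
```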
Some official formats **(non-exhaustive list)**:
- [Newline-Delimited JSON (NDJSON)](https://github.com/ndjson/ndjson-spec/)
- [JSON lines (JSONL)](http://jsonlines.org/)
- [Record separator-delimited JSON (RFC 7464)](https://tools.ietf.org/html/rfc7464) <- Not supported by simdjson!
- [More on Wikipedia...](https://en.wikipedia.org/wiki/JSON_streaming)
API
---
Example:
```cpp
// R"( ... )" is a C++ raw string literal.
auto json = R"({ "foo": 1 } { "foo": 2 } { "foo": 3 } )"_padded;
// _padded returns a simdjson::padded_string instance
ondemand::parser parser;
ondemand::document_stream docs = parser.iterate_many(json);
for (auto doc : docs) {
  std::cout << doc["foo"] << std::endl;
}
// Prints 1 2 3
```
See [basics.md](basics.md#newline-delimited-json-ndjson-and-json-lines) for an overview of the API.
Use cases
---------
From [jsonlines.org](http://jsonlines.org/examples/):
- **Better than CSV**
```json
["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]
```
CSV seems so easy that many programmers have written code to generate it themselves, and almost every implementation is
different. Handling broken CSV files is a common and frustrating task. CSV has no standard encoding, no standard column
separator and multiple character escaping standards. String is the only type supported for cell values, so some programs
attempt to guess the correct types.
JSON Lines handles tabular data cleanly and without ambiguity. Cells may use the standard JSON types.
The biggest missing piece is an import/export filter for popular spreadsheet programs so that non-programmers can use
this format.
- **Easy Nested Data**
```json
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
```
JSON Lines' biggest strength is in handling lots of similar nested data structures. One .jsonl file is easier to
work with than a directory full of XML files.
Tracking your position
-----------
Some users would like to know where the document they parsed is located in the input array of bytes.
It is possible to do so by accessing the iterator directly and calling its `current_index()`
method, which reports the location (in bytes) of the current document in the input stream.
You may also call the `source()` method to get a `std::string_view` instance over the document
and `error()` to check whether there was any error.
Let us illustrate the idea with code:
```cpp
auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;
simdjson::ondemand::parser parser;
simdjson::ondemand::document_stream stream;
auto error = parser.iterate_many(json).get(stream);
if (error) { /* do something */ }
auto i = stream.begin();
size_t count{0};
for (; i != stream.end(); ++i) {
  auto doc = *i;
  if (!i.error()) {
    std::cout << "got full document at " << i.current_index() << std::endl;
    std::cout << i.source() << std::endl;
    count++;
  } else {
    std::cout << "got broken document at " << i.current_index() << std::endl;
    return false;
  }
}
```
This code will print:
```
got full document at 0
[1,2,3]
got full document at 9
{"1":1,"2":3,"4":4}
got full document at 29
[1,2,3]
```
Incomplete streams
-----------
Some users may need to work with truncated streams. The simdjson library may truncate documents at the very end of the stream that cannot possibly be valid JSON (e.g., they contain unclosed strings, unmatched brackets, or unmatched braces). After iterating through the stream, you may query the `truncated_bytes()` method, which tells you how many bytes were truncated. If the stream is made of full (whole) documents, then you should expect `truncated_bytes()` to return zero.
Consider the following example where a truncated document (`{"key":"intentionally unclosed string `) containing 39 bytes has been left within the stream. In such cases, the first two whole documents are parsed and returned, and the `truncated_bytes()` method returns 39.
```cpp
auto json = R"([1,2,3] {"1":1,"2":3,"4":4} {"key":"intentionally unclosed string )"_padded;
simdjson::ondemand::parser parser;
simdjson::ondemand::document_stream stream;
auto error = parser.iterate_many(json,json.size()).get(stream);
if (error) { std::cerr << error << std::endl; return; }
for (auto i = stream.begin(); i != stream.end(); ++i) {
  std::cout << i.source() << std::endl;
}
std::cout << stream.truncated_bytes() << " bytes" << std::endl; // prints 39 bytes
```
This will print:
```
[1,2,3]
{"1":1,"2":3,"4":4}
39 bytes
```
Importantly, you should only call `truncated_bytes()` after iterating through all of the documents since the stream cannot tell whether there are truncated documents at the very end when it may not have accessed that part of the data yet.
Comma-separated documents
-----------
We also support comma-separated documents, but with some performance limitations. The `iterate_many` function takes an option to allow parsing of comma-separated documents (which defaults to false). In this mode, the entire buffer is processed as one batch. Therefore, the total size of the input should not exceed the maximal capacity of the parser (4 GB). This mode also effectively disallows multithreading. It is therefore mostly suitable for inputs that are not very large. In this mode, the `batch_size` parameter
is effectively ignored, as it is set to at least the document size.
Example:
```cpp
auto json = R"( 1, 2, 3, 4, "a", "b", "c", {"hello": "world"} , [1, 2, 3])"_padded;
ondemand::parser parser;
ondemand::document_stream doc_stream;
// We pass '32' as the batch size, but it is a bogus parameter because, since
// we pass 'true' to the allow_comma parameter, the batch size will be set to at least
// the document size.
auto error = parser.iterate_many(json, 32, true).get(doc_stream);
if (error) { std::cerr << error << std::endl; return; }
for (auto doc : doc_stream) {
std::cout << doc.type() << std::endl;
}
```
This will print:
```
number
number
number
number
string
string
string
object
array
```
C++20 features
--------------------
In C++20, the standard introduced the notion of a *customization point*.
A customization point is a function or function object that can be customized for different types. It allows library authors to provide default behavior while giving users the ability to override this behavior for specific types.
A `tag_invoke` function serves as a mechanism for customization points. It is not directly part of the C++ standard library, but it is often used in libraries that implement customization points.
The `tag_invoke` function is typically a generic function that takes a tag type and additional arguments.
The first argument is usually a tag type (often an empty struct) that uniquely identifies the customization point (e.g., deserialization of custom types in simdjson). Users or library providers can specialize `tag_invoke` for their types by defining it in the appropriate namespace (for simdjson, the `simdjson` namespace).
You can deserialize your own data structures conveniently if your system supports C++20.
When that is the case, the macro `SIMDJSON_SUPPORTS_CONCEPTS` will be set to 1 by
the simdjson library.
Consider a custom class `Car`:
```cpp
struct Car {
  std::string make;
  std::string model;
  int year;
  std::vector<float> tire_pressure;
};
```
You may support deserializing directly from a JSON value or document to your own `Car` instance
by defining a single `tag_invoke` function:
```cpp
namespace simdjson {
// This tag_invoke MUST be inside the simdjson namespace
template <typename simdjson_value>
auto tag_invoke(deserialize_tag, simdjson_value &val, Car& car) {
  ondemand::object obj;
  auto error = val.get_object().get(obj);
  if (error) {
    return error;
  }
  if ((error = obj["make"].get_string(car.make))) {
    return error;
  }
  if ((error = obj["model"].get_string(car.model))) {
    return error;
  }
  if ((error = obj["year"].get(car.year))) {
    return error;
  }
  if ((error = obj["tire_pressure"].get<std::vector<float>>().get(
           car.tire_pressure))) {
    return error;
  }
  return simdjson::SUCCESS;
}
} // namespace simdjson
```
Importantly, the `tag_invoke` function must be inside the `simdjson` namespace.
Let us explain each argument of the `tag_invoke` function.
- `simdjson::deserialize_tag`: this is the tag for the Customization Point Object (CPO). You may often ignore this parameter. It is used to indicate that you mean to provide a deserialization function for simdjson.
- `val`: it automatically receives a simdjson value type (document, value, document_reference).
- The third parameter is an instance of the type that you want to support.
Please see our main documentation (`basics.md`) under
"Use `tag_invoke` for custom types (C++20)" for details about
tag_invoke functions.
Given a stream of JSON documents, you can add them to a data structure
such as a `std::vector<Car>` like so if you support exceptions:
```cpp
padded_string json =
R"( { "make": "Toyota", "model": "Camry", "year": 2018,
"tire_pressure": [ 40.1, 39.9 ] }
{ "make": "Kia", "model": "Soul", "year": 2012,
"tire_pressure": [ 30.1, 31.0 ] }
{ "make": "Toyota", "model": "Tercel", "year": 1999,
"tire_pressure": [ 29.8, 30.0 ] }
)"_padded;
ondemand::parser parser;
ondemand::document_stream stream;
[[maybe_unused]] auto error = parser.iterate_many(json).get(stream);
std::vector<Car> cars;
for (auto doc : stream) {
  cars.push_back((Car)doc); // an exception may be thrown
}
```
Otherwise you may use this longer version for explicit handling of errors:
```cpp
std::vector<Car> cars;
for (auto doc : stream) {
  Car c;
  if ((error = doc.get<Car>().get(c))) {
    std::cerr << simdjson::error_message(error) << std::endl;
    return EXIT_FAILURE;
  }
  cars.push_back(c);
}
```