High level overview of the basic ASDF library
=============================================

This document is an attempt to make it easier to understand the design and
workings of the python asdf library for those unfamiliar with it. This is
expected to grow organically so at the moment it should not be considered
complete or comprehensive.

Understanding the design is complicated by the fact that the library
effectively inserts custom methods or classes into the objects that
the pyyaml and jsonschema libraries use. Understanding what is going on
thus means having some understanding of the relevant parts of the
internals of both of those libraries. This overview will try to provide
a small amount of context for these packages to illuminate how the code
in asdf interacts with them.

There are at least two ways of outlining the design. One is to give high level
overviews of the various modules and how they interact with other modules. The
other is to illustrate how code is actually invoked in common operations, this
often being much more informative on a practical level (at least some find that to
be the case). This document will attempt to do both.

We will start with a high-level review of concepts and terms and point to where
these are handled in the asdf modules.

Because of the complexity, this initial design overview will focus on issues of
validation and tree construction when reading.

Construction in progress
------------------------

Before we get into further details, a word on the transition to new plugin APIs.
Starting in asdf 2.8 we've introduced new interfaces for extending the asdf
library to support additional tags and schemas.  The interfaces were redesigned
with the following goals in mind:

- Simplify the connection between tags and their schema content.  The old
  "resolver" system involves sending the tag URI through a lengthy series of
  transformations to get the filesystem path to the schema document.  This has
  been error-prone and difficult to troubleshoot, so the new "resource mapping"
  system explicitly maps schema URIs to their content, and tag URIs directly
  to schema URIs.

- Make it easier to separate schemas from extension code.  Until now the schemas
  have always been provided by the same Python package that implements support
  for their tags, but we would like to move the schemas to language-agnostic
  repositories that non-Python implementations can use.  To better support this,
  the new interface splits the old extension plugin into two new plugins, one
  of which is dedicated to schemas.

- Allow tag serialization support to handle arbitrary sets of URIs.  Previously
  tag code was restricted to working with tag URIs that were identical
  except for version.  This presented a problem for the transition of URIs
  from stsci.edu to asdf-format.org, so the new interface allows for supporting
  diverse URIs with the same code.

- Improve the terminology used in the tag serialization support classes.  The
  old ``ExtensionType`` has been renamed ``Converter`` to indicate its purpose,
  and to eliminate the ambiguity between YAML types and Python types.  The
  ``to_tree`` and ``from_tree`` methods have been renamed ``to_yaml_tree`` and
  ``from_yaml_tree`` to better indicate which tree they're expected to convert.

- Simplify the code and behavior of tag classes.  Converters are used as instances
  instead of classes with a custom metaclass, Python sub-types are no longer
  automatically handled, URIs are treated as single values instead of broken
  down into various components, etc.

You can witness the gory details of this effort by clicking through the PR links
on the asdf 2.8.0 `roadmap <https://github.com/asdf-format/asdf/wiki/Roadmap#280>`_.

Support for ASDF core tags has not yet been moved to the new system.  Doing so
would be a breaking change for users who subclass that code, so we'll need
to wait until asdf 3.0 to do that.

Some terminology and definitions
--------------------------------

**URI vs URL.** A URI (Uniform Resource Identifier) is distinguished from a URL
(Uniform Resource Locator) primarily in that a URI is a mechanism for a unique
name that follows a particular syntax, but may not itself indicate where the
resource is located. Generally URLs are expected to be used on the web with the
HTTP protocol, though for asdf this isn't necessarily the case, as mentioned next.
Recent changes to the library permit use of URIs with the ``asdf://`` scheme, which
is intended to reduce confusion over the distinction between identifiers
and locations.

**Resolver:** Tools to map URIs and tags into actual locations of schema files,
which may be local directories (the usual approach) or an actual URL for
retrieval over the network. This is more complicated than it may seem, for
reasons explained later.  The resolver system has been deprecated in favor
of resource mappings; new code should use the latter instead.

**Global config:** A global library configuration feature that was added in
asdf 2.8.  Allows plugins to be added or removed at runtime and ``AsdfFile``
defaults to be modified by the user.  Accessed by calling the ``get_config``
function in the top-level ``asdf`` module.  For example, the default ASDF Standard
version for new files can be set like this::

    asdf.get_config().default_version = "1.3.0"

Or a resource mapping plugin added at runtime like this::

    asdf.get_config().add_resource_mapping({"http://somewhere.org/resources/foo": b"foo resource content"})

**Entry point:** A Python packaging feature that allows asdf to use plugins
provided by other packages.  Entry points are registered when a package is
installed and become available to asdf without any additional effort on
the part of the user.  See :ref:`pypa-packaging:entry-points`
for more information.
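
For example, a hypothetical package might register its plugins in ``setup.cfg``
under the entry point groups that asdf looks up (the package and module names
below are made up)::

    [options.entry_points]
    asdf.resource_mappings =
        asdf_foo = asdf_foo.integration:get_resource_mappings
    asdf.extensions =
        asdf_foo = asdf_foo.integration:get_extensions

Each referenced function is expected to return a list of plugin instances.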

**Resource mapping:** An asdf plugin that provides access to "resources", which
are binary blobs associated with a URI.  These resources are mostly schemas,
but any resource may be provided by a mapping.  Resource mappings are provided
via entry points or added at runtime using a method on the global config object.
This feature is intended to replace the deprecated "resolver" mechanism.
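
Any object that supports the ``Mapping`` interface can serve as a resource
mapping.  As a sketch (the directory path and URI prefix are hypothetical), a
directory of schema files can be exposed with the
``asdf.resource.DirectoryResourceMapping`` helper::

    import asdf
    from asdf.resource import DirectoryResourceMapping

    # Map the .yaml files under /path/to/schemas to URIs
    # beginning with the given (made-up) prefix.
    mapping = DirectoryResourceMapping(
        "/path/to/schemas", "asdf://example.com/schemas/"
    )
    asdf.get_config().add_resource_mapping(mapping)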

**Extension:** An extension to the ASDF Standard that defines additional
YAML tags.  In the future an extension may include other additional features
such as binary block compressors or filters, but currently only tags
are supported.

**Extension implementation:** An asdf plugin that implements an extension
to the ASDF Standard.  This is the asdf library's support for an extended
set of YAML tags.  The library currently provides two interfaces for
implementing extensions: the ``AsdfExtension`` class and the
new, still-experimental ``Extension`` class.  Extension implementations are
provided via entry points or added at runtime using a method on the global
config object.  The ``AsdfFile`` also permits adding additional extensions
on a per-instance basis, but use of that feature is discouraged and may be
removed in asdf 3.0.

**Tag code/tag class:** A class responsible for converting a family of tags
into Python objects and vice versa.  Each extension implementation includes
a list of such classes.  For the original ``AsdfExtension`` API, the tag
classes all implement the ``ExtensionType`` interface.  For the new API,
tag classes implement ``Converter``.
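
As a concrete sketch of the new API (the ``asdf://example.com/...`` URIs are
placeholders), a ``Converter`` and the ``Extension`` that registers it might
look like this::

    import fractions

    from asdf.extension import Converter, Extension

    class FractionConverter(Converter):
        tags = ["asdf://example.com/tags/fraction-1.0.0"]
        types = [fractions.Fraction]

        def to_yaml_tree(self, obj, tag, ctx):
            # decompose the Python object into a YAML-friendly node
            return {"numerator": obj.numerator,
                    "denominator": obj.denominator}

        def from_yaml_tree(self, node, tag, ctx):
            # rebuild the Python object from the node
            return fractions.Fraction(node["numerator"],
                                      node["denominator"])

    class FractionExtension(Extension):
        extension_uri = "asdf://example.com/extensions/fraction-1.0.0"
        tags = ["asdf://example.com/tags/fraction-1.0.0"]
        converters = [FractionConverter()]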

**Validator:** Tool to confirm that the YAML conforms to the schemas that
apply. A lot goes on in this area and it is pretty complex in the
implementation.

**Tree building:** The YAML content is built into a tree in two stages. The YAML
parser converts the raw YAML into a custom Python structure. It is that
structure that is validated. Then, if no errors are found, the tree is
converted into a tree where tagged nodes are replaced by the corresponding
Python objects, e.g., a WCS object or a numpy array (well, not quite that
simply for numpy arrays). An option exists to prevent this conversion, which
is useful for some applications.

The above is a simplified view of what happens when an ASDF file is read.

Most of the resolver tools and code are in ``resolver.py`` (but not all).

Most of the validation code is in ``schema.py``.

The code that builds the trees is spread across many places: ``tagged.py``,
``treeutil.py``, and ``types.py``, as well as all the extension code that
supplies handling for the tags within (and often the associated schemas).

A note on the location of schemas and tag code: there is some tension here,
since schemas should be language-agnostic and, in that view, not bundled with
language-specific library code. But currently nearly all of the
implementation is in Python, so while the long-term goal is to keep them
separate, it is more convenient to keep them together for now. You will see
cases where they are separate and some where they are bundled.  The introduction
of a separate plugin for providing access to schemas (the "resource mapping")
is intended to allow extension authors to keep the schema documents in a separate
language-neutral repository.

Actions that happen when an AsdfFile is instantiated
----------------------------------------------------

The asdf plugins (new and old-style extensions as well as resource mappings)
registered as entry points can be obtained by calling methods in ``entry_points.py``.
These methods are invoked by ``config.AsdfConfig`` the first time the library needs to
use the plugins, and thereafter are cached within that config object.  Both
extensions and resource mappings are stored wrapped in proxy objects (``ExtensionProxy``
and ``ResourceMappingProxy``, respectively) that carry additional metadata
like the package name and version of the entry point, and add some convenience
methods on top of what the extension developer provides.  Additionally, ``ExtensionProxy``
allows the library to treat both new-style ``Extension`` instances and old-style
``AsdfExtension`` instances similarly.

To see the list of extensions loaded by the library, call ``asdf.get_config().extensions``.
To see the list of resource mappings, call ``asdf.get_config().resource_mappings``.
Both of these properties are lazy-loaded and then cached, so the first call will take
a moment to complete but subsequent calls will return immediately.
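
For example (a sketch; the output depends on which packages are installed)::

    import asdf

    config = asdf.get_config()
    for extension in config.extensions:
        # each ExtensionProxy records the package that provided it
        print(extension.package_name, extension.package_version)
    print(len(config.resource_mappings))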

When an ``AsdfFile`` class is instantiated, one thing that happens on the
``__init__`` is that ``self._process_plugin_extensions()`` is called.  This method
retrieves the extensions from the global config and selects those that
are compatible with the ``AsdfFile``'s ASDF Standard version.  It returns the
resulting list, which is assigned to the ``_plugin_extensions`` variable.  The
term "plugin extensions" contrasts with "user extensions" which are additional
extensions provided by the user as an argument to ``AsdfFile.__init__``.

The extension lists are used by ``AsdfFile`` to create the file's ``ExtensionList``
and ``ExtensionManager`` instances, which manage extensions for the old and
new extension APIs, respectively.  These instances are created lazily when
the ``extension_list`` and ``extension_manager`` properties are first accessed,
to help speed up the initial construction of the ``AsdfFile``.

The ``extension_manager`` is responsible for mapping tag URIs to schema URIs
for validation and retrieving type converters (instances of the ``Converter`` interface)
by Python type or by YAML tag URI.  ``extension_list`` handles the same duties,
but for old-style extensions.  ``extension_manager`` takes precedence over
``extension_list`` throughout the asdf library, so ``extension_list`` will
only be consulted if ``extension_manager`` can't handle a particular tag
or Python type.

On the subject of resolvers and tag/url mapping
-----------------------------------------------

The ``AsdfFile`` class has ``tag_mapping`` and ``url_mapping`` properties
that each return the ``extension_list`` properties of the same name.  These
objects implement the original support for mapping tag URIs to schema content
that in the new API is provided by resource mappings.

``tag_mapping`` and ``url_mapping`` are each ``resolver.Resolver`` instances
that are generated from the mapping lists in the old-style extensions. These
lists consist of 2-tuples. In the first case, the tuples provide a mechanism
to map a tag string to a url string, typically with an expected prefix or
suffix to the tag (suffix is typical), so that given a full tag, it generates
a url that includes the suffix. This permits one mapping to cover many tag
variants. (The details of the mapping machinery, with examples, are given in a
later section, since understanding this is essential to defining new tags and
corresponding schemas.)

The url mapping works in a similar way, except that it consists of 2-tuples
where the first element is the common part of the url, and the second element
maps it to an actual location (a url or file path). Again, the second element
may include a placeholder for the suffix or prefix, and code to generate the
path to the schema file.

The ``Resolver`` object turns these lists into functions: supplied with input
that matches something in the list, it returns the corresponding output.
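
As a hedged sketch (the tag prefix, url prefix, and schema directory are all
hypothetical), the two mapping lists in an old-style extension might look like
this::

    from asdf import util

    # tag URI -> schema URL; {tag_suffix} stands in for the rest of the tag
    tag_mapping = [
        ("tag:example.com:foo", "http://example.com/schemas/foo{tag_suffix}")
    ]

    # schema URL prefix -> local file path; {url_suffix} is the remainder
    url_mapping = [
        ("http://example.com/schemas/foo/",
         util.filepath_to_url("/path/to/schemas") + "/{url_suffix}.yaml")
    ]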

Outline of how an ASDF file is opened and read into the corresponding Python object.
------------------------------------------------------------------------------------

The starting point can be found in ``asdf.py``, essentially through the
following chain (many calls and steps are left out to keep it simpler to
follow).

When ``asdf.open("myasdffile.asdf")`` is called, it is aliased to
``asdf.open_asdf``, which first creates an instance of ``asdf.AsdfFile``
(let's call the instance ``af``), then calls ``af._open_impl()`` and then
``af._open_asdf()``. That invokes a call to ``generic_io.get_file()``.

``generic_io.py`` basically contains code to handle all the variants of I/O
possible (files, streaming, http access, etc.). In this case it returns a
``RealFile`` instance that wraps a local file system file.

Next the file is examined to see if it is an ASDF file (first by examining the
first few lines in the header). If it passes those checks, the header (yaml)
section of the file is extracted through a proxy mechanism that signals an end
of file when the end of the yaml is reached, but otherwise looks like a file
object.

The yaml parsing phase described below normally returns a "tagged tree". That
is (somewhat simplified), it returns the data structure that yaml would
normally return without any object conversion (i.e., all nodes are either
dicts, lists, or scalar values), except that the nodes are now objects that
support a tag attribute indicating whether a tag was associated with that
node and what the tag was.
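
As an illustration (``tagged.py`` is an internal module, so treat the details
as a sketch), a tagged node behaves like the underlying container but
remembers its tag::

    from asdf.tagged import TaggedDict

    node = TaggedDict({"numerator": 3, "denominator": 4},
                      "asdf://example.com/tags/fraction-1.0.0")
    node["numerator"]  # 3 -- behaves like a plain dict
    node._tag          # the YAML tag recorded for this node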

This reader object is passed to the yaml parser by calling
``yamlutil.load_tree``. A simple explanation for what goes on here is necessary
to understand how this all works. Yaml supports various kinds of loaders. For
security reasons, the "safe" loader is used (note that both C and Python
versions are supported through an indirection: the ``_yaml_base_loader``,
defined at the beginning of that module, is chosen based on whether the C
version is available). The loaders are recursive mechanisms that build the
tree structure. Note that ``yamlutil.load_tree`` creates a temporary subclass
of ``AsdfLoader`` and attaches a reference to the ``AsdfFile`` instance as the
``.ctx`` attribute of that temporary subclass.

One of the hooks that pyyaml supplies is the ability to overload the method
``construct_object``. That's what the class ``yamlutil.AsdfLoader`` does. pyyaml
calls this method at each node in the tree to see if anything special should be
done. One could perform conversion to predefined objects here, but instead it
does the following: it checks whether the ``node.tag`` attribute is handled by
yaml itself (e.g., the standard tags such as ``tag:yaml.org,2002:str``); if so,
it calls that constructor, which returns the type yaml converts the node to.
Otherwise:

 - it converts the node to the type indicated (dict, list, or scalar type) by
   yaml for that node.
 - it obtains the appropriate tag class (an ``AsdfType`` subclass) from the
   ``AsdfFile`` instance (using ``ctx.type_index.fix_yaml_tag`` to deal with
   version issues and match the most appropriate tag class).  The new extension
   API does not support this "fix YAML tag" feature, so the file's
   ``ExtensionManager`` is not used here.
 - it wraps all the node alternatives in a special asdf ``Tagged`` class
   instance variant, where that object carries a ``._tag`` attribute recording
   the YAML tag associated with the node.

The loading process returns a tree of these ``Tagged`` object instances. This
tagged tree is then returned to the ``af`` instance (still running the
``_open_asdf()`` method), and the tree is passed to the ``_validate()`` method.
(This is the major reason that the tree isn't directly converted to an object
tree: jsonschema would not be able to use the final object tree for
validation, besides issues related to the fact that things that don't validate
may not be convertible to the designated object.)

The validation machinery is a bit confusing since there are essentially two
basic approaches to how validation is done: one for validating schema files
themselves, and the other for validating tagged nodes against the schemas for
their tags.

The ``schema.py`` file is fairly involved and the details are covered elsewhere.
When the validator machinery is constructed, it uses the fundamental validation
files (schemas). But this doesn't handle the fact that the file being validated
is yaml, not json, and that there are items in yaml that are not part of json,
so special handling is needed. The way it is handled is through an internal
mechanism of the jsonschema library: there is a method, ``iter_errors``, that
jsonschema calls recursively for a validator. The subclass of the jsonschema
validator class is defined as ``schema.ASDFValidator``, and this method is
overloaded in that class. Despite its name, its primary purpose is to validate
the special features that yaml has, namely applying schemas associated with
tags (this is not part of the normal jsonschema scheme [ahem]). It is in this
method that it looks for a tag for a node and, if the tag exists and is in the
tag index, loads the appropriate schema and applies it to the node.
(jsonschemas are normally associated with a whole json entity rather than with
specific nodes.) While the purpose of this method is to iteratively handle
errors that jsonschema detects, it has essentially been repurposed as the
means of interjecting the handling of tag schemas.

To prevent repeated loading of the same schema, the lru caching scheme from
``functools`` in the standard library is used, where the last n loaded schemas
are kept (details of how this works were recently changed to prevent a serious
memory leak).
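
The stdlib mechanism itself looks like this (a generic sketch, not the exact
decoration used in ``schema.py``)::

    import functools

    @functools.lru_cache(maxsize=128)
    def load_schema_cached(url):
        print(f"actually loading {url}")  # runs once per distinct url
        return {"id": url}                # stand-in for the parsed schema

    load_schema_cached("http://example.com/schemas/foo-1.0.0")
    load_schema_cached("http://example.com/schemas/foo-1.0.0")  # cache hit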

In any event, a lot is going on behind the scenes in validation and it deserves
its own description elsewhere.

After validation, the tagged tree is passed to
``yamlutil.tagged_tree_to_custom_tree()``, where the nodes in the tree that
have special tag code are converted into the appropriate Python objects that
the base asdf and extensions are aware of. This is accomplished by that
function defining a walker "callback" function (defined within that function
so as to pick up the ``af`` object through its closure). The function then
passes the callback walker to ``treeutil.walk_and_modify()``, where the tree
is traversed recursively, applying the tag code associated with each tag to
the more primitive tree representation and replacing such nodes with Python
objects. The tree traversal starts from the top, but the objects are created
from the bottom up due to recursion (well, not quite that simple).

Understanding how this works is described more fully later on.

The result is what ``af.tree`` is set to, after doing another tree traversal
looking for special type hooks for each node. It isn't clear whether there is
yet any use of that feature.

Not quite that simple
---------------------

Outline of schema.py
--------------------

This module is somewhat confusing due to the many functions and methods with
some variant of "validate" in their name. This section will try to make clear
what they do (a renaming of these may be in order).

Here is a list of the functions/classes in ``schema.py``, their purpose, and
where they sit in the order of things.

default_ext_resolver

**_type_to_tag:** Handles mapping python types to yaml_tags, with the addition
of support for OrderedDicts.

The next 5 functions are put in the ``YAML_VALIDATORS`` dictionary, ultimately
to be used by ``_create_validator`` to create the json validator object.

------

**validate_tag:** Obtains the relevant tag for the supplied instance (either a
built-in or a custom object) and checks that it matches the tag supplied to
the function.

**validate_propertyOrder:** Not really a validator, but rather a trick to
indicate that properties should retain their order.

**validate_flowStyle:** Not really a validator, but rather a trick to store
what style to use when writing the elements (for yaml objects and arrays).

**validate_style:** Not really a validator, but rather a trick to store info
on what style to use when writing the string.

**validate_type:** Used to deal with date strings.

(It may make sense to rename the above to be more descriptive of the action
than of where they are stuck in the validation machinery; e.g.,
``set_propertyOrder``.)

**validate_fill_default:** Sets the default values for all properties that
have a subschema that defines a default. Called indirectly in
``fill_defaults``.

**validate_remove_default:** Does the opposite: removes all properties whose
value equals the subschema default. Called indirectly in ``remove_defaults``.
(For this and the above, "validate" in the name mostly confuses, although they
are used by the json validator; these could be renamed as well since they do
more than validate.)


**_create_validator:** Creates an ``ASDFValidator`` class on the fly that uses
the ``jsonschema.validators`` class created. This ``ASDFValidator`` class
overrides the ``iter_errors`` method that is used to handle yaml tag cases
(using the ``._tag`` attribute of the node to obtain the corresponding schema
for that tag; e.g., it calls ``load_schema`` to obtain the right schema when
called for each node in the jsonschema machinery). What isn't clear is why
this is done on the fly rather than simply cached, since it really only
handles two variants of calls (basically, which JSONSCHEMA version is to be
used); otherwise it doesn't appear to vary. Admittedly, this is only created
at the top level. This is called by ``get_validator``.

**class OrderedLoader:** Inherits from the ``_yaml_base_loader``, but
otherwise adds nothing new in the definition. The code that follows defines
``construct_mapping`` and then adds it as a method.

**construct_mapping:** Defined outside the ``OrderedLoader`` class, but added
to the ``OrderedLoader`` class by use of the base class ``add_constructor``
method. This function flattens the mapping and returns an ``OrderedDict`` of
the property attributes. (This needs some deep understanding of how the yaml
parser actually works, which is not covered here. Apparently mappings can be
represented as nested trees as the yaml is originally parsed. Or something
like that.)

**_load_schema:** Loads json or yaml schemas (using the ``OrderedLoader``).

**_make_schema_loader:** Defines the function ``load_schema`` using the
provided resolver and ``_load_schema``.

**_make_resolver:** Sets the schema loader for the http, https, file, and tag
access methods, using a dictionary where these access methods are the keys and
the values are schema loaders returning only the schema (and not the uri).
These all appear to use the same schema loader.

**_load_draft4_metaschema:**

**load_custom_schema:** Deals with custom schemas.

**load_schema:** Loads a schema from the specified location (this is cached).
Called for every tag encountered (uses the resolver machinery). Most of the
complexity is in resolving json references. Calls ``_make_schema_loader``,
``resolver``, ``reference.resolve_fragment``, and ``load_schema``.

**get_validator:** Calls ``_create_validator``.  Called by ``validate`` to
return the created validator.

**validate_large_literals:** Ensures the tree has no large literals (raises an
error if it does).

**validate:** Uses ``get_validator`` to get a validator object and then calls
its validate method, and validates any large literals using
``validate_large_literals``.

**fill_defaults:** Inserts missing attributes with their default values.

**remove_defaults:** Where the tree has attributes with value equal to the
default, strip the attribute.

**check_schema:** Checks schema against the metaschema.
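
The public entry points above can be exercised directly; for example, loading
one of the core schemas by URI and checking it against the metaschema::

    import asdf.schema

    s = asdf.schema.load_schema("http://stsci.edu/schemas/asdf/core/asdf-1.1.0")
    asdf.schema.check_schema(s)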

---------------

**Illustration of where these are called:**

``af._open_asdf`` calls ``af.validate``, which calls ``af._validate``, which
then calls ``schema.validate`` with the tagged tree as the first argument (it
can be called again if there is a custom schema).

**in schema.py**

``validate -> get_validator -> _create_validator`` (returns ``ASDFValidator``).
There are two levels of validation: that passed to the json validation
machinery for the schemas themselves, and that which the tag machinery
triggers when the jsonschema validator calls through ``iter_errors``. The
first level handles all the tricks at the top. The ``ASDFValidator`` uses
``load_schema``, which in turn calls ``_make_schema_loader``, then
``_load_schema``. ``_load_schema`` uses the ``OrderedLoader`` to load the
schemas.

Got that?

How the ASDF library works with pyyaml
--------------------------------------

A Tree Identifier
.................

There are three flavors of trees in the process of reading ASDF files, one
will see many references to each in the code and description below.

**pyyaml native tree.** This consists of standard Python containers like dict
and list, and primitive values like string, integer, float, etc.

**Tagged tree.** These are similar to pyyaml native trees, but with the basic
types wrapped in a class that has an attribute identifying the tag associated
with that node, so that later processing can apply the appropriate conversion
code to produce the final Python object.

**Custom tree**. This is a tree where all nodes are converted to the
destination Python objects. For example, a numpy array or GWCS object.

Brief overview of how pyyaml constructs a Python tree
.....................................................

Understanding the process of creating Python objects from yaml requires some
understanding of how pyyaml works. We will not go into all the details of
pyyaml, but instead concentrate on one phase of its loading process. First
an outline of the phases of processing that pyyaml goes through in loading
a yaml file:

1. **scanning:** Converting the text into lexical tokens. Done in
   ``scanner.py``.
#. **parsing:** Converting the lexical tokens into parsing events. Done in
   ``parser.py``.
#. **composing:** Converting the parsing events into a tree structure of
   pyyaml objects. Done in ``composer.py``.
#. **loading:** Converting the pyyaml tree into a Python object tree. Done in
   ``constructor.py``.

We will focus on the last step since that is where asdf integrates with how
pyyaml works.

The key object in that module is ``BaseConstructor`` and its subclasses (asdf
uses ``SafeConstructor`` for security purposes). Note that the pyyaml code is
severely deficient in docstrings and comments. The key method that kicks
off the conversion is ``construct_document()``. Its responsibilities are to call
the ``construct_object()`` method on the top node, "drain" any generators
produced by construction (more on this later), and finally reset internal
data structures once construction is complete.

The actual process seems somewhat mysterious, because what is going on is that
generators are used in place of vanilla code to construct the children of
mutable items. The general scheme is that each constructor for mutable
elements (see, as an example, the ``SafeConstructor.construct_yaml_seq()``
method) is written as a generator that is expected to be asked for a value
twice. The first value returned is an empty object of the expected type (e.g.,
an empty dict or list); when asked a second time, it populates the previously
returned object (and returns None, which is not used). (In rare exceptions,
when called with ``deep=True``, it does immediately populate the child nodes.)
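
For instance, pyyaml's mapping constructor is essentially the following
(paraphrased from pyyaml's ``constructor.py``)::

    def construct_yaml_map(self, node):
        data = {}                             # the empty container comes first,
        yield data                            # so references can point at it
        value = self.construct_mapping(node)  # children built on the 2nd pass
        data.update(value)                    # populate the container in place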

Normally the generator is appended to the loader's ``state_generators``
attribute (a list) for later use. Any generators not handled in the
recursive chain are handled when ``construct_object`` returns to
``construct_document``, which iteratively asks each generator to complete
populating its referenced object. Since that step of populating the object
may in turn create new generators on the ``state_generators`` list, it only
stops when no more generators appear on the list.

Why is this done? One reason is to handle references (anchors and aliases)
that may be circular.

Suppose one had the following yaml source::

    A: &a
        x: 1
        B:
            item1: 42
            item2: life, the universe, and everything
        circular: *a

Without generators, it would not be possible to handle this case since the node
identified by anchor ``a`` has not been fully constructed when pyyaml encounters
a reference to that anchor among the same node's descendants. The use
of the generator allows creation of the container object to reference
to before it is populated so that the above construction will work when
constructing the tree. To follow the above example in more detail, the
construction creates a dictionary for ``a`` and then returns to the
``construct_document()`` method, which then starts handling the generators put on
the list (there is only one in this case). The generator then populates
the contents of ``a``. For the attribute ``B`` it encounters a new
mutable container, and puts its generator on the list to handle, and then
makes a reference to ``a`` which now is defined. One last time it
handles the generator for ``B`` and since each item in that is not
a container, the construction completes.

Pyyaml tracks pending objects in a dict of recursive objects and throws an
exception if generators fail to handle reference cycles. (The conversion of
the tagged tree to the custom tree, performed later, does not use the same
technique; this is explained later.)
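
The effect is easy to observe with plain pyyaml (a small demonstration,
independent of asdf)::

    import yaml

    source = """
    A: &a
        x: 1
        circular: *a
    """
    tree = yaml.safe_load(source)
    assert tree["A"]["circular"] is tree["A"]  # the cycle survives loading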

How ASDF hooks into pyyaml construction
.......................................

ASDF makes use of this by adding generators to the process: it defines a new
construct method, ``construct_undefined()``, that handles all ASDF tag cases.
This is added to the pyyaml dict of construct methods under the key ``None``.
When pyyaml doesn't find a tag, that is the key it uses to handle unknown
tags; thus construction is redirected to ASDF code. That code returns a
generator in the case of mutable ASDF objects, in line with how yaml works
with mutable objects.
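
The registration mechanism can be demonstrated with plain pyyaml (a sketch;
asdf's actual ``construct_undefined`` also wraps nodes in ``Tagged`` instances
and handles sequences and scalars)::

    import yaml

    class MyLoader(yaml.SafeLoader):
        pass

    def construct_unknown(loader, node):
        # only handles mappings, for brevity
        data = {}
        yield data
        data.update(loader.construct_mapping(node))

    # the key None makes this the fallback for all unrecognized tags
    MyLoader.add_constructor(None, construct_unknown)

    print(yaml.load("!example/thing {a: 1}", Loader=MyLoader))  # {'a': 1}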

Historical note: Versions older than 2.6.0 did not work this way. Instead,
those versions completely replaced the pyyaml method ``construct_object()`` with
their own version that did not use generators as pyyaml did.

How conversion to ASDF objects is done
......................................

The current means of conversion is simpler for tag code to use, but also more
subtle in how it actually works (for many, that means harder ;-)

The YAML loading process produces a tagged tree of basic Python types.
The conversion of these into ASDF types is kicked off when the ``AsdfFile``
method ``_open_asdf()`` calls ``yamlutil.tagged_tree_to_custom_tree()``.
This function defines a walker function that is to be used with
``treeutil.walk_and_modify()``. Most of what the walker function does is
handle tag issues (e.g., can the tag be appropriately mapped to the
tag creation code) and then returns the appropriate ASDF type by calling
``tag_type.from_tree_tagged()``.

A note on tree traversal. One can traverse a tree in three ways:
inorder, preorder, and postorder (``asdf.info()`` uses a breadth-first
traversal, yet another exciting option, which we won't describe here).
These respectively mean whether nodes are visited in the horizontal ordering
of the nodes as displayed on a graph (inorder), descending the tree from the
root, doing the left node before the right node (preorder), or from the
bottom up, doing both leaf nodes before the parent node (postorder). In
generating the pyyaml tree, preorder works, since it builds the tree from the
root as one would expect in constructing a tree. But in converting the
tagged tree into the custom tree, postorder is the natural course, where
the children are generated first so that the parent node can refer to
the final objects.

An important part of this conversion process is handled by an instance
of the class ``treeutil._TreeModificationContext``. This class does much the
same trick that pyyaml does with generators. Although pyyaml creates
references between basic python objects, these references must be
converted to references between ASDF objects, and doing so requires
a similar mechanism for building the ASDF objects. The
``_TreeModificationContext`` object (hereafter context object)
holds the incomplete generators in a way similar to the pyyaml
``construct_document`` function.

There are differences, though. The ``_TreeModificationContext`` class provides
methods to indicate if nodes are pending (i.e., incomplete), and there
is a special value ``PendingValue`` that is a signal that the node hasn't
been handled yet (e.g., it may be referencing something yet to be done).
If ``PendingValue`` persists to the end, it indicates a failure to handle
circular references in the tag code. This approach was taken because
one of the earlier prototype implementations did something like this,
passing dict and list subclasses that would throw an exception if a
``PendingValue`` element was accessed.  That would have been more friendly
to extension developers, but it was discarded because it wasn't thought worth
turning all those high-performance containers into slower asdf
subclasses.  We may want to revisit this if we decide to implement
a tree that tracks "dirty" nodes and only writes to disk those that
have changed, since in that case we'll need custom container subclasses
anyway.  We could also consider writing our own dict/list subclass in C
so we could have our cake and eat it too.

The ``walk_and_modify`` code handles the case where the tag code returns
a generator instead of a value. This generator is expected to be a
similar kind of generator to what pyyaml uses, but differing in that instead
of returning an empty container object it will populate whatever elements
it can complete (e.g., all non-mutable ones), and complete the
population of all the mutable members on the second iteration
(which may, in turn, generate new generators for mutable elements
contained within). When it detects a generator, the ``walk_and_modify``
code retrieves the first yielded value, then saves the generator in the
context. When the
top level of the context is reached (it handles nesting by indicating
how many times it has been entered as a context), it starts "draining"
the saved generators by doing the second iteration on them. Like
pyyaml, this second iteration may produce yet more generators that
get saved, and thus keeps iterating on the saved generators until none
are left.
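
A sketch of what such generator-style tag code can look like, driven directly
through ``walk_and_modify`` (``LinkedThing`` is a hypothetical custom type)::

    from asdf import treeutil

    class LinkedThing:
        target = None

    def _convert_linked(node):
        obj = LinkedThing()
        yield obj                    # hand back the shell so references resolve
        obj.target = node["target"]  # the final value is available on pass two

    def callback(node):
        if isinstance(node, dict) and "target" in node:
            return _convert_linked(node)  # a generator: saved in the context
        return node

    tree = treeutil.walk_and_modify({"thing": {"target": 42}}, callback)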

It is not possible to construct reference cycles in immutable
objects within pure Python code, and thus the generators are only needed
for mutable constructs (e.g., dicts and lists).

Historical note: versions of the ASDF library prior to 2.6.0 required
tag code when converting from a tagged object to a custom object to
call ``tagged_tree_to_custom_tree`` on any values of attributes that may be
arbitrarily nested objects. That no longer is needed with the latest code
since any attribute that contains a mapping or sequence object automatically
uses a generator, so population of that attribute is automatically
deferred until the context is exited. Thus there is no need to explicitly
call a function to populate it.

More explicitly, the ``_recurse`` function defined within ``walk_and_modify``
(in this postorder case) calls ``_handle_children()`` on the node
in question first.  If the node contains children, they are each fed back into
``_recurse`` and transformed into their final objects.  A new node is populated
with these transformed children, and that is the node that gets handed to
``tag.from_tree_tagged()``.  The effect is that the tag class receives
a structure containing only transformed children, so it has no need to
call ``tagged_tree_to_custom_tree`` on its own.

Future plans for SerializationContext
-------------------------------------

Currently, the ``AsdfFile`` itself is used as a container for serialization
parameters and is passed to various methods in block.py, reference.py,
schema.py, yamlutil.py, in ``ExtensionType`` subclasses, and others.  This
doesn't work very well for a couple of reasons.  For one, the intention of
``AsdfFile.write_to`` is to "export" a copy of the file to disk without
changing the in-memory ``AsdfFile``, but since serialization parameters
are read from the ``AsdfFile``, the code currently modifies the open file
as part of the write (and doesn't change it back).  The second issue is that
requiring an ``AsdfFile`` instance in so many method signatures forces
the code (or users themselves) to create an empty dummy ``AsdfFile`` just
to use the method.

The new ``Converter`` interface also accepts a ``ctx`` variable, but
instead of an ``AsdfFile`` it's an instance of ``SerializationContext``.  This
new object will serve the purpose of configuring serialization parameters
and keeping necessary state, which means that the ``AsdfFile`` can go
unmodified.  The ``SerializationContext`` will be relatively lightweight and
creating it will not incur as much of a performance penalty as creating an
``AsdfFile``.