# Contributing to Markup.ml
<br/>
#### Table of contents
- [Getting started](#getting-started)
- [Building and testing](#building)
- [Code overview](#code-overview)
  - [Common concepts](#common-concepts)
  - [Structure](#structure)
<br/>
<a id="getting-started"></a>
## Getting started
To get a development version of Markup.ml, do:
```
git clone https://github.com/aantron/markup.ml.git
cd markup.ml
opam install --deps-only .
```
<br/>
<a id="building"></a>
## Building and testing
To test the code, run `make test`. To generate a coverage report, run `make
coverage`. There are several other kinds of testing:
- `make performance-test` measures time for Markup.ml to parse some XML and HTML
files. You should have `ocamlnet` and `xmlm` installed. Those libraries will
also be measured, for comparison.
- `make js-test` checks that `Markup_lwt` can be linked into a `js_of_ocaml`
program, i.e. that it is not accidentally pulling in any Unix dependencies.
- `make dependency-test` pins and installs Markup.ml using opam, then builds
some small programs that depend on Markup.ml. This tests correct installation
and that no dependencies are missing.
<br/>
<a id="code-overview"></a>
## Code overview
<a id="common-concepts"></a>
### Common concepts
The library is internally written entirely in continuation-passing style (CPS),
i.e., roughly speaking, *using callbacks*. Except for really trivial helpers,
most internal functions in Markup.ml take two continuations (callbacks): one to
call if the function succeeds, and one to call if it fails with an exception.
So, for a function `f` that we would ordinarily think of as taking one `int`
argument and returning a `string`, the type signature would look like this:
```ocaml
val f : int -> (exn -> unit) -> (string -> unit) -> unit
```
The code will call it on `1337` as `f 1337 throw k`. If `f` succeeds, say with
result `"foo"`, it will call `k "foo"`. If it fails, say with `Exit`, it will
call `throw Exit`.
The point of all this is that `f` doesn't have to return right away: it can,
perhaps transitively, trigger some I/O, and call `throw` or `k` only later,
when the I/O completes.
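
As a minimal sketch of this calling convention, here is one possible body for the illustrative `f` from above. The function is hypothetical and not part of Markup.ml; the choice to fail on negative input is an assumption made only so that both continuations get exercised.

```ocaml
(* Hypothetical CPS function matching the signature shown above:
   val f : int -> (exn -> unit) -> (string -> unit) -> unit *)
let f (n : int) (throw : exn -> unit) (k : string -> unit) : unit =
  if n < 0 then
    (* Failure path: report the exception through [throw]. *)
    throw Exit
  else
    (* Success path: hand the result to [k] instead of returning it. *)
    k (string_of_int n)

let () =
  f 1337
    (fun exn -> prerr_endline (Printexc.to_string exn))
    (fun s -> print_endline s)   (* prints "1337" *)
```

Nothing here requires `f` to call `k` immediately; a real implementation could store `k` and call it from an I/O completion handler.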
Due to pervasive use of CPS, there are two useful type aliases defined in
[`Markup.Common`][common]:
```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit
```
With these aliases, the signature of `f` can be abbreviated as:
```ocaml
val f : int -> string cps
```
which is much more legible.
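
To illustrate how functions of type `'a cps` are sequenced, the sketch below redefines the two aliases locally and chains two hypothetical helpers, `parse_int` and `double`. Neither helper exists in Markup.ml; they are assumptions for the example.

```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit

(* Hypothetical helper: parse a string, failing through [throw]. *)
let parse_int (s : string) : int cps = fun throw k ->
  match int_of_string_opt s with
  | Some n -> k n
  | None -> throw (Failure ("not an integer: " ^ s))

(* Hypothetical helper that cannot fail, so [throw] is unused. *)
let double (n : int) : int cps = fun _throw k ->
  k (n * 2)

(* Sequencing: the success continuation of the first step starts the
   second step, and the same [throw] is passed to both. *)
let parse_and_double (s : string) : int cps = fun throw k ->
  parse_int s throw (fun n -> double n throw k)
```

Passing the same `throw` down the chain means a failure at any step reaches the caller's error continuation directly.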
The other important internal type in Markup.ml is the continuation-passing style
stream, or *kstream* (`k` being the traditional meta-variable for a
continuation). The fundamental operation on a stream is getting the next
element, and for kstreams this looks like:
```ocaml
Kstream.next : 'a Kstream.t -> exn cont -> unit cont -> 'a cont -> unit
```
When you call `next kstream on_exn on_empty k`, `next` eventually calls:
- `on_exn exn` if trying to retrieve the next element resulted in exception
`exn`,
- `on_empty ()` if the stream ended, or
- `k v` in the remaining case, when the stream has a next value `v`.
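
The three-continuation protocol can be sketched with a toy stream type. This is not the actual `Kstream` implementation: the record type `kstream` and the helpers `of_list` and `to_list` below are simplified stand-ins, assumed here only to show how `next` is called.

```ocaml
type 'a cont = 'a -> unit

(* Toy stand-in for a kstream: a stream is just its "next" function. *)
type 'a kstream = { next : exn cont -> unit cont -> 'a cont -> unit }

let next s on_exn on_empty k = s.next on_exn on_empty k

(* Build a toy stream from a list. It never fails, so [on_exn] is unused. *)
let of_list (items : 'a list) : 'a kstream =
  let remaining = ref items in
  { next = fun _on_exn on_empty k ->
      match !remaining with
      | [] -> on_empty ()
      | x :: rest -> remaining := rest; k x }

(* Drain a stream into a list, delivering the list to [k] at the end. *)
let to_list (s : 'a kstream) (on_exn : exn cont) (k : 'a list cont) : unit =
  let rec loop acc =
    next s on_exn
      (fun () -> k (List.rev acc))   (* stream ended *)
      (fun v -> loop (v :: acc))     (* got a value; ask for the next one *)
  in
  loop []
```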
Each of the parsers and serializers in Markup.ml is a chain of stream
processors, tied together by these kstreams. For example, the HTML and XML parsers both...
- take a stream of bytes,
- transform it into a stream of Unicode characters paired with locations,
- transform that into a stream of language tokens, like "start tag,"
- and transform that into a stream of parsing signals, like "start element."
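
The chaining idea can be sketched as follows, again with a toy stream type rather than the real `Kstream`. The stages `map` and `with_offsets` and the `pipeline` function are hypothetical; they only mimic the shape of "bytes to characters with locations" on a very small scale.

```ocaml
type 'a cont = 'a -> unit
type 'a kstream = { next : exn cont -> unit cont -> 'a cont -> unit }

let of_list items =
  let remaining = ref items in
  { next = fun _on_exn on_empty k ->
      match !remaining with
      | [] -> on_empty ()
      | x :: rest -> remaining := rest; k x }

(* Stage: apply [f] to each element pulled from the upstream stream. *)
let map f s =
  { next = fun on_exn on_empty k ->
      s.next on_exn on_empty (fun v -> k (f v)) }

(* Stage: pair each element with a running offset, as a toy "location". *)
let with_offsets s =
  let offset = ref 0 in
  { next = fun on_exn on_empty k ->
      s.next on_exn on_empty (fun v ->
        let here = !offset in
        incr offset;
        k (here, v)) }

let to_list s on_exn k =
  let rec loop acc =
    s.next on_exn (fun () -> k (List.rev acc)) (fun v -> loop (v :: acc))
  in
  loop []

(* Each stage wraps the previous one; nothing runs until [next] is called. *)
let pipeline chars =
  of_list chars |> map Char.uppercase_ascii |> with_offsets
```

Because every stage only pulls from its upstream when asked, a delay in the first stage (for example, waiting for I/O) naturally delays all the stages after it.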
<br/>
The synchronous default API of Markup.ml, seen in the [`README`][readme], is a
thin wrapper over this internal implementation. What makes it synchronous is
that the underlying I/O functions guarantee that each call to a CPS function `f`
will call one of its continuations (callbacks) *before* `f` returns.
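
A rough sketch of how such a synchronous wrapper can work is shown below. The helper `run_sync` is hypothetical and may differ from what Markup.ml actually does; it assumes that the CPS computation calls one of its continuations before returning.

```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit

(* Hypothetical helper: run a CPS computation and return its result in
   direct style. Errors are re-raised as ordinary exceptions. *)
let run_sync (computation : 'a cps) : 'a =
  let result = ref None in
  computation (fun exn -> raise exn) (fun v -> result := Some v);
  match !result with
  | Some v -> v
  | None ->
    (* Only reachable if the computation deferred its continuation. *)
    failwith "continuation was not called before return"

let () =
  print_endline (run_sync (fun _throw k -> k "done"))   (* prints "done" *)
```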
Likewise, the Lwt API is another thin wrapper, which translates between CPS and
Lwt promises. What makes this API asynchronous is that underlying I/O functions
might not call their continuations until long after the functions have returned,
and this delay is propagated to the continuations nearest to the surface API.
[readme]: https://github.com/aantron/markup.ml#readme
<br/>
<a id="structure"></a>
### Structure
As for how the stream processors are chained together, the HTML specification
strongly suggests a structure for the parser in the section
[*8.2.1 Overview of the parsing model*][model], from which the following diagram
is taken:
<p align="center">
<img src="https://www.w3.org/TR/html5/images/parsing-model-overview.svg" />
</p>
[model]: https://www.w3.org/TR/html5/syntax.html#overview-of-the-parsing-model
The XML parser follows the same structure, even though it is not explicitly
suggested by the XML specification.
The modules can be arranged in the following categories. Where a module directly
implements a box from the diagram, the box name is indicated in boldface.
Apart from the modules dealing with Lwt, only `Markup.Stream_io` does I/O. The
rest of the modules are pure with respect to I/O.
Almost everything is based directly on specifications. Most functions are
commented with the HTML or XML specification section number they are
implementing. It may also be useful to see the [conformance status][conformance]
– these are all the known deviations by Markup.ml from the specifications.
#### Helpers
- [`Markup.Common`][common] – shared definitions, compiler compatibility, etc.
- [`Markup.Error`][error] – parsing and serialization error type. Markup.ml does
not throw exceptions, because all errors are recoverable.
- [`Markup.Namespace`][namespace] – namespace URI to prefix conversion and back.
- [`Markup.Entities`][entities] – checked-in auto-generated HTML5 entity list.
The source for this file is `src/entities.json`, and the generator is
`src/translate_entities.ml`. Neither of these latter two files is part of the
built Markup.ml, nor of the build process.
- [`Markup.Trie`][trie] – trie for incrementally searching the entity list.
- [`Markup.Kstream`][kstream] – above-mentioned CPS streams.
- [`Markup.Text`][text] – some utilities for `Markup.Html_tokenizer` and
`Markup.Xml_tokenizer`; see below.
#### I/O
- [`Markup.Stream_io`][stream_io] – make byte streams from files, strings, etc.,
write byte streams to strings, etc. – the first stage of parsing and the last
stage of serialization (**Network** in the diagram). This uses the I/O
functions in `Pervasives`.
#### Encodings
- [`Markup.Encoding`][encoding] – byte streams to Unicode character streams
(**Byte Stream Decoder** in the diagram). For UTF-8, this is a wrapper around
`uutf`.
- [`Markup.Detect`][detect] – prescans byte streams to detect encodings.
- [`Markup.Input`][input] – Unicode streams to "preprocessed" Unicode streams –
in HTML5 parlance, this just means normalizing CR-LF to LF, and attaching
locations (**Input Stream Preprocessor** in the diagram).
#### HTML parsing
- [`Markup.Html_tokenizer`][html_tokenizer] – preprocessed Unicode streams to
HTML lexeme streams (**Tokenizer** in the diagram). HTML lexemes are things
like start tags, end tags, and runs of text.
- [`Markup.Html_parser`][html_parser] – HTML lexeme streams to HTML signal
streams (**Tree Construction** in the diagram). Signal streams are things like
"start an element," "start another element as its child," "now end the child,"
"now end the root element." They are basically a left-to-right traversal of a
DOM tree, without the DOM tree actually being in memory.
#### XML parsing
- [`Markup.Xml_tokenizer`][xml_tokenizer] – as for HTML above, but for XML.
- [`Markup.Xml_parser`][xml_parser] – as for HTML above, but for XML.
#### HTML writing
- [`Markup.Html_writer`][html_writer] – HTML signal streams back to
UTF-8-encoded byte streams.
#### XML writing
- [`Markup.Xml_writer`][xml_writer] – as for HTML above, but for XML.
#### User-friendly APIs
- [`Markup.Utility`][utility] – convenience functions on signal streams for the
user.
- [`Markup`][main], [`Markup_lwt`][lwt], [`Markup_lwt_unix`][lwt_unix] – the
public interface for operating all of the above machinery without having to
touch CPS.
[common]: https://github.com/aantron/markup.ml/blob/master/src/common.ml
[error]: https://github.com/aantron/markup.ml/blob/master/src/error.ml
[namespace]: https://github.com/aantron/markup.ml/blob/master/src/namespace.mli
[entities]: https://github.com/aantron/markup.ml/blob/master/src/entities.ml
[trie]: https://github.com/aantron/markup.ml/blob/master/src/trie.ml
[kstream]: https://github.com/aantron/markup.ml/blob/master/src/kstream.mli
[stream_io]: https://github.com/aantron/markup.ml/blob/master/src/stream_io.ml
[encoding]: https://github.com/aantron/markup.ml/blob/master/src/encoding.ml
[input]: https://github.com/aantron/markup.ml/blob/master/src/input.mli
[html_tokenizer]: https://github.com/aantron/markup.ml/blob/master/src/html_tokenizer.mli
[html_parser]: https://github.com/aantron/markup.ml/blob/master/src/html_parser.mli
[html_writer]: https://github.com/aantron/markup.ml/blob/master/src/html_writer.mli
[xml_tokenizer]: https://github.com/aantron/markup.ml/blob/master/src/xml_tokenizer.mli
[xml_parser]: https://github.com/aantron/markup.ml/blob/master/src/xml_parser.mli
[xml_writer]: https://github.com/aantron/markup.ml/blob/master/src/xml_writer.mli
[text]: https://github.com/aantron/markup.ml/blob/master/src/text.ml
[detect]: https://github.com/aantron/markup.ml/blob/master/src/detect.mli
[utility]: https://github.com/aantron/markup.ml/blob/master/src/utility.ml
[main]: https://github.com/aantron/markup.ml/blob/master/src/markup.mli
[lwt]: https://github.com/aantron/markup.ml/blob/master/src/markup_lwt.mli
[lwt_unix]: https://github.com/aantron/markup.ml/blob/master/src/markup_lwt_unix.mli
[conformance]: http://aantron.github.io/markup.ml/#2_Conformancestatus