File: CONTRIBUTING.md

# Contributing to Markup.ml

<br/>

#### Table of contents

- [Getting started](#getting-started)
- [Building and testing](#building)
- [Code overview](#code-overview)
  - [Common concepts](#common-concepts)
  - [Structure](#structure)



<br/>

<a id="getting-started"></a>
## Getting started

To get a development version of Markup.ml, do:

```
git clone https://github.com/aantron/markup.ml.git
cd markup.ml
opam install --deps-only .
```



<br/>

<a id="building"></a>
## Building and testing

To test the code, run `make test`. To generate a coverage report, run `make
coverage`. There are several other kinds of testing:

- `make performance-test` measures the time Markup.ml takes to parse some XML
  and HTML files. This requires `ocamlnet` and `xmlm` to be installed. Those
  libraries are also measured, for comparison.
- `make js-test` checks that `Markup_lwt` can be linked into a `js_of_ocaml`
  program, i.e. that it is not accidentally pulling in any Unix dependencies.
- `make dependency-test` pins and installs Markup.ml using opam, then builds
  some small programs that depend on Markup.ml. This tests correct installation
  and that no dependencies are missing.



<br/>

<a id="code-overview"></a>
## Code overview

<a id="common-concepts"></a>
### Common concepts

The library is internally written entirely in continuation-passing style (CPS),
i.e., roughly speaking, *using callbacks*. Except for really trivial helpers,
most internal functions in Markup.ml take two continuations (callbacks): one to
call if the function succeeds, and one to call if it fails with an exception.
So, for a function `f` that we would normally think of as taking one `int`
argument and returning a `string`, the type signature looks like this:

```ocaml
val f : int -> (exn -> unit) -> (string -> unit) -> unit
```

The code will call it on `1337` as `f 1337 throw k`. If `f` succeeds, say with
result `"foo"`, it will call `k "foo"`. If it fails, say with `Exit`, it will
call `throw Exit`.

The point of all this is that `f` doesn't have to return right away: it can,
perhaps transitively, trigger some I/O, and call `throw` or `k` only later,
when the I/O completes.
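
As a rough illustration of the convention described above, here is a minimal sketch. The body of `f`, the `pending` queue, and the `f_later` and `run_pending` helpers are all hypothetical and not part of Markup.ml; the queue merely stands in for I/O that completes at some later time.

```ocaml
(* Hypothetical CPS function with the signature shown above: it converts a
   non-negative int to a string, and reports failure through [throw]. *)
let f (n : int) (throw : exn -> unit) (k : string -> unit) : unit =
  if n < 0 then throw (Invalid_argument "f: negative input")
  else k (string_of_int n)

(* Stand-in for pending I/O: callbacks waiting to be run "later". *)
let pending : (unit -> unit) Queue.t = Queue.create ()

(* Same signature as [f], but it returns immediately without calling either
   continuation. The work is only scheduled. *)
let f_later (n : int) (throw : exn -> unit) (k : string -> unit) : unit =
  Queue.add (fun () -> f n throw k) pending

(* Simulates the I/O completing: runs everything that was scheduled. *)
let run_pending () : unit =
  while not (Queue.is_empty pending) do
    (Queue.pop pending) ()
  done

let () =
  f_later 1337
    (fun exn -> prerr_endline (Printexc.to_string exn))
    print_endline;
  (* Nothing has been printed yet; the continuation runs only now. *)
  run_pending ()
```

The caller of `f_later` cannot tell from the type whether the continuation has already run when the function returns, which is what lets the same code serve both synchronous and asynchronous I/O.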

Due to pervasive use of CPS, there are two useful type aliases defined in
[`Markup.Common`][common]:

```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit
```

With these aliases, the signature of `f` can be abbreviated as:

```ocaml
val f : int -> string cps
```

which is much more legible.
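
To show how the aliases read in practice, the following sketch chains two CPS functions. The `bind` helper and the `length` function are hypothetical names introduced here for illustration; they are not claimed to exist in `Markup.Common`.

```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit

(* Hypothetical helper: run [m], then pass its result to [g]. A failure in
   either step goes to the same [throw] continuation. *)
let bind (m : 'a cps) (g : 'a -> 'b cps) : 'b cps =
  fun throw k -> m throw (fun v -> g v throw k)

(* The example function from above, written with the alias. *)
let f (n : int) : string cps =
  fun throw k ->
    if n < 0 then throw (Invalid_argument "f: negative input")
    else k (string_of_int n)

(* A second, hypothetical CPS function that never fails. *)
let length (s : string) : int cps =
  fun _throw k -> k (String.length s)

let () =
  bind (f 1337) length
    (fun exn -> prerr_endline (Printexc.to_string exn))
    (fun n -> Printf.printf "%d\n" n)  (* prints 4 *)
```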

The other important internal type in Markup.ml is the continuation-passing style
stream, or *kstream* (`k` being the traditional meta-variable for a
continuation). The fundamental operation on a stream is getting the next
element, and for kstreams this looks like:

```ocaml
Kstream.next : 'a Kstream.t -> exn cont -> unit cont -> 'a cont -> unit
```

When you call `next kstream on_exn on_empty k`, `next` eventually calls:

- `on_exn exn` if trying to retrieve the next element resulted in exception
  `exn`,
- `on_empty ()` if the stream ended, or
- `k v` in the remaining case, when the stream has a next value `v`.
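
The three-continuation protocol can be sketched with a toy stream type. The `kstream` record, `of_list`, and `to_list` below are simplified stand-ins written for this example, and they should not be read as the actual definitions in `Markup.Kstream`.

```ocaml
type 'a cont = 'a -> unit

(* Toy stand-in for a kstream: a stream is just its [next] operation. *)
type 'a kstream = { next : exn cont -> unit cont -> 'a cont -> unit }

let next (s : 'a kstream) on_exn on_empty k = s.next on_exn on_empty k

(* A stream that yields the elements of a list, one per call to [next]. *)
let of_list (l : 'a list) : 'a kstream =
  let remaining = ref l in
  { next = fun _on_exn on_empty k ->
      match !remaining with
      | [] -> on_empty ()
      | v :: rest -> remaining := rest; k v }

(* Drain a stream and pass the collected elements, in order, to [k]. *)
let to_list (s : 'a kstream) (on_exn : exn cont) (k : 'a list cont) : unit =
  let rec loop acc =
    next s on_exn (fun () -> k (List.rev acc)) (fun v -> loop (v :: acc))
  in
  loop []

let () =
  to_list (of_list [1; 2; 3])
    (fun exn -> prerr_endline (Printexc.to_string exn))
    (fun l -> List.iter (Printf.printf "%d ") l; print_newline ())
```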

Each of the parsers and serializers in Markup.ml is a chain of stream
processors, tied together by these kstreams. For example, the HTML and XML parsers both...

- take a stream of bytes,
- transform it into a stream of Unicode characters paired with locations,
- transform that into a stream of language tokens, like "start tag,"
- and transform that into a stream of parsing signals, like "start element."
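
A much-simplified sketch of such a chain follows. It uses a toy `kstream` record, plain `char` values instead of Unicode characters, byte offsets instead of full locations, and a three-constructor `token` type. All of these are assumptions made for brevity, and the final signal stage is omitted.

```ocaml
type 'a cont = 'a -> unit
type 'a kstream = { next : exn cont -> unit cont -> 'a cont -> unit }

(* Stage 1: a stream of bytes read from a string. *)
let of_string (s : string) : char kstream =
  let i = ref 0 in
  { next = fun _on_exn on_empty k ->
      if !i >= String.length s then on_empty ()
      else begin
        let c = s.[!i] in
        incr i;
        k c
      end }

(* Stage 2: pair each character with its offset in the input. *)
let with_offsets (s : char kstream) : (int * char) kstream =
  let offset = ref 0 in
  { next = fun on_exn on_empty k ->
      s.next on_exn on_empty (fun c ->
        let o = !offset in
        incr offset;
        k (o, c)) }

(* Stage 3: toy tokens, each carrying the offset where it was found. *)
type token = Open of int | Close of int | Other of int * char

let tokens (s : (int * char) kstream) : token kstream =
  { next = fun on_exn on_empty k ->
      s.next on_exn on_empty (fun (o, c) ->
        match c with
        | '<' -> k (Open o)
        | '>' -> k (Close o)
        | _ -> k (Other (o, c))) }

(* Pull every element through the chain, calling [k] when the stream ends. *)
let rec iter (f : 'a -> unit) (s : 'a kstream) (on_exn : exn cont)
    (k : unit cont) : unit =
  s.next on_exn k (fun v -> f v; iter f s on_exn k)

let () =
  iter
    (function
      | Open o -> Printf.printf "open at %d\n" o
      | Close o -> Printf.printf "close at %d\n" o
      | Other (o, c) -> Printf.printf "%c at %d\n" c o)
    (tokens (with_offsets (of_string "<a>")))
    (fun exn -> prerr_endline (Printexc.to_string exn))
    (fun () -> print_endline "done")
```

Each stage only asks the previous one for an element when it is itself asked for one, so nothing is read from the input until the consumer at the end of the chain calls `next`.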

<br/>

The synchronous default API of Markup.ml, seen in the [`README`][readme], is a
thin wrapper over this internal implementation. What makes it synchronous is
that the underlying I/O functions guarantee that each call to a CPS function `f`
will call one of its continuations (callbacks) *before* `f` returns.
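
One way such a wrapper could look is sketched below. The `to_direct` function is a hypothetical name, and this is only an illustration of the idea under the stated guarantee, not the code Markup.ml actually uses.

```ocaml
type 'a cont = 'a -> unit
type 'a cps = exn cont -> 'a cont -> unit

(* Hypothetical wrapper: runs a CPS computation that is assumed to call one of
   its continuations before returning, and gives the outcome back in direct
   style, either as a return value or as a raised exception. *)
let to_direct (m : 'a cps) : 'a =
  let outcome = ref None in
  m (fun exn -> outcome := Some (Error exn))
    (fun v -> outcome := Some (Ok v));
  match !outcome with
  | Some (Ok v) -> v
  | Some (Error exn) -> raise exn
  | None -> failwith "to_direct: no continuation was called before returning"

let () =
  let n = to_direct (fun _throw k -> k 42) in
  Printf.printf "%d\n" n  (* prints 42 *)
```

The `None` case is exactly the situation that the synchronous I/O functions rule out. With asynchronous I/O it would be reached, which is why that API needs a different wrapper.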

Likewise, the Lwt API is another thin wrapper, which translates between CPS and
Lwt promises. What makes this API asynchronous is that underlying I/O functions
might not call their continuations until long after the functions have returned,
and this delay is propagated to the continuations nearest to the surface API.

[readme]: https://github.com/aantron/markup.ml#readme

<br/>

<a id="structure"></a>
### Structure

As for how the stream processors are chained together, the HTML specification
strongly suggests a structure for the parser in the section
[*8.2.1 Overview of the parsing model*][model], from which the following diagram
is taken:

<p align="center">
<img src="https://www.w3.org/TR/html5/images/parsing-model-overview.svg" />
</p>

[model]: https://www.w3.org/TR/html5/syntax.html#overview-of-the-parsing-model

The XML parser follows the same structure, even though it is not explicitly
suggested by the XML specification.

The modules can be arranged in the following categories. Where a module directly
implements a box from the diagram, the box name is indicated in boldface.

Apart from the modules dealing with Lwt, only `Markup.Stream_io` does I/O. The
rest of the modules are pure with respect to I/O.

Almost everything is based directly on specifications. Most functions are
commented with the HTML or XML specification section number they are
implementing. It may also be useful to see the [conformance status][conformance]
– these are all the known deviations by Markup.ml from the specifications.

#### Helpers

- [`Markup.Common`][common] – shared definitions, compiler compatibility, etc.
- [`Markup.Error`][error] – parsing and serialization error type. Markup.ml does
  not throw exceptions, because all errors are recoverable.
- [`Markup.Namespace`][namespace] – namespace URI to prefix conversion and back.
- [`Markup.Entities`][entities] – checked-in auto-generated HTML5 entity list.
  The source for this file is `src/entities.json`, and the generator is
  `src/translate_entities.ml`. Neither of these latter two files is part of the
  built Markup.ml, nor of the build process.
- [`Markup.Trie`][trie] – trie for incrementally searching the entity list.
- [`Markup.Kstream`][kstream] – above-mentioned CPS streams.
- [`Markup.Text`][text] – some utilities for `Markup.Html_tokenizer` and
  `Markup.Xml_tokenizer`; see below.

#### I/O

- [`Markup.Stream_io`][stream_io] – make byte streams from files, strings, etc.,
  write byte streams to strings, etc. – the first stage of parsing and the last
  stage of serialization (**Network** in the diagram). This uses the I/O
  functions in `Pervasives`.

#### Encodings

- [`Markup.Encoding`][encoding] – byte streams to Unicode character streams
  (**Byte Stream Decoder** in the diagram). For UTF-8, this is a wrapper around
  `uutf`.
- [`Markup.Detect`][detect] – prescans byte streams to detect encodings.
- [`Markup.Input`][input] – Unicode streams to "preprocessed" Unicode streams –
  in HTML5 parlance, this just means normalizing CR-LF to LF, and attaching
  locations (**Input Stream Preprocessor** in the diagram).
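
The preprocessing step can be sketched as follows, assuming (as in the HTML specification) that CR LF pairs and lone CRs both become a single LF. The `preprocess` function is hypothetical: it works on a plain list of `char` values and uses 1-based `(line, column)` pairs, whereas `Markup.Input` operates on streams of Unicode characters.

```ocaml
(* Toy preprocessor: normalizes newlines to LF and pairs every output
   character with the (line, column) at which it starts, both 1-based. *)
let preprocess (chars : char list) : ((int * int) * char) list =
  let rec go line column acc = function
    | [] -> List.rev acc
    | '\r' :: '\n' :: rest | '\r' :: rest | '\n' :: rest ->
      (* Any newline sequence becomes one LF and starts a new line. *)
      go (line + 1) 1 (((line, column), '\n') :: acc) rest
    | c :: rest ->
      go line (column + 1) (((line, column), c) :: acc) rest
  in
  go 1 1 [] chars

let () =
  preprocess ['a'; '\r'; '\n'; 'b']
  |> List.iter (fun ((line, column), c) ->
       Printf.printf "%d:%d %S\n" line column (String.make 1 c))
```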

#### HTML parsing

- [`Markup.Html_tokenizer`][html_tokenizer] – preprocessed Unicode streams to
  HTML lexeme streams (**Tokenizer** in the diagram). HTML lexemes are things
  like start tags, end tags, and runs of text.
- [`Markup.Html_parser`][html_parser] – HTML lexeme streams to HTML signal
  streams (**Tree Construction** in the diagram). Signals are things like
  "start an element," "start another element as its child," "now end the child,"
  "now end the root element." A signal stream is basically a left-to-right
  traversal of a DOM tree, without the DOM tree actually being in memory.
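
The relationship between a tree and its signals can be illustrated with toy types. The `tree` and `signal` definitions below are simplified assumptions made for this example; Markup.ml's real signal type carries more information, such as attributes and namespaces.

```ocaml
(* Toy DOM-like tree. *)
type tree =
  | Element of string * tree list
  | Text of string

(* Toy signals: what a parser would emit while walking such a tree. *)
type signal =
  | Start_element of string
  | End_element
  | Text_signal of string

(* Left-to-right traversal: the signals a parser would emit for [t]. A real
   parser produces them incrementally and never builds the tree. *)
let rec signals (t : tree) : signal list =
  match t with
  | Text s -> [Text_signal s]
  | Element (name, children) ->
    [Start_element name]
    @ List.concat (List.map signals children)
    @ [End_element]

let () =
  signals (Element ("p", [Text "hi"; Element ("em", [])]))
  |> List.iter (function
       | Start_element name -> Printf.printf "start %s\n" name
       | End_element -> print_endline "end"
       | Text_signal s -> Printf.printf "text %S\n" s)
```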

#### XML parsing

- [`Markup.Xml_tokenizer`][xml_tokenizer] – as for HTML above, but for XML.
- [`Markup.Xml_parser`][xml_parser] – as for HTML above, but for XML.

#### HTML writing

- [`Markup.Html_writer`][html_writer] – HTML signal streams back to
  UTF-8-encoded byte streams.

#### XML writing

- [`Markup.Xml_writer`][xml_writer] – as for HTML above, but for XML.

#### User-friendly APIs

- [`Markup.Utility`][utility] – convenience functions on signal streams for the
  user.
- [`Markup`][main], [`Markup_lwt`][lwt], [`Markup_lwt_unix`][lwt_unix] – the
  public interface for operating all of the above machinery without having to
  touch CPS.

[common]: https://github.com/aantron/markup.ml/blob/master/src/common.ml
[error]: https://github.com/aantron/markup.ml/blob/master/src/error.ml
[namespace]: https://github.com/aantron/markup.ml/blob/master/src/namespace.mli
[entities]: https://github.com/aantron/markup.ml/blob/master/src/entities.ml
[trie]: https://github.com/aantron/markup.ml/blob/master/src/trie.ml
[kstream]: https://github.com/aantron/markup.ml/blob/master/src/kstream.mli
[stream_io]: https://github.com/aantron/markup.ml/blob/master/src/stream_io.ml
[encoding]: https://github.com/aantron/markup.ml/blob/master/src/encoding.ml
[input]: https://github.com/aantron/markup.ml/blob/master/src/input.mli
[html_tokenizer]: https://github.com/aantron/markup.ml/blob/master/src/html_tokenizer.mli
[html_parser]: https://github.com/aantron/markup.ml/blob/master/src/html_parser.mli
[html_writer]: https://github.com/aantron/markup.ml/blob/master/src/html_writer.mli
[xml_tokenizer]: https://github.com/aantron/markup.ml/blob/master/src/xml_tokenizer.mli
[xml_parser]: https://github.com/aantron/markup.ml/blob/master/src/xml_parser.mli
[xml_writer]: https://github.com/aantron/markup.ml/blob/master/src/xml_writer.mli
[text]: https://github.com/aantron/markup.ml/blob/master/src/text.ml
[detect]: https://github.com/aantron/markup.ml/blob/master/src/detect.mli
[utility]: https://github.com/aantron/markup.ml/blob/master/src/utility.ml
[main]: https://github.com/aantron/markup.ml/blob/master/src/markup.mli
[lwt]: https://github.com/aantron/markup.ml/blob/master/src/markup_lwt.mli
[lwt_unix]: https://github.com/aantron/markup.ml/blob/master/src/markup_lwt_unix.mli
[conformance]: http://aantron.github.io/markup.ml/#2_Conformancestatus