File: README.md

# Markup.ml   [![CI status][ci-img]][ci] [![Coverage][coveralls-img]][coveralls]

[ci]:            https://github.com/aantron/markup.ml/actions
[ci-img]:        https://img.shields.io/github/workflow/status/aantron/markup.ml/test/master
[coveralls]:     https://coveralls.io/github/aantron/markup.ml?branch=master
[coveralls-img]: https://img.shields.io/coveralls/aantron/markup.ml/master.svg

Markup.ml is a pair of parsers implementing the [HTML5][HTML5] and [XML][XML]
specifications, including error recovery. Usage is simple, because each parser
is a function from byte streams to parsing signal streams:

![Usage example][sample]

[sample]: https://github.com/aantron/markup.ml/blob/master/docs/sample.png

In addition to being error-correcting, the parsers are:

- **streaming**: parsing partial input and emitting signals while more input is
  still being received;
- **lazy**: not parsing input unless you have requested the next parsing signal,
  so you can easily stop parsing partway through a document;
- **non-blocking**: they can be used with [Lwt][lwt], but still provide a
  straightforward synchronous interface for simple usage; and
- **one-pass**: memory consumption is limited since the parsers don't build up a
  document representation, nor buffer input beyond a small amount of lookahead.
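
As a rough illustration of the **lazy** property, the sketch below pulls signals one at a time with `next` and stops at the first start tag, so the remaining input is not parsed. It assumes the `markup` library is installed and linked; `first_element_name` is a hypothetical helper written for this example, not part of Markup.ml.

```ocaml
(* Assumption: compiled against the [markup] library. *)
open Markup

(* Hypothetical helper: return the local name of the first start tag,
   requesting no more signals than necessary. *)
let first_element_name html =
  let stream = string html |> parse_html |> signals in
  let rec find () =
    match next stream with
    | Some (`Start_element ((_ns, name), _attrs)) -> Some name
    | Some _ -> find ()   (* skip text, comments, doctype, ... *)
    | None -> None        (* end of input, no element found *)
  in
  find ()

let () =
  match first_element_name "<body><p><em>Markup.ml<p>rocks!" with
  | Some name -> print_endline name
  | None -> print_endline "(no elements)"
```

Because the stream is only advanced by `next`, the same pattern applies to large inputs where only a prefix of the document is of interest.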

The parsers detect character encodings automatically, and emit everything in
UTF-8. The HTML parser understands SVG and MathML, in addition to HTML5.

Here is a breakdown showing the signal stream and errors emitted during the
parsing and pretty-printing of `bad_html`:

```ocaml
string bad_html         "<body><p><em>Markup.ml<p>rocks!"

|> parse_html           `Start_element "body"
|> signals              `Start_element "p"
                        `Start_element "em"
                        `Text ["Markup.ml"]
                        ~report (1, 10) (`Unmatched_start_tag "em")
                        `End_element                   (* </em>: recovery *)
                        `End_element                   (* </p>: not an error *)
                        `Start_element "p"
                        `Start_element "em"            (* recovery *)
                        `Text ["rocks!"]
                        `End_element                   (* </em> *)
                        `End_element                   (* </p> *)
                        `End_element                   (* </body> *)

|> pretty_print         (* adjusts the `Text signals *)

|> write_html
|> to_channel stdout;;  "...shown above..."            (* valid HTML *)
```
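
The `~report` callback that appears in the breakdown can also be used to gather errors instead of printing them. The following is a minimal sketch, assuming the `markup` library is available; `errors_of_html` is a hypothetical name chosen for this example.

```ocaml
(* Assumption: compiled against the [markup] library. *)
open Markup

(* Hypothetical helper: parse [html] to the end and return every
   reported error as a human-readable string, in document order. *)
let errors_of_html html =
  let errors = ref [] in
  let report location error =
    errors := Error.to_string ~location error :: !errors
  in
  (* [drain] forces the whole stream, so all errors get reported. *)
  string html |> parse_html ~report |> signals |> drain;
  List.rev !errors

let () =
  errors_of_html "<body><p><em>Markup.ml<p>rocks!"
  |> List.iter prerr_endline
```

Since the parsers recover from errors, the signal stream remains usable even when the returned list is not empty.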

The parsers are [tested][tests] thoroughly.

For a higher-level parser, see [Lambda Soup][lambdasoup], which is based on
Markup.ml, but can search documents using CSS selectors, and perform various
manipulations.

<br/>

## Overview and basic usage

The interface is centered around four functions between byte streams and signal
streams: [`parse_html`][parse_html], [`write_html`][write_html],
[`parse_xml`][parse_xml], and [`write_xml`][write_xml]. These have several
optional arguments for fine-tuning their behavior. The rest of the functions
either [input][input] or [output][output] byte streams, or
[transform][transform] signal streams in some interesting way.
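
A minimal round trip through two of these functions might look like the sketch below. It assumes the `markup` library is installed; `reserialize` is a hypothetical helper name, and the exact output is not shown because it depends on the parser's error recovery.

```ocaml
(* Assumption: compiled against the [markup] library. *)
open Markup

(* Hypothetical helper: bytes -> signals -> bytes.
   [string] builds a byte stream from an OCaml string,
   [to_string] collects the output byte stream back into one. *)
let reserialize html =
  string html |> parse_html |> signals |> write_html |> to_string

let () = print_endline (reserialize "<body><p><em>Markup.ml<p>rocks!")
```

Swapping `parse_html` and `write_html` for `parse_xml` and `write_xml` should give the equivalent XML pipeline.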

Here is an example with an optional argument:

```ocaml
open Markup

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

file "some.xml" |> fst |> parse_xml ~report |> signals |> drain
```
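
Signal streams can be transformed with ordinary stream functions before they are consumed. As a sketch under the assumption that the `markup` library is linked, the hypothetical helper `text_of_html` below keeps only the `` `Text `` signals and concatenates their contents.

```ocaml
(* Assumption: compiled against the [markup] library. *)
open Markup

(* Hypothetical helper: drop all markup and return the text content. *)
let text_of_html html =
  string html
  |> parse_html
  |> signals
  |> filter_map (function
       | `Text parts -> Some (String.concat "" parts)
       | _ -> None)
  |> to_list
  |> String.concat ""

let () = print_endline (text_of_html "<body><p><em>Markup.ml<p>rocks!")
```

The filtering step is itself lazy; nothing is parsed until `to_list` starts pulling values from the stream.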

[input]: http://aantron.github.io/markup.ml/#2_Inputsources
[output]: http://aantron.github.io/markup.ml/#2_Outputdestinations
[transform]: http://aantron.github.io/markup.ml/#2_Utility

<br/>

## Advanced: [Cohttp][cohttp] + Markup.ml + [Lambda Soup][lambdasoup] + [Lwt][lwt]

This program requests a Google search, then does a streaming scrape of result
titles. It exits when it finds a GitHub link, without reading more input. Only
one `h3` element is converted into an in-memory tree at a time.

```ocaml
let () =
  Lwt_main.run begin
    (* Send request. Assume success. *)
    let url = "https://www.google.com/search?q=markup.ml" in
    let%lwt _, body = Cohttp_lwt_unix.Client.get (Uri.of_string url) in

    (* Adapt response to a Markup.ml stream. *)
    let body = body |> Cohttp_lwt.Body.to_stream |> Markup_lwt.lwt_stream in

    (* Set up a lazy stream of h3 elements. *)
    let h3s = Markup.(body
      |> strings_to_bytes |> parse_html |> signals
      |> elements (fun (_ns, name) _attrs -> name = "h3"))
    in

    (* Find the GitHub link. .iter and .load cause actual reading of data. *)
    h3s |> Markup_lwt.iter (fun h3 ->
      let%lwt h3 = Markup_lwt.load h3 in
      match Soup.(from_signals h3 $? "a[href*=github]") with
      | None -> Lwt.return_unit
      | Some anchor ->
        print_endline (String.concat "" (Soup.texts anchor));
        exit 0)
  end
```

This prints
`GitHub - aantron/markup.ml: Error-recovering streaming HTML5 and ...`. To build
and run it, save the program as `scrape.ml` and run:

```sh
ocamlfind opt -linkpkg -package lwt.ppx,cohttp.lwt,markup.lwt,lambdasoup \
    scrape.ml && ./a.out
```

You can install all the necessary packages with:

```sh
opam install lwt_ssl
opam install cohttp-lwt-unix lambdasoup markup
```

<br/>

## Installing

```sh
opam install markup
```

<br/>

## Documentation

The interface of Markup.ml consists of three modules: [`Markup`][Markup],
[`Markup_lwt`][Markup_lwt], and [`Markup_lwt_unix`][Markup_lwt_unix]. The last
two are available only if you have [Lwt][lwt] installed (OPAM package `lwt`).

The documentation includes a summary of the [conformance status][conformance] of
Markup.ml.

<br/>

## Depending

Markup.ml uses [semantic versioning][semver], but is currently in `0.x.x`. The
minor version number will be incremented on breaking changes.

<br/>

## Contributing

Contributions are very much welcome. Please see [`CONTRIBUTING`][contributing]
for instructions, suggestions, and an overview of the code. There is also a list
of [easy issues][easy].

<br/>

## License

Markup.ml is distributed under the [MIT license][license]. The Markup.ml source
distribution includes a copy of the HTML5 entity list, which is distributed
under the [W3C document license][w3c-license].

[parse_html]:      http://aantron.github.io/markup.ml/#VALparse_html
[write_html]:      http://aantron.github.io/markup.ml/#VALwrite_html
[parse_xml]:       http://aantron.github.io/markup.ml/#VALparse_xml
[write_xml]:       http://aantron.github.io/markup.ml/#VALwrite_xml
[HTML5]:           https://www.w3.org/TR/html5/
[XML]:             https://www.w3.org/TR/xml/
[tests]:           https://github.com/aantron/markup.ml/tree/master/test
[signal]:          http://aantron.github.io/markup.ml/#TYPEsignal
[lwt]:             https://github.com/ocsigen/lwt
[lambdasoup]:      https://github.com/aantron/lambda-soup
[cohttp]:          https://github.com/mirage/ocaml-cohttp
[license]:         https://github.com/aantron/markup.ml/blob/master/LICENSE.md
[contributing]:    https://github.com/aantron/markup.ml/blob/master/docs/CONTRIBUTING.md
[email]:           mailto:antonbachin@yahoo.com
[Markup]:          http://aantron.github.io/markup.ml
[Markup_lwt]:      http://aantron.github.io/markup.ml/Markup_lwt.html
[Markup_lwt_unix]: http://aantron.github.io/markup.ml/Markup_lwt_unix.html
[conformance]:     http://aantron.github.io/markup.ml/#2_Conformancestatus
[w3c-license]:     https://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
[semver]:          http://semver.org/
[easy]:            https://github.com/aantron/markup.ml/labels/easy