File: parsing.md

# roxmltree parsing strategy

XML parsing is hard, and everyone knows that. A separate problem is that a parsed
document can be represented in very different ways:

- You can preserve comments or ignore them completely or partially.
- You can represent text data as a separate node or embed it into the element node.
- You can keep CDATA as a separate node or merge it into the text node.
- You can preserve the XML declaration or ignore it completely.
- ... and many more.

This document explains how *roxmltree* parses and represents the XML document.

## XML declaration

The [XML declaration](https://www.w3.org/TR/xml/#NT-XMLDecl) is completely ignored,
mostly because it doesn't contain any valuable information for us.

- `version` is expected to be `1.*`. Otherwise an error will occur.
- `encoding` is irrelevant since we are parsing only valid UTF-8 strings.
- And no one really follows the `standalone` constraints.
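
The first two points can be illustrated with a small standard-library sketch. It is not *roxmltree*'s code: the function names `to_parser_input` and `is_supported_version` are hypothetical, and the version check is only one possible reading of `1.*` (the prefix `1.` followed by ASCII digits).

```rust
use std::str::Utf8Error;

/// Converts raw bytes into the `&str` that a UTF-8-only parser needs.
/// Fails if the bytes are not valid UTF-8; no transcoding is attempted.
fn to_parser_input(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Assumed interpretation of "`version` is expected to be `1.*`":
/// the literal prefix `1.` followed by at least one ASCII digit.
fn is_supported_version(version: &str) -> bool {
    match version.strip_prefix("1.") {
        Some(rest) => !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}
```

Because a Rust `&str` is always valid UTF-8, any input that reaches a `&str`-based parser has already passed this check, which is why the `encoding` attribute carries no extra information.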

## DTD

Only `ENTITY` declarations are resolved. Everything else in the DTD is ignored
at the moment.

```xml
<!DOCTYPE test [
    <!ENTITY a 'text<p/>text'>
]>
<e>&a;</e>
```

will be parsed into:

```xml
<e>text<p/>text</e>
```

Here `p` is an element, not text.
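
The effect of the expansion above can be modelled as plain textual substitution that happens before the content is parsed. The sketch below is a simplified illustration under that assumption, not the library's implementation; `expand_entities` is a hypothetical helper, and it performs a single pass without handling nested or recursive entities.

```rust
use std::collections::HashMap;

/// Replaces every `&name;` whose name is in `entities` with its value.
/// Unknown references are copied through unchanged.
fn expand_entities(input: &str, entities: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        if let Some(semi) = after.find(';') {
            if let Some(value) = entities.get(&after[..semi]) {
                // Known entity: splice in its replacement text.
                out.push_str(value);
                rest = &after[semi + 1..];
                continue;
            }
        }
        // Not a known entity reference: keep the `&` literally.
        out.push('&');
        rest = after;
    }
    out.push_str(rest);
    out
}
```

With a map containing `a` mapped to `text<p/>text`, calling `expand_entities("<e>&a;</e>", &entities)` yields `<e>text<p/>text</e>`. Parsing that result as XML is what turns `<p/>` into an element rather than text.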

## Comments

All comments will be preserved.

## Processing instructions

All processing instructions will be preserved.

## Whitespaces

All whitespace inside the root element will be preserved.

```xml
<p>
    text
</p>
```

This will be parsed as `\n␣␣␣␣text\n`.

The same goes for escaped whitespace:

```xml
<p>&#x20;&#x20;text&#x20;&#x20;</p>
```

This will be parsed as `␣␣text␣␣`.
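
Resolving escaped whitespace comes down to decoding numeric character references. The following is a minimal standard-library sketch of that step, assuming a hypothetical helper `decode_char_refs`; it is more lenient than a conforming parser and is not taken from *roxmltree*.

```rust
/// Resolves numeric character references (`&#xHH;` and `&#DD;`).
/// Anything that fails to decode is copied through as-is.
fn decode_char_refs(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("&#") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        // Try to read `xHH;` or `DD;` and turn it into a character.
        let decoded = after.find(';').and_then(|end| {
            let body = &after[..end];
            let code = match body.strip_prefix('x') {
                Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                None => body.parse::<u32>().ok()?,
            };
            char::from_u32(code).map(|c| (c, end))
        });
        match decoded {
            Some((c, end)) => {
                out.push(c);
                rest = &after[end + 1..];
            }
            None => {
                out.push_str("&#");
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}
```

For the example above, `decode_char_refs("&#x20;&#x20;text&#x20;&#x20;")` returns the text with two leading and two trailing spaces, none of which are trimmed.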

## CDATA

CDATA will be merged into the surrounding text node:

```xml
<p>t<![CDATA[e&#x20;]]>&#x20;x<![CDATA[t]]></p>
```

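This will be parsed as `te&#x20; xt`: the character reference inside CDATA is kept
verbatim, while the one outside is resolved to a space.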
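
One way to picture the merge is as a concatenation of content chunks, where only ordinary text has its references resolved. The sketch below assumes a hypothetical `Chunk` type and `merge_chunks` function, and handles only `&#x20;` to stay short; it is an illustration, not the library's internal representation.

```rust
/// A piece of element content as it appears in the source.
enum Chunk<'a> {
    /// Ordinary character data; references in it are resolved.
    Text(&'a str),
    /// The body of a `<![CDATA[...]]>` section; taken verbatim.
    CData(&'a str),
}

/// Concatenates adjacent chunks into the value of a single text node.
fn merge_chunks(chunks: &[Chunk<'_>]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        match chunk {
            // Only `&#x20;` is handled here to keep the sketch short.
            Chunk::Text(t) => out.push_str(&t.replace("&#x20;", " ")),
            // CDATA content is never unescaped.
            Chunk::CData(c) => out.push_str(c),
        }
    }
    out
}
```

Feeding it `Text("t")`, `CData("e&#x20;")`, `Text("&#x20;x")`, `CData("t")` produces `te&#x20; xt`, matching the example.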

## Text

Text will be unescaped. All entity references will be resolved.

```xml
<!DOCTYPE test [
    <!ENTITY b 'Some&#x20;text'>
]>
<p>&b;</p>
```

This will be parsed as `Some text`.
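
Unescaping also covers the five entity references that XML predefines (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&apos;`). A minimal sketch of that part, using a hypothetical `unescape_predefined` helper built on plain string replacement rather than the parser's real logic:

```rust
/// Replaces the five predefined XML entity references with their characters.
fn unescape_predefined(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        // `&amp;` goes last so that `&amp;lt;` yields `&lt;`, not `<`.
        .replace("&amp;", "&")
}
```

The ordering matters: resolving `&amp;` first would let its output be unescaped a second time.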

## Attribute-Value Normalization

[Attribute-Value Normalization](https://www.w3.org/TR/xml/#AVNormalize) works
as explained in the spec.
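
As a rough illustration of the spec's rule for literal whitespace, the sketch below replaces each tab and line end in a raw attribute value with a single space. The function name `normalize_attr_whitespace` is hypothetical, and the sketch deliberately skips references: per the spec, a character reference such as `&#xA;` contributes the referenced character itself and is not turned into a space.

```rust
/// Applies the literal-whitespace step of attribute-value normalization:
/// line ends are normalized first (`\r\n` and a lone `\r` count as one
/// line end), then every tab or line end becomes a single space.
fn normalize_attr_whitespace(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Treat `\r\n` as one line end, so it yields one space.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push(' ');
            }
            '\n' | '\t' => out.push(' '),
            other => out.push(other),
        }
    }
    out
}
```

Existing runs of spaces are left alone: this step does not collapse or trim whitespace.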

## Namespaces resolving

*roxmltree* has complete support for XML namespaces.