# ua-parser Specification
Version 0.2 Draft
This document specifies how a parser must interpret the `regexes.yaml` file in order to parse user-agent strings correctly on the basis of that file.
This specification is intended to help maintainers and contributors use the information in the `regexes.yaml` file correctly when extracting information from user-agent strings. It also aims to serve as a basis for discussions on evolving the project and the required parsing algorithms.
This document does not describe how to deploy the ua-parser project on your server, nor how to retrieve the user-agent string for further processing.
# `regexes.yaml`
A user-agent string may carry information about:
* User-Agent aka “the browser”
* OS (Operating System) the User-Agent currently uses (or runs on)
* Device, i.e. the physical device the User-Agent is running on
This information is provided within the `regexes.yaml` file. Each kind of information requires a different parser which extracts the related type. These are:
* `user_agent_parsers`
* `os_parsers`
* `device_parsers`
Each parser contains a list of regular expressions, each given under the key `regex`. For each `regex`, parser-specific replacements can be named to assign or change information. A replacement may refer to a match from the regular expression, which is captured by a sub-expression enclosed in parentheses `"()"`. Each captured match can be addressed with `$1` to `$9` and used in a parser-specific replacement.
**TODO**: Provide some insights into the used chars. E.g. escape `"."` as `"\."` and `"("` as `"\("`. `"/"` does not need to be escaped.
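As a brief illustration of the escaping concern noted above, the following sketch shows why `"."` should be escaped and `"/"` need not be. It assumes Python's `re` module purely for demonstration (this specification does not prescribe a regex engine), and the helper name `matches` is hypothetical.

```python
import re

# Unescaped "." matches any single character, so this also accepts "Win3x1".
loose = re.compile(r"Win(3.1)")
# Escaped "\." matches only a literal dot.
strict = re.compile(r"Win(3\.1)")
# "/" has no special meaning in the pattern and needs no escaping.
slash = re.compile(r"Minefield/(\d+)")


def matches(pattern, text):
    """Return True if `pattern` is found anywhere in `text`."""
    return pattern.search(text) is not None
```

Under these assumptions, `matches(loose, "Win3x1")` is true while `matches(strict, "Win3x1")` is false, which is the kind of false positive that escaping avoids.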
## `user_agent_parsers`
The `user_agent_parsers` return the `family` of the User-Agent.
If available, the version information for that `family` may be extracted as well.
Here major, minor and patch version information can be addressed or overwritten.
| match in regex | default replacement | placeholder in replacement | note |
| ---- | ------------------- | ---- | --------------------------------------- |
| 1 | family_replacement | $1 | specifies the User-Agents family |
| 2 | v1_replacement | $2 | major version number/info of the family |
| 3 | v2_replacement | $3 | minor version number/info of the family |
| 4 | v3_replacement | $4 | patch version number/info of the family |
If no replacement is specified, the association is given by the order of the matches. If the `regex` contains no first match (a capture group within parentheses), the `family_replacement` shall be specified!
To overwrite a value, the corresponding replacement needs to be named for the `regex`-item.
**Parser Implementation:**
The list of regular-expressions `regex` shall be evaluated for a given user-agent string beginning with the first `regex`-item in the list to the last item. The first matching `regex` stops processing the list. Regex-matching shall be case sensitive.
If no replacement for a match is specified for a `regex`-item, the first match defines the `family`, the second `major`, the third `minor` and the fourth `patch` information.
If a `*_replacement` string is specified, it shall overwrite or replace the match.
The following placeholders insert the matched characters within the respective replacement:
* `family_replacement`: `$1`
* `v1_replacement`: `$2`
* `v2_replacement`: `$3`
* `v3_replacement`: `$4`
If no matching `regex` is found the value for `family` shall be “Other”. The version information `major`, `minor` and `patch` shall not be defined.
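The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the rules are held in a hypothetical in-memory list `UA_RULES` rather than loaded from `regexes.yaml`, the second rule is made up for demonstration, the function and helper names are assumptions, and "not defined" is represented as `None`.

```python
import re

# Hypothetical in-memory form of two `user_agent_parsers` entries.
UA_RULES = [
    {"regex": r"(Namoroka|Shiretoko|Minefield)/(\d+)\.(\d+)\.(\d+(?:pre)?)",
     "family_replacement": "Firefox ($1)"},
    {"regex": r"(Firefox)/(\d+)\.(\d+)"},  # made-up rule without replacements
]


def _field(rule, key, match, index):
    """Use the `key` replacement if the rule names one, else group `index`."""
    captured = match.group(index) if index <= match.re.groups else None
    if key in rule:
        # Only the placeholder belonging to this field is substituted.
        return rule[key].replace("$%d" % index, captured or "")
    return captured


def parse_user_agent(ua_string, rules=UA_RULES):
    for rule in rules:                               # evaluate in list order
        match = re.search(rule["regex"], ua_string)  # case sensitive
        if match is None:
            continue
        return {                                     # first match wins
            "family": _field(rule, "family_replacement", match, 1),
            "major": _field(rule, "v1_replacement", match, 2),
            "minor": _field(rule, "v2_replacement", match, 3),
            "patch": _field(rule, "v3_replacement", match, 4),
        }
    # No rule matched: family is "Other", version fields stay undefined.
    return {"family": "Other", "major": None, "minor": None, "patch": None}
```

Returning on the first match keeps the list order significant, so more specific rules should precede more general ones.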
**Example:**
For the User-Agent: `Mozilla/5.0 (Windows; Windows NT 5.1; rv:2.0b3pre) Gecko/20100727 Minefield/4.0.1pre`
the matching `regex`:
```
- regex: '(Namoroka|Shiretoko|Minefield)/(\d+)\.(\d+)\.(\d+(?:pre)?)'
  family_replacement: 'Firefox ($1)'
```
resolves to:
```
family: Firefox (Minefield)
major : 4
minor : 0
patch : 1pre
```
## `os_parsers`
The `os_parsers` return the `os` family of the Operating System (OS) the User-Agent runs on.
If available, the version information for that `os` may be extracted as well.
Here major, minor, patch and patchMinor version information can be addressed or overwritten.
| match in regex | default replacement | placeholder in replacement | note |
| ---- | ----------------- | ---- | ---------------------------------------- |
| 1 | os_replacement | $1 | specifies the OS |
| 2 | os_v1_replacement | $2 | major version number/info of OS |
| 3 | os_v2_replacement | $3 | minor version number/info of the OS |
| 4 | os_v3_replacement | $4 | patch version number/info of the OS |
| 5 | os_v4_replacement | $5 | patchMinor version number/info of the OS |
If no replacement is specified, the association is given by the order of the matches. If the `regex` contains no first match (a capture group within parentheses), the `os_replacement` shall be specified!
To overwrite a value, the corresponding replacement needs to be named for the `regex`-item.
**Parser Implementation:**
The list of regular-expressions `regex` shall be evaluated for a given user-agent string beginning with the first `regex`-item in the list to the last item. The first matching `regex` stops processing the list. Regex-matching shall be case sensitive.
If no replacement for a match is specified for a `regex`-item, the first match defines the `os` family, the second `major`, the third `minor`, the fourth `patch` and the fifth `patchMinor` version information.
If a `*_replacement` string is specified, it shall overwrite or replace the match.
The following placeholders insert the matched characters within the respective replacement:
* `os_replacement`: `$1`
* `os_v1_replacement`: `$2`
* `os_v2_replacement`: `$3`
* `os_v3_replacement`: `$4`
* `os_v4_replacement`: `$5`
In case that no matching `regex` is found the value for `os` shall be “Other”. The version information `major`, `minor`, `patch` and `patchMinor` shall not be defined.
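A table-driven variant of these rules might look like the sketch below. It is only an illustration under stated assumptions: `OS_FIELDS`, `OS_RULES` and `parse_os` are hypothetical names, the single rule is held in memory instead of being read from `regexes.yaml`, and undefined values are represented as `None`.

```python
import re

# (output field, replacement key, capture group index)
OS_FIELDS = [
    ("os", "os_replacement", 1),
    ("major", "os_v1_replacement", 2),
    ("minor", "os_v2_replacement", 3),
    ("patch", "os_v3_replacement", 4),
    ("patchMinor", "os_v4_replacement", 5),
]

# Hypothetical in-memory form of one `os_parsers` entry.
OS_RULES = [
    {"regex": r"Win(95|98|3.1|NT|ME|2000)", "os_replacement": "Windows $1"},
]


def parse_os(ua_string, rules=OS_RULES):
    for rule in rules:                               # evaluate in list order
        match = re.search(rule["regex"], ua_string)  # case sensitive
        if match is None:
            continue
        result = {}
        for field, key, index in OS_FIELDS:
            # Groups the regex does not define leave the field undefined.
            captured = match.group(index) if index <= match.re.groups else None
            if key in rule:
                result[field] = rule[key].replace("$%d" % index, captured or "")
            else:
                result[field] = captured
        return result                                # first match wins
    return {"os": "Other", "major": None, "minor": None,
            "patch": None, "patchMinor": None}
```

Keeping the field-to-group mapping in one table makes the fifth `patchMinor` field the only structural difference from the user-agent case.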
**Example:**
For the User-Agent: `Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.1) Gecko/20020826`
the matching `regex`:
```
- regex: 'Win(95|98|3.1|NT|ME|2000)'
  os_replacement: 'Windows $1'
```
resolves to:
```
os: Windows 95
```
## `device_parsers`
The `device_parsers` return the device `family` the User-Agent runs on.
Furthermore, the `brand` and `model` of the device can be specified.
`brand` names the manufacturer of the device, while `model` specifies the model of the device.
| match in regex | default replacement | placeholder in replacement | note |
| ---- | ------------------ | ------- | ---------------------------------------- |
| 1 | device_replacement | $1...$9 | specifies the device family |
| any | brand_replacement | $1...$9 | specifies the device brand |
| 1 | model_replacement | $1...$9 | specifies the device model |
If no replacement is specified, the association is given by the order of the matches.
If the `regex` contains no first match (a capture group within parentheses), the `device_replacement` together with the `model_replacement` shall be specified!
To overwrite a value, the corresponding replacement needs to be named for the given `regex`.
For the `device_parsers`, some `regex` items require case-insensitive matching (e.g. Generic Feature Phones). To distinguish these from the case-sensitive default, the value `regex_flag: 'i'` indicates that matching shall be case insensitive for that regular expression.
**Parser Implementation:**
The list of regular-expressions `regex` shall be evaluated for a given user-agent string beginning with the first `regex`-item in the list to the last item. The first matching `regex` stops processing the list. Regex-matching shall be case sensitive unless the `regex`-item specifies `regex_flag: 'i'`.
If no replacement for a match is given, the first match defines the `family` and the `model`.
If a `*_replacement` string is specified, it shall overwrite or replace the match.
The placeholders `$1` to `$9` can be used to insert the matched characters from the regex into the replacement string.
In case that no matching `regex` is found the value for `family` shall be “Other”. `brand` and `model` shall not be defined.
Leading and trailing whitespace shall be trimmed from the result.
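The device rules differ from the other parsers in three ways: any of `$1` to `$9` may appear in any replacement, `regex_flag: 'i'` switches to case-insensitive matching, and results are trimmed. A minimal Python sketch of these points follows. All names (`DEVICE_RULES`, `_expand`, `_resolve`, `parse_device`) are hypothetical, the second rule is made up solely to show the flag, and expanding a placeholder for a missing group to an empty string is an assumption, not something this specification states.

```python
import re

# Hypothetical in-memory form of two `device_parsers` entries.
DEVICE_RULES = [
    {"regex": r"; *(PEDI)_(PLUS)_(W) Build",
     "device_replacement": "Odys $1 $2 $3",
     "brand_replacement": "Odys",
     "model_replacement": "$1 $2 $3"},
    # Made-up rule, only to illustrate `regex_flag: 'i'`.
    {"regex": r"(generic phone)", "regex_flag": "i"},
]

_PLACEHOLDER = re.compile(r"\$([1-9])")


def _expand(template, match):
    """Substitute $1..$9 with captured groups, then trim whitespace."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return ""  # assumption: missing groups expand to nothing
        return match.group(index) or ""
    return _PLACEHOLDER.sub(substitute, template).strip()


def _resolve(rule, key, match, default):
    """Use the `key` replacement if the rule names one, else `default`."""
    return _expand(rule[key], match) if key in rule else default


def parse_device(ua_string, rules=DEVICE_RULES):
    for rule in rules:                               # evaluate in list order
        flags = re.IGNORECASE if rule.get("regex_flag") == "i" else 0
        match = re.search(rule["regex"], ua_string, flags)
        if match is None:
            continue
        first = None
        if match.re.groups >= 1 and match.group(1) is not None:
            first = match.group(1).strip()
        return {                                     # first match wins
            "family": _resolve(rule, "device_replacement", match, first),
            "brand": _resolve(rule, "brand_replacement", match, None),
            "model": _resolve(rule, "model_replacement", match, first),
        }
    return {"family": "Other", "brand": None, "model": None}
```

With the made-up second rule, a string containing `GENERIC PHONE` matches despite the differing case, and the first match supplies both `family` and `model`.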
**Example:**
For the User-Agent: `Mozilla/5.0 (Linux; U; Android 4.2.2; de-de; PEDI_PLUS_W Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30`
the matching `regex`:
```
- regex: '; *(PEDI)_(PLUS)_(W) Build'
  device_replacement: 'Odys $1 $2 $3'
  brand_replacement: 'Odys'
  model_replacement: '$1 $2 $3'
```
resolves to:
```
family: 'Odys PEDI PLUS W'
brand: 'Odys'
model: 'PEDI PLUS W'
```
# Parser Output
To allow interoperability with code that builds upon ua-parser, it is recommended to provide the parser output in a standardized way. The output may follow the structure below, written in a notation based on [WebIDL](http://www.w3.org/TR/WebIDL/):
```
interface ua-parser-output {
  attribute string string; // The "user-agent" string
  object ua: { // The "user_agent_parsers" result
    attribute string family;
    attribute string major;
    attribute string minor;
    attribute string patch;
  };
  object os: { // The "os_parsers" result
    attribute string family;
    attribute string major;
    attribute string minor;
    attribute string patch;
    attribute string patchMinor;
  };
  object device: { // The "device_parsers" result
    attribute string family;
    attribute string brand;
    attribute string model;
  };
};
```
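For implementers working in Python, the structure above could be mirrored with dataclasses, as in the sketch below. The class names (`UserAgent`, `OS`, `Device`, `ParserOutput`) are hypothetical, and the defaults of `"Other"` and `None` are an assumption that reflects the no-match rules of the three parsers.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class UserAgent:                      # the "user_agent_parsers" result
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None


@dataclass
class OS:                             # the "os_parsers" result
    family: str = "Other"
    major: Optional[str] = None
    minor: Optional[str] = None
    patch: Optional[str] = None
    patchMinor: Optional[str] = None  # name kept as in the structure above


@dataclass
class Device:                         # the "device_parsers" result
    family: str = "Other"
    brand: Optional[str] = None
    model: Optional[str] = None


@dataclass
class ParserOutput:
    string: str                       # the "user-agent" string
    ua: UserAgent = field(default_factory=UserAgent)
    os: OS = field(default_factory=OS)
    device: Device = field(default_factory=Device)
```

Calling `asdict(ParserOutput(string=...))` yields a nested dictionary with the same keys as the structure above, which is convenient for JSON serialization.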