# Comma-Separated Values (CSV) parser tutorial
A tutorial for building a parser for the well-known CSV format.
---
In this tutorial we will see how to make a parser for a simple data interchange
format: CSV (Comma-Separated Values).
CSV is a textual format for tabular data interchange. It is described by
[RFC 4180](https://tools.ietf.org/html/rfc4180).
[Here](https://en.wikipedia.org/wiki/Comma-separated_values) is an example of
a CSV file:
```csv
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
```
Although there is a [csv module](https://docs.python.org/3/library/csv.html) in
the standard Python library, this example was chosen because CSV is ubiquitous
and easy to understand, so it is a good starter for learning Arpeggio.
## The grammar
Let's start by creating a Python module called `csv.py`.
Now, let's define the CSV grammar.
- A CSV file consists of one or more records or newlines, followed by the
  End-Of-File. A Python list inside `OneOrMore` is interpreted as an
  [Ordered Choice](../grammars.md#grammars-written-in-python).

        def csvfile(): return OneOrMore([record, '\n']), EOF

- Each record consists of fields separated by commas.

        def record(): return field, ZeroOrMore(",", field)

- Each field may be quoted or not.

        def field(): return [quoted_field, field_content]

- Field content is everything up to a newline or a comma.

        def field_content(): return _(r'([^,\n])+')

  We use a regular expression to match everything that is not a comma or a
  newline.

- A quoted field starts and ends with double quotes.

        def quoted_field(): return '"', field_content_quoted, '"'

- Quoted field content is defined as

        def field_content_quoted(): return _(r'(("")|([^"]))+')

  Quoted field content is defined with a regular expression that matches
  everything up to the closing double quote. A double quote inside the data
  must be escaped by doubling it (`""`).
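
The two regular expressions can be tried out on their own before they are wired into the grammar. The sketch below uses Python's `re` module directly (Arpeggio's `RegExMatch` is built on it). The helper name `match_prefix` and the sample strings are only illustrative and not part of the tutorial code.

```python
import re

# The same patterns as in the field_content and field_content_quoted rules.
FIELD_CONTENT = re.compile(r'([^,\n])+')
FIELD_CONTENT_QUOTED = re.compile(r'(("")|([^"]))+')


def match_prefix(pattern, text):
    """Return the part of `text` matched at its start, or None (hypothetical helper)."""
    m = pattern.match(text)
    return m.group(0) if m else None


# Unquoted content stops at the first comma or newline.
unquoted = match_prefix(FIELD_CONTENT, 'One Two Three, "343456.45"')

# Quoted content consumes doubled quotes and stops at a lone closing quote.
quoted = match_prefix(FIELD_CONTENT_QUOTED, 'with ""inner"" quotes", 23234')

# The unquoted pattern would also accept a field that starts with a quote.
greedy = match_prefix(FIELD_CONTENT, '"Quoted test", 23234')

print(repr(unquoted))
print(repr(quoted))
print(repr(greedy))
```

The last case suggests why `quoted_field` is listed before `field_content` in the `field` rule: in an ordered choice the first alternative that matches wins, and the unquoted pattern would otherwise swallow the quotes as ordinary characters.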
The whole content of the `csv.py` file until now should be:

    from arpeggio import *
    from arpeggio import RegExMatch as _

    # This is the CSV grammar
    def record(): return field, ZeroOrMore(",", field)
    def field(): return [quoted_field, field_content]
    def quoted_field(): return '"', field_content_quoted, '"'
    def field_content(): return _(r'([^,\n])+')
    def field_content_quoted(): return _(r'(("")|([^"]))+')
    def csvfile(): return OneOrMore([record, '\n']), EOF

## The parser
Let's instantiate the parser. In order to catch newlines in the `csvfile` rule
we must tell Arpeggio not to treat newlines as whitespace, i.e. not to skip
over them. This lets us handle them explicitly, as we do in the `csvfile` rule.
To do so we will use the `ws` parameter in parser construction to redefine what
is considered whitespace. You can find more information
[here](../configuration.md#white-space-handling).
After the grammar in `csv.py`, instantiate the parser:

    parser = ParserPython(csvfile, ws='\t ')

So, whitespace will be a tab character or a space, and a newline will be
treated as a regular character. We give the grammar root rule to
`ParserPython`. In this example it is the `csvfile` function.

`parser` now refers to a parser object capable of parsing CSV inputs.
## Parsing
Let's parse an example CSV string.
Create a file `test_data.csv` with the following content:

    Unquoted test, "Quoted test", 23234, One Two Three, "343456.45"
    Unquoted test 2, "Quoted test with ""inner"" quotes", 23234, One Two Three, "34312.7"
    Unquoted test 3, "Quoted test 3", 23234, One Two Three, "343486.12"

In the `csv.py` file write:
```python
# Read the whole file into a string; the file is closed when the block ends.
with open('test_data.csv', 'r') as f:
    test_data = f.read()
parse_tree = parser.parse(test_data)
```
`test_data` is a Python string containing the test CSV data from the file.
Calling `parser.parse` on the data will produce the
[parse tree](../parse_trees.md).
If you run the `csv.py` module and there are no syntax errors in the
`test_data.csv` file, `parse_tree` will be a reference to the
[parse tree](../parse_trees.md) of the test CSV data.
```bash
$ python csv.py
```
**Congratulations!! You have successfully parsed a CSV file.**
This parse tree is [visualized](../debugging.md#visualization) below (Tip: The
image is large. Click on it to see it in a separate tab and to be able to use
zooming):
<a href="../../images/csvfile_parse_tree.dot.png" target="_blank"><img src="../../images/csvfile_parse_tree.dot.png"/></a>
!!! note
    To visualize the grammar (aka parser model) and the parse tree,
    instantiate the parser in debug mode.

        parser = ParserPython(csvfile, ws='\t ', debug=True)

    Transform the generated `dot` files to images.
    See more [here](../debugging.md#visualization)

## Defining grammar using PEG notation
Now, let's try the same but using the [textual PEG
notation](../grammars.md#grammars-written-in-peg-notations) for the grammar
definition.
We shall repeat the process above, but we shall encode the rules in PEG.
We shall use the clean PEG variant (the `arpeggio.cleanpeg` module).
First, create a text file `csv.peg` to store the grammar.
- A CSV file consists of one or more records or newlines, followed by the
  End-Of-File.

        csvfile = (record / r'\n')+ EOF

- Each record consists of fields separated by commas.

        record = field ("," field)*

- Each field may be quoted or not.

        field = quoted_field / field_content

- Field content is everything up to a newline or a comma.

        field_content = r'([^,\n])+'

  We use a regular expression to match everything that is not a comma or a
  newline.

- A quoted field starts and ends with double quotes.

        quoted_field = '"' field_content_quoted '"'

- Quoted field content is defined as

        field_content_quoted = r'(("")|([^"]))+'

  Quoted field content is defined with a regular expression that matches
  everything up to the closing double quote. A double quote inside the data
  must be escaped by doubling it (`""`).
The whole grammar (i.e. the contents of the `csv.peg` file) is:

    csvfile = (record / r'\n')+ EOF
    record = field ("," field)*
    field = quoted_field / field_content
    field_content = r'([^,\n])+'
    quoted_field = '"' field_content_quoted '"'
    field_content_quoted = r'(("")|([^"]))+'

Now, we shall create a `csv_peg.py` file in order to instantiate our parser and
parse inputs. This time we shall instantiate a different parser class
(`ParserPEG`). The whole content of `csv_peg.py` should be:
```python
from arpeggio.cleanpeg import ParserPEG
# Load the textual grammar; 'csvfile' names the root rule.
with open('csv.peg', 'r') as f:
    csv_grammar = f.read()
parser = ParserPEG(csv_grammar, 'csvfile', ws='\t ')
```
Here we load the grammar from the `csv.peg` file and construct the parser using
the `ParserPEG` class.
The rest of the code is the same as in `csv.py`. We load `test_data.csv` and
call `parser.parse` on it to produce the parse tree.
To verify that everything works without errors, execute the `csv_peg.py` module.
```bash
$ python csv_peg.py
```
If we put the parser in debug mode and generate the parse tree image, we can
verify that we get the same parse tree regardless of which grammar
specification approach we use.
To put the parser in debug mode, add `debug=True` to the parser parameters.
```python
parser = ParserPEG(csv_grammar, 'csvfile', ws='\t ', debug=True)
```
## Extract data
Our main goal is to extract data from the `csv` file.
The parse tree we get as a result of parsing is not very useful on its own.
We need to transform it to some other data structure that we can use.
First, let's define the target data structure we want to get.
Since `csv` consists of a list of records, where each record consists of
fields, we shall construct a Python list of lists:

    [
        [field1, field2, field3, ...],   # First row
        [field1, field2, field3, ...],   # Second row
        [...],                           # ...
        ...
    ]

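
For comparison, Python's standard `csv` module yields the same list-of-lists shape. This is only a reference sketch and not part of the Arpeggio example. It makes a few assumptions: the sample data is inlined rather than read from `test_data.csv`, and `skipinitialspace=True` is used because the test data has a space after each comma. Unlike the grammar in this tutorial, the `csv` module also unescapes doubled quotes.

```python
import csv
import io

# Two rows taken from the tutorial's test data, inlined for illustration.
sample = (
    'Unquoted test, "Quoted test", 23234, One Two Three, "343456.45"\n'
    'Unquoted test 2, "Quoted test with ""inner"" quotes", 23234, '
    'One Two Three, "34312.7"\n'
)

# csv.reader yields one list of strings per record.
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

for row in rows:
    print(row)
```

Be aware that a script run from the directory that contains the tutorial's `csv.py` would import that file instead of the standard module, so this comparison is best run from elsewhere.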
To construct this list of lists we could process the parse tree by navigating
its nodes and building the required target data structure.
However, it is easier to use Arpeggio's support for [semantic analysis - Visitor
Pattern](../semantics.md).
Let's make a Visitor for CSV that will build our list of lists.
```python
class CSVVisitor(PTNodeVisitor):
    def visit_record(self, node, children):
        # record is a list of fields. The children nodes are fields so just
        # transform it to python list.
        return list(children)

    def visit_csvfile(self, node, children):
        # We are not interested in empty lines so we will filter them.
        return [x for x in children if x != '\n']
and apply this visitor to the parse tree:
```python
csv_content = visit_parse_tree(parse_tree, CSVVisitor())
```
Now if we pretty-print `csv_content` we can see that it is exactly what we wanted:
```python
[ [ u'Unquoted test',
u'Quoted test',
u'23234',
u'One Two Three',
u'343456.45'],
[ u'Unquoted test 2',
u'Quoted test with ""inner"" quotes',
u'23234',
u'One Two Three',
u'34312.7'],
[ u'Unquoted test 3',
u'Quoted test 3',
u'23234',
u'One Two Three',
u'343486.12']]
```
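
Note that the doubled quotes inside the quoted field are kept exactly as they appear in the input. If unescaped text is preferred, one option (an assumption, not part of the original example) is a small helper such as the hypothetical `unescape_quotes` below, which could be called from a `visit_field_content_quoted` method of the visitor.

```python
def unescape_quotes(value):
    """Collapse each doubled quote ("") of a quoted field into a single quote."""
    return value.replace('""', '"')


cleaned = unescape_quotes('Quoted test with ""inner"" quotes')
print(cleaned)
```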
But there is more we can do. If we look at our data we can see that some fields
are of numeric type but they end up as strings in our target structure. Let's
convert them to Python floats or ints. To do this conversion we will introduce
a `visit_field` method in our `CSVVisitor` class.
```python
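class CSVVisitor(PTNodeVisitor):
    ...
    def visit_field(self, node, children):
        value = children[0]
        # float() also accepts integer literals, so a field such as 23234
        # comes back as 23234.0.
        try:
            return float(value)
        except ValueError:
            pass
        try:
            return int(value)
        except ValueError:
            return value
    ...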
```
If we pretty-print `csv_content` now we can see that numeric values are no
longer strings but proper Python numeric types.
```python
[ [u'Unquoted test', u'Quoted test', 23234.0, u'One Two Three', 343456.45],
[ u'Unquoted test 2',
u'Quoted test with ""inner"" quotes',
23234.0,
u'One Two Three',
34312.7],
[ u'Unquoted test 3',
u'Quoted test 3',
23234.0,
u'One Two Three',
343486.12]]
```
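
Because `float()` also accepts integer literals, the visitor above turns `23234` into `23234.0`. If integers should stay integers, one possible variant tries `int()` first and falls back to `float()`. The standalone function `convert_field` is a hypothetical helper used here for illustration; its logic could be used as the body of `visit_field`.

```python
def convert_field(value):
    """Convert a field string to int, then float, else return it unchanged."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value


converted = [convert_field(v) for v in ['Unquoted test', '23234', '343456.45']]
print(converted)
```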
This example code can be found [here](https://github.com/textX/Arpeggio/tree/master/examples/csv).