# Comma-Separated Values (CSV) parser tutorial

A tutorial for building a parser for the well-known CSV format.

---

In this tutorial we will see how to make a parser for a simple data interchange
format, CSV (Comma-Separated Values).
CSV is a textual format for tabular data interchange. It is described by
[RFC 4180](https://tools.ietf.org/html/rfc4180).


[Here](https://en.wikipedia.org/wiki/Comma-separated_values) is an example of
a CSV file:

```csv
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
```

Although there is a [csv module](https://docs.python.org/3/library/csv.html) in
the Python standard library, this example was chosen because CSV is ubiquitous
and easy to understand, so it is a good starting point for learning Arpeggio.

## The grammar

Let's start by creating a Python module called `csv.py`.

Now, let's define the CSV grammar.

- A CSV file consists of one or more records or newlines, followed by the
  End-Of-File. A Python list inside `OneOrMore` is interpreted as an [Ordered
  Choice](../grammars.md#grammars-written-in-python).

        def csvfile():    return OneOrMore([record, '\n']), EOF

- Each record consists of fields separated by commas.

        def record():     return field, ZeroOrMore(",", field)

- Each field may be quoted or not.

        def field():      return [quoted_field, field_content]

- Field content is everything until a newline or a comma.

        def field_content():            return _(r'([^,\n])+')

      We use a regular expression to match everything that is not a comma or a
      newline.

- A quoted field starts and ends with double quotes.

        def quoted_field():             return '"', field_content_quoted, '"'

- Quoted field content is defined as:

        def field_content_quoted():     return _(r'(("")|([^"]))+')

      Quoted field content is defined with a regular expression that matches
      everything up to the closing double quote. A double quote inside the data
      must be escaped by doubling it (`""`).
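
The two regular expressions can be tried on their own with Python's `re` module
before they are wired into a grammar. The snippet below is a minimal sketch that
assumes a regex rule is matched starting at the current input position; the
sample line and the variable names are made up for illustration.

```python
import re

# The same patterns used by field_content and field_content_quoted.
FIELD_CONTENT = re.compile(r'([^,\n])+')
FIELD_CONTENT_QUOTED = re.compile(r'(("")|([^"]))+')

# Hypothetical sample line, not taken from the tutorial's test data.
line = 'plain text,"say ""hi"", please",42\n'

# An unquoted field stops at the first comma or newline.
unquoted = FIELD_CONTENT.match(line).group(0)

# A quoted field: start matching right after the opening double quote.
start = line.index('"') + 1
quoted_match = FIELD_CONTENT_QUOTED.match(line, start)
quoted = quoted_match.group(0)

print(repr(unquoted))                  # 'plain text'
print(repr(quoted))                    # 'say ""hi"", please'
# The character right after the match is the closing double quote.
print(repr(line[quoted_match.end()]))  # '"'
```

Note that the quoted pattern happily consumes the comma inside the quotes, which
is exactly why quoted fields are tried before unquoted ones in the `field` rule.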


The whole content of the `csv.py` file so far should be:

    from arpeggio import *
    from arpeggio import RegExMatch as _

    # This is the CSV grammar
    def record():                   return field, ZeroOrMore(",", field)
    def field():                    return [quoted_field, field_content]
    def quoted_field():             return '"', field_content_quoted, '"'
    def field_content():            return _(r'([^,\n])+')
    def field_content_quoted():     return _(r'(("")|([^"]))+')
    def csvfile():                  return OneOrMore([record, '\n']), EOF


## The parser

Let's instantiate the parser. To catch newlines in the `csvfile` rule we must
tell Arpeggio not to treat newlines as whitespace, i.e. not to skip over them.
That way we can handle them explicitly, as we do in the `csvfile` rule. To do
so we use the `ws` parameter of the parser constructor to redefine what is
considered whitespace. You can find more information
[here](../configuration.md#white-space-handling).

After the grammar in `csv.py`, instantiate the parser:

    parser = ParserPython(csvfile, ws='\t ')

So, whitespace will be a tab or a space character. Newline will be treated as a
regular character. We give the grammar root rule to `ParserPython`. In this
example it is the `csvfile` function.

`parser` now refers to the parser object capable of parsing CSV inputs.
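
To picture what the `ws` parameter changes, here is a rough stand-alone
illustration of whitespace skipping. The `skip_ws` helper is hypothetical and
only mimics the idea; it is not Arpeggio's implementation, and the "default"
whitespace set used for comparison (tab, newline, carriage return, space) is an
assumption.

```python
def skip_ws(text, pos, ws):
    """Return the first position at or after `pos` whose character is not in `ws`."""
    while pos < len(text) and text[pos] in ws:
        pos += 1
    return pos


text = "a, \n b"
after_comma = text.index(",") + 1

# With newline counted as whitespace, skipping runs past the line break...
print(skip_ws(text, after_comma, "\t\n\r "))   # 5
# ...with ws='\t ' it stops at the newline, so a grammar rule can match it.
print(skip_ws(text, after_comma, "\t "))       # 3
```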


## Parsing

Let's parse an example CSV string.

Create a file `test_data.csv` with the following content:

    Unquoted test, "Quoted test", 23234, One Two Three, "343456.45"

    Unquoted test 2, "Quoted test with ""inner"" quotes", 23234, One Two Three, "34312.7"
    Unquoted test 3, "Quoted test 3", 23234, One Two Three, "343486.12"

In the `csv.py` file write:

```python
# Read the whole test file into a string, closing the file afterwards.
with open('test_data.csv', 'r') as f:
    test_data = f.read()
parse_tree = parser.parse(test_data)
```

`test_data` is a Python string containing the test CSV data from the file.
Calling `parser.parse` on the data produces the [parse tree](../parse_trees.md).

If you run the `csv.py` module and there are no syntax errors in the
`test_data.csv` file, `parse_tree` will refer to the parse tree of the test CSV
data.

```bash
$ python csv.py
```

**Congratulations!! You have successfully parsed a CSV file.**

This parse tree is [visualized](../debugging.md#visualization) below (Tip: The
image is large. Click on it to open it in a separate tab so that you can zoom
in):


<a href="../../images/csvfile_parse_tree.dot.png" target="_blank"><img src="../../images/csvfile_parse_tree.dot.png"/></a>


!!! note
    To visualize the grammar (aka the parser model) and the parse tree,
    instantiate the parser in debug mode.

        parser = ParserPython(csvfile, ws='\t ', debug=True)

    Then transform the generated `dot` files to images.
    See more [here](../debugging.md#visualization).


## Defining grammar using PEG notation

Now, let's try the same but using [textual PEG
notation](../grammars.md#grammars-written-in-peg-notations) for the grammar
definition.

We shall repeat the process above, but this time we shall encode the rules in
PEG. We shall use the clean PEG variant (`arpeggio.cleanpeg` module).

First, create a text file `csv.peg` to store the grammar.

- A CSV file consists of one or more records or newlines, followed by the
  End-Of-File.

        csvfile = (record / r'\n')+ EOF

- Each record consists of fields separated by commas.

        record = field ("," field)*

- Each field may be quoted or not.

        field = quoted_field / field_content

- Field content is everything until a newline or a comma.

        field_content = r'([^,\n])+'

      We use a regular expression to match everything that is not a comma or a
      newline.

- A quoted field starts and ends with double quotes.

        quoted_field = '"' field_content_quoted '"'

- Quoted field content is defined as:

        field_content_quoted = r'(("")|([^"]))+'

      Quoted field content is defined with a regular expression that matches
      everything up to the closing double quote. A double quote inside the data
      must be escaped by doubling it (`""`).


The whole grammar (i.e. the contents of `csv.peg` file) is:

      csvfile = (record / r'\n')+ EOF
      record = field ("," field)*
      field = quoted_field / field_content
      field_content = r'([^,\n])+'
      quoted_field = '"' field_content_quoted '"'
      field_content_quoted = r'(("")|([^"]))+'
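
To see what the PEG operators mean operationally, the grammar can be mirrored by
a small hand-written parser that uses only the standard library. This is a
sketch of the idea and not how Arpeggio works internally: every function name is
hypothetical, whitespace handling is simplified to skipping tabs and spaces
before each match, and errors are reported with a plain `ValueError`.

```python
import re

FIELD_CONTENT = re.compile(r'([^,\n])+')
FIELD_CONTENT_QUOTED = re.compile(r'(("")|([^"]))+')
WS = '\t '


def skip_ws(text, pos):
    while pos < len(text) and text[pos] in WS:
        pos += 1
    return pos


def quoted_field(text, pos):
    # quoted_field = '"' field_content_quoted '"'
    pos = skip_ws(text, pos)
    if not text.startswith('"', pos):
        return None
    m = FIELD_CONTENT_QUOTED.match(text, pos + 1)
    if m is None or not text.startswith('"', m.end()):
        return None
    return m.group(0), m.end() + 1


def field_content(text, pos):
    # field_content = r'([^,\n])+'
    m = FIELD_CONTENT.match(text, skip_ws(text, pos))
    return (m.group(0), m.end()) if m else None


def field(text, pos):
    # field = quoted_field / field_content   (ordered choice: first match wins)
    return quoted_field(text, pos) or field_content(text, pos)


def record(text, pos):
    # record = field ("," field)*
    first = field(text, pos)
    if first is None:
        return None
    fields, pos = [first[0]], first[1]
    while True:
        comma = skip_ws(text, pos)
        nxt = field(text, comma + 1) if text.startswith(',', comma) else None
        if nxt is None:
            return fields, pos      # stop repeating; nothing more is consumed
        fields.append(nxt[0])
        pos = nxt[1]


def csvfile(text):
    # csvfile = (record / r'\n')+ EOF
    pos, rows, matched = 0, [], 0
    while True:
        rec = record(text, pos)
        newline = skip_ws(text, pos)
        if rec is not None:
            rows.append(rec[0])
            pos = rec[1]
        elif text.startswith('\n', newline):
            pos = newline + 1
        else:
            break
        matched += 1
    if matched == 0 or skip_ws(text, pos) != len(text):
        raise ValueError(f'parse error at position {pos}')
    return rows


print(csvfile('a, "b ""c"""\n\n1,2\n'))   # [['a', 'b ""c""'], ['1', '2']]
```

Each grammar rule becomes one function, `/` becomes "try the first alternative,
then the second", and `*` and `+` become loops. Arpeggio builds the equivalent
machinery from the grammar, so none of this has to be written by hand.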


Now we shall create a `csv_peg.py` file to instantiate our parser and parse
inputs. This time we shall instantiate a different parser class (`ParserPEG`).
The whole content of `csv_peg.py` should be:

```python
from arpeggio.cleanpeg import ParserPEG

# Load the textual grammar, closing the file afterwards.
with open('csv.peg', 'r') as f:
    csv_grammar = f.read()
parser = ParserPEG(csv_grammar, 'csvfile', ws='\t ')
```

Here we load the grammar from the `csv.peg` file and construct the parser using
the `ParserPEG` class.

The rest of the code is the same as in `csv.py`. We load `test_data.csv` and
call `parser.parse` on it to produce the parse tree.

To verify that everything works without errors, execute the `csv_peg.py` module.

```bash
$ python csv_peg.py
```

If we put the parser in debug mode and generate the parse tree image, we can
verify that we get the same parse tree regardless of the grammar specification
approach we use.

To put the parser in debug mode, add `debug=True` to the parser parameters.

```python
parser = ParserPEG(csv_grammar, 'csvfile', ws='\t ', debug=True)
```


## Extract data

Our main goal is to extract data from the `csv` file.

The parse tree we get as a result of parsing is not very useful on its own.
We need to transform it to some other data structure that we can use.

First, let's define the target data structure we want to get.

Since `csv` consists of a list of records, where each record consists of fields,
we shall construct a Python list of lists:

      [
        [field1, field2, field3, ...],  # First row
        [field1, field2, field3, ...],  # Second row
        [...],  # ...
        ...
      ]

To construct this list of lists we could process the parse tree by navigating
its nodes and building the required target data structure.
But it is easier to use Arpeggio's support for [semantic analysis - Visitor
Pattern](../semantics.md).
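
The idea behind the visitor can be shown with a toy example that does not use
Arpeggio at all. In this sketch the tree is a nested `(rule_name, payload)`
tuple and `ToyVisitor` is a hypothetical class; the real parse tree nodes and
default behaviour are richer, so treat this only as an illustration of the
bottom-up dispatch to `visit_<rule_name>` methods.

```python
class ToyVisitor:
    """Bottom-up visitor over a toy tree of (rule_name, children_or_text)."""

    def visit(self, node):
        rule_name, payload = node
        if isinstance(payload, str):      # terminal: just the matched text
            return payload
        # Visit children first so each method receives already-built results.
        children = [self.visit(child) for child in payload]
        method = getattr(self, 'visit_' + rule_name, None)
        if method is None:
            # Simplified default: unwrap a single child, else keep the list.
            return children[0] if len(children) == 1 else children
        return method(node, children)

    def visit_record(self, node, children):
        return list(children)

    def visit_csvfile(self, node, children):
        return [row for row in children if row != '\n']


# Hand-built toy tree; separators are left out to keep it short.
tree = ('csvfile', [
    ('record', [('field', [('field_content', 'a')]),
                ('field', [('field_content', 'b')])]),
    ('newline', '\n'),
    ('record', [('field', [('field_content', 'c')])]),
])
print(ToyVisitor().visit(tree))   # [['a', 'b'], ['c']]
```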

Let's make a Visitor for CSV that will build our list of lists.

```python
class CSVVisitor(PTNodeVisitor):
    def visit_record(self, node, children):
        # record is a list of fields. The children nodes are fields so just
        # transform it to python list.
        return list(children)

    def visit_csvfile(self, node, children):
        # We are not interested in empty lines so we will filter them.
        return [x for x in children if x != '\n']
```

and apply this visitor to the parse tree:

```python
csv_content = visit_parse_tree(parse_tree, CSVVisitor())
```

Now if we pretty-print `csv_content` we can see that it is exactly what we wanted:

```python
[   [   u'Unquoted test',
        u'Quoted test',
        u'23234',
        u'One Two Three',
        u'343456.45'],
    [   u'Unquoted test 2',
        u'Quoted test with ""inner"" quotes',
        u'23234',
        u'One Two Three',
        u'34312.7'],
    [   u'Unquoted test 3',
        u'Quoted test 3',
        u'23234',
        u'One Two Three',
        u'343486.12']]
```
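
As a cross-check, the same test data can be read with the standard `csv` module
and compared with the structure above. This is a minimal sketch: the data is
inlined as a string instead of being read from `test_data.csv`, and it assumes
the snippet is run from a directory that does not contain the tutorial's own
`csv.py`, which would otherwise shadow the standard module on import.

```python
import csv
import io

test_data = (
    'Unquoted test, "Quoted test", 23234, One Two Three, "343456.45"\n'
    '\n'
    'Unquoted test 2, "Quoted test with ""inner"" quotes", 23234, One Two Three, "34312.7"\n'
    'Unquoted test 3, "Quoted test 3", 23234, One Two Three, "343486.12"\n'
)

# skipinitialspace=True ignores the blank after each comma, so the double
# quote that follows is recognized as the start of a quoted field.
reader = csv.reader(io.StringIO(test_data), skipinitialspace=True)
rows = [row for row in reader if row]   # blank lines come back as []

for row in rows:
    print(row)
```

One difference is worth noting: the standard reader collapses the escaped `""`
into a single `"`, while our visitor keeps the field text exactly as it was
matched.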

But there is more we can do. If we look at our data, we can see that some fields
are numeric but end up as strings in our target structure. Let's convert them to
Python ints or floats. To do this conversion we will introduce a `visit_field`
method in our `CSVVisitor` class.

```python
class CSVVisitor(PTNodeVisitor):
  ...
  def visit_field(self, node, children):
      value = children[0]
      # Try int first: float() also accepts whole numbers such as '23234',
      # so trying float first would never produce an int.
      try:
          return int(value)
      except ValueError:
          pass
      try:
          return float(value)
      except ValueError:
          return value
  ...
```

If we pretty-print `csv_content` now, we can see that numeric values are no
longer strings but proper Python types.

```python
[   [u'Unquoted test', u'Quoted test', 23234, u'One Two Three', 343456.45],
    [   u'Unquoted test 2',
        u'Quoted test with ""inner"" quotes',
        23234,
        u'One Two Three',
        34312.7],
    [   u'Unquoted test 3',
        u'Quoted test 3',
        23234,
        u'One Two Three',
        343486.12]]
```
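
The conversion logic can also be pulled into a plain function and tried without
a parser. The sketch below tries `int` before `float` so that whole numbers stay
integers; the name `convert` is hypothetical and not part of the example code.

```python
def convert(value):
    """Return value as int if possible, else as float, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


print([convert(v) for v in ['23234', '343456.45', 'One Two Three', '1e3']])
# [23234, 343456.45, 'One Two Three', 1000.0]
```

Keep in mind that `float` also accepts strings such as `nan` and `inf`, so a
text field with that exact content would be converted as well. If that matters
for your data, a stricter check would be needed.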


This example code can be found [here](https://github.com/textX/Arpeggio/tree/master/examples/csv).