<p align="center">
<img width="60%" src="https://raw.githubusercontent.com/alan-turing-institute/CleverCSV/eea72549195e37bd4347d87fd82bc98be2f1383d/.logo.png">
<br>
<a href="https://github.com/alan-turing-institute/CleverCSV/actions">
<img src="https://github.com/alan-turing-institute/CleverCSV/workflows/build/badge.svg" alt="GitHub Actions Build Status">
</a>
<a href="https://pypi.org/project/clevercsv/">
<img src="https://badge.fury.io/py/clevercsv.svg" alt="PyPI version">
</a>
<a href="https://clevercsv.readthedocs.io/en/latest/?badge=latest">
<img src="https://readthedocs.org/projects/clevercsv/badge/?version=latest" alt="Documentation Status">
</a>
<a href="https://pepy.tech/project/clevercsv">
<img src="https://pepy.tech/badge/clevercsv" alt="Downloads">
</a>
<a href="https://mybinder.org/v2/gh/alan-turing-institute/CleverCSVDemo/master?filepath=CSV_dialect_detection_with_CleverCSV.ipynb">
<img src="https://mybinder.org/badge_logo.svg" alt="Binder">
</a>
<a href="https://rdcu.be/bLVur">
<img src="https://img.shields.io/badge/DOI-10.1007%2Fs10618--019--00646--y-blue">
</a>
</p>
*CleverCSV provides a drop-in replacement for the Python* ``csv`` *package
with improved dialect detection for messy CSV files. It also provides a handy
command line tool that can standardize a messy file or generate Python code to
import it.*
**Useful links:**
- [CleverCSV on GitHub](https://github.com/alan-turing-institute/CleverCSV)
- [CleverCSV on PyPI](https://pypi.org/project/clevercsv/)
- [Demo of CleverCSV on Binder (interactive!)](https://mybinder.org/v2/gh/alan-turing-institute/CleverCSVDemo/master?filepath=CSV_dialect_detection_with_CleverCSV.ipynb)
- [Research Paper on CSV dialect detection
(PDF)](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf)
- [Reproducible Research Repo](https://github.com/alan-turing-institute/CSV_Wrangling/)
- [Blog post on messy CSV files](https://towardsdatascience.com/handling-messy-csv-files-2ef829aa441d)
- [Discussion
forum](https://github.com/alan-turing-institute/CleverCSV/discussions): a
place to ask questions and share ideas!

---

*Contents:* <a href="#quick-start"><b>Quick Start</b></a> | <a href="#introduction"><b>Introduction</b></a> | <a href="#installation"><b>Installation</b></a> | <a href="#usage"><b>Usage</b></a> | <a href="#python-library">Python Library</a> | <a href="#command-line-tool">Command-Line Tool</a> | <a href="#version-control-integration">Version Control Integration</a> | <a href="#contributing"><b>Contributing</b></a> | <a href="#notes"><b>Notes</b></a>

---

## Quick Start
[Click here](#introduction) to go to the introduction with more details about
CleverCSV. If you're in a hurry, below is a quick overview of how to get
started with the CleverCSV Python package and the command line interface.
For the Python package:
```python
# Import the package
>>> import clevercsv
# Load the file as a list of rows
# This uses the imdb.csv file in the examples directory
>>> rows = clevercsv.read_table('./imdb.csv')
# Load the file as a Pandas DataFrame
# Note that df = pd.read_csv('./imdb.csv') would fail here
>>> df = clevercsv.read_dataframe('./imdb.csv')
# Use CleverCSV as drop-in replacement for the Python CSV module
# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer
# Note that csv.Sniffer would fail here
>>> with open('./imdb.csv', newline='') as csvfile:
...     dialect = clevercsv.Sniffer().sniff(csvfile.read())
...     csvfile.seek(0)
...     reader = clevercsv.reader(csvfile, dialect)
...     rows = list(reader)
```
And for the command line interface:
```text
# Install the full version of CleverCSV (this includes the command line interface)
$ pip install clevercsv[full]
# Detect the dialect
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')
# Generate code to import the file
$ clevercsv code ./imdb.csv
import clevercsv
with open("./imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)
# Explore the CSV file as a Pandas DataFrame
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>> df
```
## Introduction
- CSV files are awesome! They are lightweight, easy to share, human-readable,
  version-controllable, and supported by many systems and tools!
- CSV files are terrible! They can have many different formats, multiple
  tables, headers or no headers, escape characters, and there's no support for
  recording metadata!

CleverCSV is a Python package that aims to solve some of the pain points of
CSV files, while maintaining many of the good things. The package
automatically detects (with high accuracy) the format (*dialect*) of CSV
files, thus making it easier to simply point to a CSV file and load it,
without the need for human inspection. In the future, we hope to solve some of
the other issues of CSV files too.
CleverCSV is [based on
science](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf).
We investigated thousands of real-world CSV files to find a robust way to
automatically detect the dialect of a file. This may seem like an easy
problem, but to a computer a CSV file is simply a long string, and every
dialect will give you *some* table. In CleverCSV we use a technique based on
the patterns of row lengths of the parsed file and the data type of the
resulting cells. With our method we achieve 97% accuracy for dialect
detection, with a 21% improvement on non-standard (*messy*) CSV files compared
to the Python standard library.
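
To make the point that every dialect yields *some* table more concrete, here is a small sketch that uses only the standard library. It parses the same made-up sample with three candidate delimiters and compares the resulting row lengths. The `toy_row_score` function is a deliberately simplified, hypothetical heuristic written for this illustration; it is not the measure CleverCSV uses, which also takes the data types of the cells into account.

```python
import csv
import io
from collections import Counter

# Semicolon-delimited sample with decimal commas (made-up data).
SAMPLE = "item;price\napple;1,50\npear;2,25\n"


def parse(text, delimiter):
    """Parse `text` with the standard csv module and one candidate delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def toy_row_score(rows):
    """Toy score: reward tables whose rows mostly share one length above 1."""
    lengths = Counter(len(row) for row in rows)
    common_length, count = lengths.most_common(1)[0]
    consistency = count / len(rows)
    # A single-column table is always "consistent", so scale it down to zero.
    return consistency * (common_length - 1) / common_length


scores = {}
for delimiter in [";", ",", "\t"]:
    rows = parse(SAMPLE, delimiter)
    scores[delimiter] = toy_row_score(rows)
    # Every candidate produces a table; only the row-length pattern differs.
    print(repr(delimiter), [len(row) for row in rows], round(scores[delimiter], 3))

best = max(scores, key=scores.get)
print("best candidate:", repr(best))
```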
We think this kind of work can be very valuable for working data scientists
and programmers, and we hope that you find CleverCSV useful (if there's a
problem, please open an issue!). Since the academic world counts citations,
please **cite CleverCSV if you use the package**. Here's a BibTeX entry you
can use:
```bib
@article{van2019wrangling,
title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
journal = {Data Mining and Knowledge Discovery},
year = {2019},
volume = {33},
number = {6},
pages = {1799--1820},
issn = {1573-756X},
doi = {10.1007/s10618-019-00646-y},
}
```
And of course, if you like the package please *spread the word!* You can do
this by Tweeting about it
([#CleverCSV](https://twitter.com/hashtag/clevercsv)) or clicking the ⭐️ [on
GitHub](https://github.com/alan-turing-institute/CleverCSV)!
## Installation
CleverCSV is available on PyPI. You can install either the full version, which
includes the command line interface and all optional dependencies, using
```bash
$ pip install clevercsv[full]
```
or you can install a lighter, core version of CleverCSV with
```bash
$ pip install clevercsv
```
## Usage
CleverCSV consists of a Python library and a command line tool called
``clevercsv``.
### Python Library
We designed CleverCSV to provide a drop-in replacement for the built-in CSV
module, with some useful functionality added to it. Therefore, if you simply
want to replace the built-in CSV module with CleverCSV, you can import
CleverCSV as follows, and use it as you would use the built-in [csv
module](https://docs.python.org/3/library/csv.html).
```python
import clevercsv
```
CleverCSV provides an improved version of the dialect sniffer in the CSV
module, but it also adds some useful wrapper functions. These functions
automatically detect the dialect and aim to make working with CSV files
easier. We currently have the following helper functions:
* [detect_dialect](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.detect_dialect):
  takes a path to a CSV file and returns the detected dialect.
* [read_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_table):
  automatically detects the dialect and encoding of the file, and returns the
  data as a list of rows. A version that returns a generator is also
  available:
  [stream_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.stream_table).
* [read_dataframe](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_dataframe):
  detects the dialect and encoding of the file and then uses
  [Pandas](https://pandas.pydata.org/) to read the CSV into a DataFrame. Note
  that this function requires Pandas to be installed.
* [read_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.read_dicts):
  detects the dialect and returns the rows of the file as dictionaries, assuming
  the first row contains the headers. A streaming version called
  [stream_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.stream_dicts)
  is also available.
* [write_table](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.write_table):
  writes a table (a list of lists) to a file using the
  [RFC-4180](https://tools.ietf.org/html/rfc4180) dialect.
* [write_dicts](https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#clevercsv.wrappers.write_dicts):
  writes a list of dictionaries to a file using the
  [RFC-4180](https://tools.ietf.org/html/rfc4180) dialect.

Of course, you can also use the traditional way of loading a CSV file, as in
the Python CSV module:
```python
import clevercsv
with open("data.csv", "r", newline="") as fp:
    # you can use verbose=True to see what CleverCSV does
    dialect = clevercsv.Sniffer().sniff(fp.read(), verbose=False)
    fp.seek(0)
    reader = clevercsv.reader(fp, dialect)
    rows = list(reader)
```
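Because the library mirrors the interface of the `csv` module, the same sniff-then-read pattern should carry over to `DictReader`. The sketch below is an illustration under assumptions rather than part of the official documentation: the sample data is made up, `csv_impl` is a hypothetical alias, and the fallback to the standard library exists only so that the snippet also runs where CleverCSV is not installed.

```python
import io

try:
    import clevercsv as csv_impl  # improved dialect detection
except ImportError:
    import csv as csv_impl  # standard library fallback with the same interface

# Made-up sample data; a real script would open a file with newline="".
sample = "name;age\nAlice;30\nBob;25\n"

# Detect the dialect from the text, then read rows keyed by the header row.
dialect = csv_impl.Sniffer().sniff(sample)
reader = csv_impl.DictReader(io.StringIO(sample), dialect=dialect)
records = [dict(row) for row in reader]
print(records)
```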
For **large files**, you can speed up detection by supplying a smaller sample
to the sniffer, for example:
```python
dialect = clevercsv.Sniffer().sniff(fp.read(10000))
```
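Put together as a complete script, sampling could look like the sketch below. It is a minimal illustration under assumptions: the helper name `read_with_sampled_dialect`, the alias `csv_impl`, and the generated file are all hypothetical, and the fallback import is only there so the snippet runs without CleverCSV installed. The important detail is rewinding the file after sampling so that the reader starts at the first row.

```python
import os
import tempfile

try:
    import clevercsv as csv_impl  # improved dialect detection
except ImportError:
    import csv as csv_impl  # standard library fallback with the same interface

SAMPLE_CHARS = 10000  # number of characters handed to the sniffer


def read_with_sampled_dialect(path, num_chars=SAMPLE_CHARS):
    """Detect the dialect on the first `num_chars` characters, then read it all."""
    with open(path, "r", newline="") as fp:
        dialect = csv_impl.Sniffer().sniff(fp.read(num_chars))
        fp.seek(0)  # rewind so the reader starts at the first row
        return list(csv_impl.reader(fp, dialect))


# Build a made-up semicolon-delimited file that is larger than the sample.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    tmp.write("id;name;value\n")
    for i in range(5000):
        tmp.write(f"{i};item{i};{i * 2}\n")
    path = tmp.name

rows = read_with_sampled_dialect(path)
os.remove(path)
print(len(rows), rows[0], rows[-1])
```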
You can also speed up encoding detection by installing
[cChardet](https://github.com/PyYoshi/cChardet), which is used automatically
when it is available on the system.
That's the basics! If you want more details, you can look at the code of the
package, the test suite, or the [API
documentation](https://clevercsv.readthedocs.io/en/latest/source/modules.html).
If you run into any issues or have comments or suggestions, please open an
issue [on GitHub](https://github.com/alan-turing-institute/CleverCSV).
### Command-Line Tool
*To use the command line tool, make sure that you install the full version of
CleverCSV (see above).*
The ``clevercsv`` command line application has a number of handy features to
make working with CSV files easier. For instance, it can be used to view a CSV
file on the command line while automatically detecting the dialect. It can
also generate Python code for importing data from a file with the correct
dialect. The full help text is as follows:
```text
usage: clevercsv [-h] [-V] [-v] command ...
Available commands:
help Display help information
detect Detect the dialect of a CSV file
view View the CSV file on the command line using TabView
standardize Convert a CSV file to one that conforms to RFC-4180
code Generate Python code to import a CSV file
explore Explore the CSV file in an interactive Python shell
```
Each of the commands has further options (for instance, the ``code`` and
``explore`` commands have support for importing the CSV file as a Pandas
DataFrame). Use ``clevercsv help <command>`` or ``man clevercsv <command>``
for more information. Below are some examples for each command.
Note that each command accepts the ``-n`` or ``--num-chars`` flag to set the
number of characters used to detect the dialect. This can be especially
helpful to speed up dialect detection on large files.
#### Code
Code generation is useful when you don't want to detect the dialect of the
same file over and over again. You simply run the following command and copy
the generated code to a Python script!
```text
$ clevercsv code imdb.csv
# Code generated with CleverCSV
import clevercsv
with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)
```
We also have a version that reads a Pandas DataFrame:
```text
$ clevercsv code --pandas imdb.csv
# Code generated with CleverCSV
import clevercsv
df = clevercsv.read_dataframe("imdb.csv", delimiter=",", quotechar="", escapechar="\\")
```
#### Detect
Detection is useful when you only want to know the dialect.
```text
$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')
```
The ``--plain`` flag gives the components of the dialect on separate lines,
which makes combining it with ``grep`` easier.
```text
$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \
```
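The line-oriented output is also easy to consume from a script. The sketch below parses text in the `key = value` layout shown in the example into a dictionary. It assumes the layout is exactly as printed there, and `parse_plain_dialect` is a hypothetical helper, not part of CleverCSV.

```python
def parse_plain_dialect(text):
    """Turn 'key = value' lines into a dict; a missing value becomes ''."""
    dialect = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first "=" only, so a delimiter of "=" would survive.
        key, _, value = line.partition("=")
        dialect[key.strip()] = value.strip()
    return dialect


# Text copied from the example output (the backslash is escaped for Python).
plain_output = "delimiter = ,\nquotechar =\nescapechar = \\\n"
print(parse_plain_dialect(plain_output))
```

In practice the text could come from running ``clevercsv detect --plain`` through `subprocess.run` with `capture_output=True` and `text=True`. Because the values are stripped, this sketch cannot recover a whitespace delimiter such as a tab.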
#### Explore
The ``explore`` command is great for a command-line based workflow, or when
you quickly want to start working with a CSV file in Python. This command
detects the dialect of a CSV file and starts an interactive Python shell with
the file already loaded! You can either have the file loaded as a list of
lists:
```text
$ clevercsv explore milk.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: rows
>>>
>>> len(rows)
381
```
or you can load the file as a Pandas DataFrame:
```text
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>>
>>> df.head()
                   fn        tid  ...  War  Western
0  titles01/tt0012349  tt0012349  ...    0        0
1  titles01/tt0015864  tt0015864  ...    0        0
2  titles01/tt0017136  tt0017136  ...    0        0
3  titles01/tt0017925  tt0017925  ...    0        0
4  titles01/tt0021749  tt0021749  ...    0        0

[5 rows x 44 columns]
```
#### Standardize
Use the ``standardize`` command when you want to rewrite a file using the
[RFC-4180 standard](https://tools.ietf.org/html/rfc4180):
```text
$ clevercsv standardize --output imdb_standard.csv imdb.csv
```
In this particular example the use of the escape character is replaced by
using quotes.
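
To illustrate what replacing the escape character with quotes means, here is a standard-library sketch, not CleverCSV's own implementation. It reads made-up data in a dialect like the one detected for `imdb.csv` (comma delimiter, no quote character, backslash as escape character) and writes it back with the default `excel` dialect of the `csv` module, which produces comma-separated, double-quoted output with CRLF line endings in the spirit of RFC 4180.

```python
import csv
import io

# Made-up input: the comma inside the title is protected by a backslash.
messy = "title,year\nLock\\, Stock and Two Smoking Barrels,1998\n"

reader = csv.reader(
    io.StringIO(messy),
    delimiter=",",
    escapechar="\\",
    quoting=csv.QUOTE_NONE,  # no quote character in the input
)
rows = list(reader)

# Write with the default dialect: fields that contain a comma get quoted.
out = io.StringIO(newline="")
csv.writer(out).writerows(rows)
standardized = out.getvalue()
print(standardized)
```

The title now appears as `"Lock, Stock and Two Smoking Barrels"` in the output, with quotes instead of the backslash.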
#### View
This command allows you to view the file in the terminal. The dialect is of
course detected using CleverCSV! Both this command and the ``standardize``
command support the ``--transpose`` flag, if you want to transpose the file
before viewing or saving:
```text
$ clevercsv view --transpose imdb.csv
```
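For readers who want the same effect on rows already loaded in Python, transposing a list of rows can be sketched with the standard library as follows. The `transpose` helper is hypothetical, and padding short rows with an empty string is an assumption made for this illustration; how the ``--transpose`` flag treats ragged rows is not specified here.

```python
from itertools import zip_longest


def transpose(rows, fill=""):
    """Swap rows and columns, padding short rows with `fill`."""
    return [list(column) for column in zip_longest(*rows, fillvalue=fill)]


table = [["a", "b", "c"], ["1", "2"]]
print(transpose(table))
```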
### Version Control Integration
If you'd like to make sure that you never commit a messy (non-standard) CSV
file to your repository, you can install a
[pre-commit](https://pre-commit.com/) hook. First, install pre-commit using
the [installation instructions](https://pre-commit.com/#install). Next, add
the following configuration to the ``.pre-commit-config.yaml`` file in your
repository:
```yaml
repos:
  - repo: https://github.com/alan-turing-institute/CleverCSV-pre-commit
    rev: v0.6.6   # or any later version
    hooks:
      - id: clevercsv-standardize
```
Finally, run ``pre-commit install`` to set up the git hook. Pre-commit will
now use CleverCSV to standardize your CSV files following
[RFC-4180](https://tools.ietf.org/html/rfc4180) whenever you commit a CSV file
to your repository.
## Contributing
If you want to encourage development of CleverCSV, the best thing to do now is
to *spread the word!*
If you encounter an issue in CleverCSV, please [open an
issue](https://help.github.com/en/github/managing-your-work-on-github/creating-an-issue)
or [submit a pull
request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request).
Don't hesitate; you're helping to make this project better for everyone! If
GitHub's not your thing but you still want to contact us, you can send an
email to ``gertjanvandenburg at gmail dot com`` instead. You can also ask
questions [on Gitter](https://gitter.im/alan-turing-institute/CleverCSV).
Note that all contributions to the project must adhere to the [Code of
Conduct](https://github.com/alan-turing-institute/CleverCSV/blob/master/CODE_OF_CONDUCT.md).
The CleverCSV package was originally written by [Gertjan van den
Burg](https://gertjan.dev) and came out of [scientific
research](https://gertjanvandenburg.com/papers/VandenBurg_Nazabal_Sutton_-_Wrangling_Messy_CSV_Files_by_Detecting_Row_and_Type_Patterns_2019.pdf)
on wrangling messy CSV files by [Gertjan van den Burg](https://gertjan.dev),
[Alfredo Nazabal](https://scholar.google.com/citations?user=IanHvT4AAAAJ), and
[Charles Sutton](https://homepages.inf.ed.ac.uk/csutton/).
## Notes
CleverCSV is licensed under the [MIT license](./LICENSE). Please [cite our
research](https://link.springer.com/article/10.1007/s10618-019-00646-y) if you
use CleverCSV in your work.
Copyright (c) 2018-2021 [The Alan Turing Institute](https://turing.ac.uk).