1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
|
===============
Getting started
===============
About this tutorial
===================
There is no better way to learn how to use a new tool than to see it applied in a real world situation. To that end, this tutorial explains how to use csvkit tools by analyzing a real dataset.
The data we will be using is a subset of the United States Defense Logistic Agency Law Enforcement Support Office's (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed to local police forces. This data was widely cited in the aftermath of the Ferguson, Missouri protests. The particular data we are using comes from an `NPR report <https://www.npr.org/2014/09/02/342494225/mraps-and-bayonets-what-we-know-about-the-pentagons-1033-program>`_ analyzing the data.
This tutorial assumes you have some basic familiarity with the command line. If you don't have much experience, fear not! This has been written with beginners in mind. No prior experience with data processing or analysis is assumed.
Installing csvkit
=================
.. note::
It's best to install csvkit in a `virtual environment <https://docs.python.org/3/library/venv.html>`_.
.. code-block:: bash
pip install csvkit
Not working? Check :ref:`troubleshooting`.
If you need to work with `Zstandard <https://facebook.github.io/zstd/>`_ files with the ``.zst`` extension, install Zstandard support:
.. code-block:: bash
pip install csvkit[zstandard]
Do you use Homebrew? You can `install csvkit system-wide <https://formulae.brew.sh/formula/csvkit>`_ with:
.. code-block:: bash
brew install csvkit
Getting the data
================
Let's start by creating a clean workspace:
.. code-block:: bash
mkdir csvkit_tutorial
cd csvkit_tutorial
Now let's fetch the data:
.. code-block:: bash
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/realdata/ne_1033_data.xlsx
in2csv: the Excel killer
========================
For purposes of this tutorial, I've converted this data to Excel format. (NPR published it in CSV format.) If you have Excel you can open the file and take a look at it, but really, who wants to wait for Excel to load? Instead, let's convert it to a CSV:
.. code-block:: bash
in2csv ne_1033_data.xlsx
You should see a CSV version of the data dumped into your terminal. All csvkit tools write to the terminal output, called "standard out", by default. This isn't very useful, so let's write it to a file instead:
.. code-block:: bash
in2csv ne_1033_data.xlsx > data.csv
``data.csv`` will now contain a CSV version of our original file. If you aren't familiar with the ``>`` syntax, it means "redirect standard out to a file". If that's hard to remember it may be more convenient to think of it as "save to".
We can verify that the data is saved to the new file by using the ``cat`` command to print it:
.. code-block:: bash
cat data.csv
:doc:`/scripts/in2csv` can convert a variety of common file formats to CSV, including both ``.xls`` and ``.xlsx`` Excel files, JSON files, and fixed-width formatted files.
csvlook: data periscope
=======================
Now that we have some data, we probably want to get some idea of what's in it. We could open it in Excel or Google Docs, but wouldn't it be nice if we could just take a look in the command line? To do that, we can use :doc:`/scripts/csvlook`:
.. code-block:: bash
csvlook data.csv
At first the output of :doc:`/scripts/csvlook` isn't going to appear very promising. You'll see a mess of data, pipe character and dashes. That's because this dataset has many columns and they won't all fit in the terminal at once. You have two options:
1. Pipe the output to ``less -S`` to display the lines without wrapping and use the arrow keys to scroll left and right:
.. code-block:: bash
csvlook data.csv | less -S
2. Reduce which columns of our dataset are displayed before we look at it. This is what will do in the next section.
csvcut: data scalpel
====================
:doc:`/scripts/csvcut` is the original csvkit tool. It inspired the rest. With it, we can select, delete and reorder the columns in our CSV. First, let's just see what columns are in our data:
.. code-block:: console
$ csvcut -n data.csv
1: state
2: county
3: fips
4: nsn
5: item_name
6: quantity
7: ui
8: acquisition_cost
9: total_cost
10: ship_date
11: federal_supply_category
12: federal_supply_category_name
13: federal_supply_class
14: federal_supply_class_name
As you can see, our dataset has fourteen columns. Let's take a look at just columns ``2``, ``5`` and ``6``:
.. code-block:: bash
csvcut -c 2,5,6 data.csv
Now we've reduced our output CSV to only three columns.
We can also refer to columns by their names to make our lives easier:
.. code-block:: bash
csvcut -c county,item_name,quantity data.csv
Putting it together with pipes
==============================
Now that we understand :doc:`/scripts/in2csv`, :doc:`/scripts/csvlook` and :doc:`/scripts/csvcut` we can demonstrate the power of csvkit's when combined with the standard command-line "pipe". Try this command:
.. code-block:: bash
csvcut -c county,item_name,quantity data.csv | csvlook | head
In addition to specifying filenames, all csvkit tools accept an input file via "standard in". This means that, using the ``|`` ("pipe") character we can use the output of one csvkit tool as the input of the next.
In the example above, the output of :doc:`/scripts/csvcut` becomes the input to :doc:`/scripts/csvlook`. This also allow us to pipe output to standard Unix commands such as ``head``, which prints only the first ten lines of its input. Here, the output of :doc:`/scripts/csvlook` becomes the input of ``head``.
Piping is a core feature of csvkit. Of course, you can always write the output of each command to a file using ``>``. However, it's often faster and more convenient to use pipes to chain several commands together.
We can also pipe :doc:`/scripts/in2csv`, allowing us to combine all our previous operations into one:
.. code-block:: bash
in2csv ne_1033_data.xlsx | csvcut -c county,item_name,quantity | csvlook | head
Summing up
==========
All the csvkit tools work with standard input and output. Any tool can be piped into another and into another. The output of any tool can be redirected to a file. In this way they form a data processing "pipeline" of sorts, allowing you to do non-trivial, repeatable work without creating dozens of intermediary files.
Make sense? If you think you've got it figured out, you can move on to :doc:`2_examining_the_data`.
|