File: comparison_with_sql.rst

.. _compare_with_sql:

{{ header }}

Comparison with SQL
********************
Since many potential pandas users have some familiarity with
`SQL <https://en.wikipedia.org/wiki/SQL>`_, this page is meant to provide some examples of how
various SQL operations would be performed using pandas.

.. include:: includes/introduction.rst

Most of the examples will utilize the ``tips`` dataset found within pandas tests.  We'll read
the data into a DataFrame called ``tips`` and assume we have a database table of the same name and
structure.

.. ipython:: python

    url = (
        "https://raw.githubusercontent.com/pandas-dev"
        "/pandas/main/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
    tips


Copies vs. in place operations
------------------------------

.. include:: includes/copies.rst


SELECT
------
In SQL, selection is done using a comma-separated list of columns you'd like to select (or a ``*``
to select all columns):

.. code-block:: sql

    SELECT total_bill, tip, smoker, time
    FROM tips;

With pandas, column selection is done by passing a list of column names to your DataFrame:

.. ipython:: python

    tips[["total_bill", "tip", "smoker", "time"]]

Calling the DataFrame without the list of column names would display all columns (akin to SQL's
``*``).
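
A related subtlety worth knowing when translating ``SELECT``: indexing with a single
column label returns a one-dimensional ``Series``, while a single-element list returns a
one-column ``DataFrame``. A minimal sketch using a small stand-in frame (the values here
are illustrative, not from the ``tips`` dataset):

```python
import pandas as pd

# A tiny stand-in for the tips table
df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

s = df["tip"]      # single label  -> Series (1-dimensional)
f = df[["tip"]]    # list of labels -> DataFrame (2-dimensional)

print(type(s).__name__)  # Series
print(type(f).__name__)  # DataFrame
```

This distinction matters when chaining further operations, since ``Series`` and
``DataFrame`` expose slightly different methods.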

In SQL, you can add a calculated column:

.. code-block:: sql

    SELECT *, tip/total_bill as tip_rate
    FROM tips;

With pandas, you can use the :meth:`DataFrame.assign` method of a DataFrame to append a new column:

.. ipython:: python

    tips.assign(tip_rate=tips["tip"] / tips["total_bill"])

WHERE
-----
Filtering in SQL is done via a ``WHERE`` clause.

.. code-block:: sql

    SELECT *
    FROM tips
    WHERE time = 'Dinner';

.. include:: includes/filtering.rst

Just like SQL's ``OR`` and ``AND``, multiple conditions can be passed to a DataFrame using ``|``
(``OR``) and ``&`` (``AND``). Each condition must be wrapped in parentheses, because ``&`` and
``|`` bind more tightly than comparison operators such as ``==``.

Tips of more than $5 at Dinner meals:

.. code-block:: sql

    SELECT *
    FROM tips
    WHERE time = 'Dinner' AND tip > 5.00;

.. ipython:: python

    tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

Tips by parties of at least 5 diners OR bill total was more than $45:

.. code-block:: sql

    SELECT *
    FROM tips
    WHERE size >= 5 OR total_bill > 45;

.. ipython:: python

    tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
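
SQL's ``NOT`` has a pandas counterpart as well: the ``~`` operator inverts a boolean mask.
A sketch on a toy frame (the data values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"time": ["Dinner", "Lunch", "Dinner"], "tip": [5.5, 1.5, 3.0]})

# WHERE NOT (time = 'Dinner')
not_dinner = df[~(df["time"] == "Dinner")]
print(not_dinner)
```

As with ``&`` and ``|``, the negated condition must be parenthesized.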

NULL checking is done using the :meth:`~pandas.Series.notna` and :meth:`~pandas.Series.isna`
methods.

.. ipython:: python

    frame = pd.DataFrame(
        {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
    )
    frame

Assume we have a table of the same structure as our DataFrame above. We can see only the records
where ``col2`` IS NULL with the following query:

.. code-block:: sql

    SELECT *
    FROM frame
    WHERE col2 IS NULL;

.. ipython:: python

    frame[frame["col2"].isna()]

Getting items where ``col1`` IS NOT NULL can be done with :meth:`~pandas.Series.notna`.

.. code-block:: sql

    SELECT *
    FROM frame
    WHERE col1 IS NOT NULL;

.. ipython:: python

    frame[frame["col1"].notna()]


GROUP BY
--------
In pandas, SQL's ``GROUP BY`` operations are performed using the similarly named
:meth:`~pandas.DataFrame.groupby` method. :meth:`~pandas.DataFrame.groupby` typically refers to a
process where we'd like to split a dataset into groups, apply some function (typically an
aggregation), and then combine the groups.

A common SQL operation would be getting the count of records in each group throughout a dataset.
For instance, a query getting us the number of tips left by sex:

.. code-block:: sql

    SELECT sex, count(*)
    FROM tips
    GROUP BY sex;
    /*
    Female     87
    Male      157
    */


The pandas equivalent would be:

.. ipython:: python

    tips.groupby("sex").size()

Notice that in the pandas code we used :meth:`.DataFrameGroupBy.size` and not
:meth:`.DataFrameGroupBy.count`. This is because
:meth:`.DataFrameGroupBy.count` applies the function to each column, returning
the number of ``NOT NULL`` records within each.

.. ipython:: python

    tips.groupby("sex").count()

Alternatively, we could have applied the :meth:`.DataFrameGroupBy.count` method
to an individual column:

.. ipython:: python

    tips.groupby("sex")["total_bill"].count()

Multiple functions can also be applied at once. For instance, say we'd like to see how tip amount
differs by day of the week - :meth:`.DataFrameGroupBy.agg` allows you to pass a dictionary
to your grouped DataFrame, indicating which functions to apply to specific columns.

.. code-block:: sql

    SELECT day, AVG(tip), COUNT(*)
    FROM tips
    GROUP BY day;
    /*
    Fri   2.734737   19
    Sat   2.993103   87
    Sun   3.255132   76
    Thu   2.771452   62
    */

.. ipython:: python

    tips.groupby("day").agg({"tip": "mean", "day": "size"})
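
An alternative to the dictionary form is named aggregation, where each output column is
spelled as ``new_name=(input_column, function)``; this also sidesteps reusing the grouping
column (as ``{"day": "size"}`` does above) just to obtain a count. A sketch on toy data
(the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"day": ["Fri", "Fri", "Sat"], "tip": [2.0, 4.0, 3.0]})

# One keyword per output column: name=(input column, aggregation)
out = df.groupby("day").agg(tip_mean=("tip", "mean"), n=("tip", "size"))
print(out)
```

The result has plain column names (``tip_mean``, ``n``) rather than a MultiIndex on the
columns, which the dictionary form with a list of functions would produce.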

Grouping by more than one column is done by passing a list of columns to the
:meth:`~pandas.DataFrame.groupby` method.

.. code-block:: sql

    SELECT smoker, day, COUNT(*), AVG(tip)
    FROM tips
    GROUP BY smoker, day;
    /*
    smoker day
    No     Fri      4  2.812500
           Sat     45  3.102889
           Sun     57  3.167895
           Thu     45  2.673778
    Yes    Fri     15  2.714000
           Sat     42  2.875476
           Sun     19  3.516842
           Thu     17  3.030000
    */

.. ipython:: python

    tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})

.. _compare_with_sql.join:

JOIN
----
``JOIN``\s can be performed with :meth:`~pandas.DataFrame.join` or :meth:`~pandas.merge`. By
default, :meth:`~pandas.DataFrame.join` will join the DataFrames on their indices. Each method has
parameters allowing you to specify the type of join to perform (``LEFT``, ``RIGHT``, ``INNER``,
``FULL``) or the columns to join on (column names or indices).

.. warning::

    If both key columns contain rows where the key is a null value, those
    rows will be matched against each other. This is different from usual SQL
    join behaviour and can lead to unexpected results.
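
To see the behaviour the warning describes, here is a minimal sketch: a null key on each
side produces a matched row, which SQL's ``=`` comparison would never do (in SQL,
``NULL = NULL`` is not true):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["A", np.nan], "lval": [1, 2]})
right = pd.DataFrame({"key": ["B", np.nan], "rval": [3, 4]})

# The NaN keys match each other, so one row survives the inner join
merged = pd.merge(left, right, on="key")
print(merged)
```

If SQL-like behaviour is needed, drop the null-keyed rows before merging.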

.. ipython:: python

    df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
    df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

Assume we have two database tables of the same name and structure as our DataFrames.

Now let's go over the various types of ``JOIN``\s.

INNER JOIN
~~~~~~~~~~
.. code-block:: sql

    SELECT *
    FROM df1
    INNER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    # merge performs an INNER JOIN by default
    pd.merge(df1, df2, on="key")

:meth:`~pandas.merge` also offers parameters for cases when you'd like to join one DataFrame's
column with another DataFrame's index.

.. ipython:: python

    indexed_df2 = df2.set_index("key")
    pd.merge(df1, indexed_df2, left_on="key", right_index=True)

LEFT OUTER JOIN
~~~~~~~~~~~~~~~

Show all records from ``df1``.

.. code-block:: sql

    SELECT *
    FROM df1
    LEFT OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    pd.merge(df1, df2, on="key", how="left")

RIGHT JOIN
~~~~~~~~~~

Show all records from ``df2``.

.. code-block:: sql

    SELECT *
    FROM df1
    RIGHT OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    pd.merge(df1, df2, on="key", how="right")

FULL JOIN
~~~~~~~~~
pandas also allows for ``FULL JOIN``\s, which display both sides of the dataset, whether or not the
joined columns find a match. Note that not all RDBMS support ``FULL JOIN``\s (for instance, MySQL
does not).

Show all records from both tables.

.. code-block:: sql

    SELECT *
    FROM df1
    FULL OUTER JOIN df2
      ON df1.key = df2.key;

.. ipython:: python

    pd.merge(df1, df2, on="key", how="outer")
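
When inspecting an outer join, the ``indicator=True`` parameter of ``merge`` adds a
``_merge`` column recording whether each row came from the left side, the right side, or
both. A sketch with made-up keys:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B"], "lval": [1, 2]})
df2 = pd.DataFrame({"key": ["B", "C"], "rval": [3, 4]})

out = pd.merge(df1, df2, on="key", how="outer", indicator=True)
print(out[["key", "_merge"]])
```

This is handy for verifying that a ``FULL JOIN`` matched the rows you expected.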


UNION
-----

``UNION ALL`` can be performed using :meth:`~pandas.concat`.

.. ipython:: python

    df1 = pd.DataFrame(
        {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
    )
    df2 = pd.DataFrame(
        {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
    )

.. code-block:: sql

    SELECT city, rank
    FROM df1
    UNION ALL
    SELECT city, rank
    FROM df2;
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
          Chicago     1
           Boston     4
      Los Angeles     5
    */

.. ipython:: python

    pd.concat([df1, df2])

SQL's ``UNION`` is similar to ``UNION ALL``, however ``UNION`` will remove duplicate rows.

.. code-block:: sql

    SELECT city, rank
    FROM df1
    UNION
    SELECT city, rank
    FROM df2;
    -- notice that there is only one Chicago record this time
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
           Boston     4
      Los Angeles     5
    */

In pandas, you can use :meth:`~pandas.concat` in conjunction with
:meth:`~pandas.DataFrame.drop_duplicates`.

.. ipython:: python

    pd.concat([df1, df2]).drop_duplicates()


LIMIT
-----

.. code-block:: sql

    SELECT * FROM tips
    LIMIT 10;

.. ipython:: python

    tips.head(10)


pandas equivalents for some SQL analytic and aggregate functions
----------------------------------------------------------------

Top n rows with offset
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: sql

    -- MySQL
    SELECT * FROM tips
    ORDER BY tip DESC
    LIMIT 10 OFFSET 5;

.. ipython:: python

    tips.nlargest(10 + 5, columns="tip").tail(10)
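
An equivalent, arguably more direct translation is to sort and then slice positionally
with ``iloc``: ``LIMIT n OFFSET k`` becomes rows ``k`` through ``k + n - 1`` of the sorted
frame. A sketch on toy data (``LIMIT 3 OFFSET 2`` for brevity):

```python
import pandas as pd

df = pd.DataFrame({"tip": [9, 1, 7, 3, 8, 2, 6, 5, 4, 0]})

# ORDER BY tip DESC LIMIT 3 OFFSET 2  ->  positions 2..4 after sorting
out = df.sort_values("tip", ascending=False).iloc[2:5]
print(out["tip"].tolist())  # [7, 6, 5]
```

Note that ``nlargest`` and a full sort can order tied values differently, so the two
approaches may disagree on which tied rows appear at the cutoff.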

Top n rows per group
~~~~~~~~~~~~~~~~~~~~

.. code-block:: sql

    -- Oracle's ROW_NUMBER() analytic function
    SELECT * FROM (
      SELECT
        t.*,
        ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
      FROM tips t
    )
    WHERE rn < 3
    ORDER BY day, rn;


.. ipython:: python

    (
        tips.assign(
            rn=tips.sort_values(["total_bill"], ascending=False)
            .groupby(["day"])
            .cumcount()
            + 1
        )
        .query("rn < 3")
        .sort_values(["day", "rn"])
    )

The same result can be obtained using the ``rank(method='first')`` function:

.. ipython:: python

    (
        tips.assign(
            rnk=tips.groupby(["day"])["total_bill"].rank(
                method="first", ascending=False
            )
        )
        .query("rnk < 3")
        .sort_values(["day", "rnk"])
    )

.. code-block:: sql

    -- Oracle's RANK() analytic function
    SELECT * FROM (
      SELECT
        t.*,
        RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
      FROM tips t
      WHERE tip < 2
    )
    WHERE rnk < 3
    ORDER BY sex, rnk;

Let's find tips with rank < 3 per gender group for tips < 2.
Notice that when using ``rank(method='min')``, ``rnk_min`` remains the same for
rows with the same ``tip`` value (matching the behaviour of Oracle's ``RANK()`` function).

.. ipython:: python

    (
        tips[tips["tip"] < 2]
        .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
        .query("rnk_min < 3")
        .sort_values(["sex", "rnk_min"])
    )


UPDATE
------

.. code-block:: sql

    UPDATE tips
    SET tip = tip*2
    WHERE tip < 2;

.. ipython:: python

    tips.loc[tips["tip"] < 2, "tip"] *= 2

DELETE
------

.. code-block:: sql

    DELETE FROM tips
    WHERE tip > 9;

In pandas we select the rows that should remain instead of deleting them:

.. ipython:: python

    tips = tips.loc[tips["tip"] <= 9]
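
An equivalent formulation phrased as an actual deletion uses ``DataFrame.drop`` with the
index of the offending rows. A sketch on toy data (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"tip": [1.0, 10.0, 3.0]})

# DELETE FROM df WHERE tip > 9  ->  drop the rows whose mask is True
df = df.drop(df[df["tip"] > 9].index)
print(df["tip"].tolist())  # [1.0, 3.0]
```

Both spellings produce the same result; selecting the rows to keep is usually the more
idiomatic pandas style.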