.. _10min_tut_10_text:

{{ header }}

.. ipython:: python

    import pandas as pd

.. raw:: html

    <div class="card gs-data">
        <div class="card-header">
            <div class="gs-data-title">
                Data used for this tutorial:
            </div>
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item">
.. include:: includes/titanic.rst

.. ipython:: python

    titanic = pd.read_csv("data/titanic.csv")
    titanic.head()

.. raw:: html

            </li>
        </ul>
    </div>

How to manipulate textual data?
-------------------------------

.. raw:: html

    <ul class="task-bullet">
        <li>

Make all name characters lowercase.

.. ipython:: python

    titanic["Name"].str.lower()

To make each of the strings in the ``Name`` column lowercase, select the ``Name`` column
(see the :ref:`tutorial on selection of data <10min_tut_03_subset>`), add the ``str`` accessor and
apply the ``lower`` method. Each of the strings is then converted element-wise.

.. raw:: html

        </li>
    </ul>

Just as datetime objects have a ``dt`` accessor (see the
:ref:`time series tutorial <10min_tut_09_timeseries>`), strings have a ``str``
accessor that exposes a number of specialized string methods.
These methods generally have the same names as the
equivalent built-in string methods for single elements, but are applied
element-wise (remember :ref:`element-wise calculations <10min_tut_05_columns>`?)
to each of the values in the column.
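
As a minimal sketch of this correspondence, assuming a small hand-made ``names``
Series (illustrative only, not the Titanic file), the built-in ``lower`` method on a
single string can be compared with the same method name used through the ``str``
accessor:

.. code-block:: python

    import pandas as pd

    # Hypothetical two-element Series used only for illustration
    names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

    # Built-in string method, applied to one Python string
    single = names[0].lower()

    # Same method name via the ``str`` accessor, applied to every element
    lowered = names.str.lower()
    print(lowered.tolist())

    # Missing values do not raise an error; they stay missing in the result
    with_missing = pd.Series(["A", None]).str.lower()
    print(with_missing.isna().tolist())

The accessor version returns a new Series with the same index as the input, so the
result can be assigned back as a column.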

.. raw:: html

    <ul class="task-bullet">
        <li>

Create a new column ``Surname`` that contains the surname of the passengers by extracting the part before the comma.

.. ipython:: python

    titanic["Name"].str.split(",")

Using the :meth:`Series.str.split` method, each of the values is returned as a list of
2 elements. The first element is the part before the comma and the
second element is the part after the comma.

.. ipython:: python

    titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    titanic["Surname"]

As we are only interested in the first part representing the surname
(element 0), we can again use the ``str`` accessor and apply :meth:`Series.str.get` to
extract the relevant part. Indeed, these string methods can be
chained to apply several operations in a single expression!

.. raw:: html

        </li>
    </ul>
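
The split-then-get pattern can be sketched on a small hand-made Series (the
``names`` data below is an illustrative assumption, not the Titanic file). The
second half shows one possible alternative: the ``n`` and ``expand`` keywords of
:meth:`Series.str.split`, which return the parts as separate columns instead of lists.

.. code-block:: python

    import pandas as pd

    # Hypothetical sample in the same "Surname, Title. Given names" layout
    names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

    # Step 1: every element becomes a list of the parts around the comma
    parts = names.str.split(",")

    # Step 2: take element 0 of each list
    surname = parts.str.get(0)
    print(surname.tolist())

    # Alternative: split at most once and spread the parts over columns
    split_df = names.str.split(",", n=1, expand=True)
    # The part after the comma keeps its leading space, so strip it
    rest = split_df[1].str.strip()
    print(rest.tolist())

Limiting the number of splits with ``n=1`` keeps names that contain more than one
comma in exactly two parts.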

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

More information on extracting parts of strings is available in the user guide section on :ref:`splitting and replacing strings <text.split>`.

.. raw:: html

   </div>

.. raw:: html

    <ul class="task-bullet">
        <li>

Extract the passenger data about the countesses on board the Titanic.

.. ipython:: python

    titanic["Name"].str.contains("Countess")

.. ipython:: python

    titanic[titanic["Name"].str.contains("Countess")]

(*Interested in her story? See* `Wikipedia <https://en.wikipedia.org/wiki/No%C3%ABl_Leslie,_Countess_of_Rothes>`__\ *!*)

The string method :meth:`Series.str.contains` checks for each of the values in the
column ``Name`` if the string contains the word ``Countess`` and returns
for each of the values ``True`` (``Countess`` is part of the name) or
``False`` (``Countess`` is not part of the name). This output can be used
to subselect the data using conditional (boolean) indexing introduced in
the :ref:`subsetting of data tutorial <10min_tut_03_subset>`. As there was
only one countess on the Titanic, we get one row as a result.

.. raw:: html

        </li>
    </ul>
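
A minimal sketch of the same filtering on a hand-made table (the ``people`` frame
and its missing name are assumptions for illustration). It also shows two optional
keywords of :meth:`Series.str.contains`: ``regex=False`` to treat the pattern as
plain text, and ``na=False`` to count missing names as "no match" so the result can
be used directly as a boolean mask.

.. code-block:: python

    import pandas as pd

    # Hypothetical toy data; the last name is deliberately missing
    people = pd.DataFrame(
        {
            "Name": [
                "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
                "Braund, Mr. Owen Harris",
                None,
            ],
            "Age": [33, 22, 40],
        }
    )

    # One True/False per row; missing names are treated as False
    mask = people["Name"].str.contains("Countess", regex=False, na=False)
    print(mask.tolist())

    # Boolean indexing keeps only the rows where the mask is True
    countesses = people[mask]
    print(len(countesses))

Without ``na=False``, a missing name yields a missing value in the mask, and a
mask containing missing values cannot be used for boolean indexing.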

.. note::
    More powerful extractions on strings are supported, as the
    :meth:`Series.str.contains` and :meth:`Series.str.extract` methods accept `regular
    expressions <https://docs.python.org/3/library/re.html>`__, but these are out of
    scope for this tutorial.
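
For the curious, here is a small sketch of what such a regular-expression
extraction could look like. The sample names and the pattern are illustrative
assumptions: the pattern captures the text between the comma and the first period,
which in this naming layout is the title.

.. code-block:: python

    import pandas as pd

    # Hypothetical sample in the "Surname, Title. Given names" layout
    names = pd.Series(
        [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
        ]
    )

    # Comma, optional whitespace, then capture everything up to the first period
    pattern = r",\s*([^.]+)\."

    # With one capture group and expand=False the result is a Series
    titles = names.str.extract(pattern, expand=False)
    print(titles.tolist())

Rows that do not match the pattern would get a missing value instead of raising an
error.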

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

More information on extracting parts of strings is available in the user guide section on :ref:`string matching and extracting <text.extract>`.

.. raw:: html

   </div>

.. raw:: html

    <ul class="task-bullet">
        <li>

Which passenger of the Titanic has the longest name?

.. ipython:: python

    titanic["Name"].str.len()

To get the longest name we first have to get the length of each of the
names in the ``Name`` column. Through the ``str`` accessor, the
:meth:`Series.str.len` method is applied to each of the names individually
(element-wise).

.. ipython:: python

    titanic["Name"].str.len().idxmax()

Next, we need to get the corresponding location, preferably the index
label, in the table for which the name length is the largest. The
:meth:`~Series.idxmax` method does exactly that. It is not a string method and is
applied to integers, so no ``str`` is used.

.. ipython:: python

    titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

Based on the index label of the row (``307``) and the column name (``Name``),
we can select the value using the ``loc`` operator, introduced in the
:ref:`tutorial on subsetting <10min_tut_03_subset>`.

.. raw:: html

        </li>
    </ul>
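
The three steps can be sketched on a tiny hand-made table. The ``df`` frame and
its index labels ``10, 20, 30`` are assumptions, chosen to make it visible that
:meth:`~Series.idxmax` returns an index *label* rather than a position.

.. code-block:: python

    import pandas as pd

    # Hypothetical data with a non-default index
    df = pd.DataFrame({"Name": ["Ann", "Bartholomew", "Cy"]}, index=[10, 20, 30])

    # Step 1: length of every name (string method, so ``str`` is needed)
    lengths = df["Name"].str.len()

    # Step 2: index label of the largest length (plain Series method)
    longest_label = lengths.idxmax()

    # Step 3: label-based selection of that row and the ``Name`` column
    longest_name = df.loc[longest_label, "Name"]
    print(longest_label, longest_name)

If several names share the maximum length, ``idxmax`` returns the label of the
first one.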

.. raw:: html

    <ul class="task-bullet">
        <li>

In the "Sex" column, replace values of "male" by "M" and values of "female" by "F".

.. ipython:: python

    titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
    titanic["Sex_short"]

Although :meth:`~Series.replace` is not a string method, it provides a convenient way
to use mappings or vocabularies to translate certain values. Here it is given
a ``dict`` that defines the mapping ``{from: to}``.

.. raw:: html

        </li>
    </ul>

.. warning::
    There is also a :meth:`~Series.str.replace` method available to replace a
    specific set of characters. However, with a mapping of multiple
    values, each replacement needs its own call:

    ::

        titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
        titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

    This quickly becomes cumbersome and can easily lead to mistakes. Just think (or
    try it out yourself) what would happen if those two statements were applied
    in the opposite order…
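
The difference between the two approaches can be sketched on a three-element
Series (the ``sex`` data is a hand-made assumption). :meth:`~Series.replace` with a
``dict`` matches whole values, whereas :meth:`~Series.str.replace` substitutes
substrings, so the order of the calls matters. ``regex=False`` is passed explicitly
here so that the pattern is treated as plain text.

.. code-block:: python

    import pandas as pd

    # Hypothetical sample values
    sex = pd.Series(["male", "female", "female"])

    # Whole-value mapping: order of the dict entries is irrelevant
    mapped = sex.replace({"male": "M", "female": "F"})
    print(mapped.tolist())

    # Substring replacement, longer word first: gives the intended result
    right_order = sex.str.replace("female", "F", regex=False).str.replace(
        "male", "M", regex=False
    )
    print(right_order.tolist())

    # Opposite order: "male" inside "female" is replaced first, leaving "feM",
    # which the second call no longer recognizes as "female"
    wrong_order = sex.str.replace("male", "M", regex=False).str.replace(
        "female", "F", regex=False
    )
    print(wrong_order.tolist())
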

.. raw:: html

    <div class="shadow gs-callout gs-callout-remember">
        <h4>REMEMBER</h4>

-  String methods are available using the ``str`` accessor.
-  String methods work element-wise and can be used for conditional
   indexing.
-  The ``replace`` method is a convenient method to convert values
   according to a given dictionary.

.. raw:: html

   </div>

.. raw:: html

    <div class="d-flex flex-row gs-torefguide">
        <span class="badge badge-info">To user guide</span>

A full overview is provided in the user guide pages on :ref:`working with text data <text>`.

.. raw:: html

   </div>