\chapter{String-valued series}
\label{chap:strval-series}

\section{Introduction}

Gretl's support for data series with string values has gone through
three phases:
\begin{enumerate}
\item No support: we simply rejected non-numerical values when reading
  data from file.
\item Numeric encoding only: we would read a string-valued series from
  a delimited text data file (provided the series didn't mix numerical
  values and strings) but the representation of the data within gretl
  was purely numerical. We printed a ``string table'' showing the
  mapping between the original strings and gretl's encoding and it was
  up to the user to keep track of this mapping.
\item Preservation of string values: the string table that we
  construct in reading a string-valued series is now stored as a
  component of the dataset so it's possible to display and manipulate
  these values within gretl.
\end{enumerate}

The third phase has now been in effect for several years, with a
series of gradual refinements. This chapter gives an account of the
status quo. It explains how to create string-valued series and
describes the operations that are supported for such series.

\section{Creating a string-valued series}

This can be done in two ways: first, by reading such a series from a
suitable source file and second, by taking a suitable numerical series
within gretl and adding string values using the \cmd{stringify()}
function. In either case string values will now be preserved when such
a series is saved in a gretl-native data file.

\subsection{Reading string-valued series}
\label{sec:reading}

The primary ``suitable source'' for string-valued series is a
delimited text data file (but see section~\ref{sec:other-imports}
below). Here's a little example. The following is the content of a
file named \texttt{gc.csv}:
%
\begin{code}
city,year
"Bilbao",2009
"Toruń",2011
"Oklahoma City",2013
"Berlin",2015
"Athens",2017
\end{code}
%
and here's a script:
%
\begin{code}
open gc.csv --quiet
print --byobs
print city --byobs --numeric
printf "The third gretl conference took place in %s.\n", city[3]
\end{code}

The output from the script is:
%
\begin{code}
? print --byobs

          city         year

1       Bilbao         2009
2        Toruń         2011
3 Oklahoma C..         2013
4       Berlin         2015
5       Athens         2017

? print city --byobs --numeric

          city

1            1
2            2
3            3
4            4
5            5

The third gretl conference took place in Oklahoma City.
\end{code}

From this we can see a few things. 
\begin{itemize}
\item By default the \cmd{print} command shows us the string values
  of the series \texttt{city}, and it handles non-ASCII characters
  provided they're in UTF-8 (but it doesn't handle longer strings
  very elegantly).
\item The \verb|--numeric| option to \cmd{print} exposes the
  numeric codes for a string-valued series.
\item The syntax \texttt{seriesname[obs]} gives a string when a series
  is string-valued.
\end{itemize}

Suppose you want to access the numeric code for a particular
string-valued observation: you can get that by ``casting'' the series
to a vector. Thus
\begin{code}
printf "The code for '%s' is %d.\n", city[3], {city}[3]
\end{code}
gives
\begin{code}
The code for 'Oklahoma City' is 3.
\end{code}

The numeric codes for string-valued series are always assigned thus:
reading the data file row by row, the first string value is assigned
1, the next \textit{distinct} string value is assigned 2, and so on.
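
To make the coding rule concrete, here is a small sketch based on a
hypothetical file \texttt{visits.csv} (our own illustration, not a
file supplied with gretl) in which one city appears twice:
%
\begin{code}
city,year
"Berlin",2015
"Athens",2017
"Berlin",2019
"Bilbao",2009
\end{code}
%
Assuming the file is in the working directory, the codes can be
inspected with
%
\begin{code}
open visits.csv --quiet
# show the numeric codes rather than the strings
print city --byobs --numeric
\end{code}
%
By the rule just stated the codes should be 1, 2, 1, 3: the second
occurrence of ``Berlin'' reuses code 1, and ``Bilbao'', the third
distinct string, gets code 3.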

\subsection{Assigning string values to an existing series}
\label{sec:stringify}

This is done via the \cmd{stringify()} function, which takes two
arguments, the name of a series and an array of strings. For this to
work two conditions must be met:

\begin{enumerate}
\item The series must have only integer values and the smallest value
  must be 1 or greater.
\item The array of strings must have at least $n$ members, where $n$
  is the largest value found in the series.
\end{enumerate}

The logic of these conditions is that we're looking to create a
mapping as described above, from a 1-based sequence of integers to a
set of strings. However, we're allowing for the possibility that the
series in question is an incomplete sample from an associated
population. Suppose we have a series that goes 2, 3, 5, 9, 10. This is
taken to be a sample from a population that has at least 10 discrete
values, 1, 2, \dots{}, 10, and so requires at least 10 value-strings.
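
The two conditions can be checked in advance. The following is only a
sketch: \verb|can_stringify()| is a hypothetical helper of our own
(not a gretl built-in), and the strings ``level 1'' to ``level 10''
are made up for the purpose of the example. It applies the check to
the series 2, 3, 5, 9, 10 mentioned above.
%
\begin{code}
# can_stringify: hypothetical helper, not a gretl built-in
function scalar can_stringify (series s, strings S)
  # condition 1: integer values only, smallest value 1 or greater
  scalar ints_ok = sum(s != floor(s)) == 0 && min(s) >= 1
  # condition 2: at least as many strings as the largest value
  scalar len_ok = nelem(S) >= max(s)
  return ints_ok && len_ok
end function

nulldata 5
series s = {2; 3; 5; 9; 10}
# the largest value is 10, so we need (at least) 10 strings
strings S = array(10)
loop i=1..10 --quiet
  S[i] = sprintf("level %d", i)
endloop
if can_stringify(s, S)
  stringify(s, S)
endif
print s --byobs
\end{code}
%
If the check passes and \cmd{stringify()} succeeds, \cmd{print}
should show string values rather than numbers for \texttt{s}.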

One aspect of \cmd{stringify()} is debatable. At present the
function returns 0 on success, otherwise an integer error code; it
doesn't explicitly ``fail'' if the required conditions are not met,
and it's up to the user to check if things went OK. Maybe it should
just fail on error?
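
Given this behavior, a cautious script can capture the return value
and act on it. A minimal sketch, assuming a suitable numeric series
\texttt{grade} already exists (the variable name \texttt{err} is our
own choice):
%
\begin{code}
# stringify() returns 0 on success, an error code otherwise
scalar err = stringify(grade, strsplit("F D C B A"))
if err
  printf "stringify failed, error code %d\n", err
endif
\end{code}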

Here's (a simplified version of) an example that one of the authors
has had cause to use: deriving US-style ``letter grades'' from a
series containing percentage scores for students. Call the percentage
series $x$, and say we want to create a series with values \texttt{A}
for $x \geq 90$, \texttt{B} for $80 \leq x <90$, and so on down to
\texttt{F} for $x<60$. Then we can do:
\begin{code}
series grade = 1 # F, the least value
grade += x >= 60 # D
grade += x >= 70 # C
grade += x >= 80 # B
grade += x >= 90 # A
stringify(grade, strsplit("F D C B A"))
\end{code}
%
The way the \texttt{grade} series is constructed is not the most
compact, but it's nice and explicit, and easy to amend if one wants to
adjust the threshold values. Note the use of \cmd{strsplit()} to
create an on-the-fly array of strings from a string literal; this is
convenient when the array contains a moderate number of elements with
no embedded spaces. An alternative way to get the same result is to
define the array of strings via the \cmd{defarray()} function, as in
\begin{code}
stringify(grade,defarray("F","D","C","B","A"))
\end{code}



We should also mention that we have a function to perform the inverse
operation of \cmd{stringify()}: the \cmd{strvals()} function
retrieves the array of string values from a series. (It returns an
empty array if the series is not string-valued.)
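
As an illustration, the following sketch uses \cmd{strvals()} to print
the mapping from numeric codes to strings for the \texttt{grade}
series created above. The names \texttt{S} and \texttt{n} are just
for the example, and we rely on the fact that position $i$ in the
array corresponds to code $i$.
%
\begin{code}
strings S = strvals(grade)
scalar n = nelem(S)
# element i of the array is the string for numeric code i
loop i=1..n --quiet
  printf "%d -> %s\n", i, S[i]
endloop
\end{code}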

\section{Permitted operations}

One question that arises with string-valued series is, what are you
allowed to do with them and what is banned? This may be another
debatable point, but here we set out the current state of things.

\subsection{Setting values per observation}

You can set particular values in a string-valued series either by
string or numeric code. For example, suppose (in relation to the
example in section~\ref{sec:stringify}) that for some reason student
number 31 with a percentage score of 88 nonetheless merits an
\texttt{A} grade. We could do
\begin{code}
grade[31] = "A"
\end{code}
or, if we're confident about the mapping,
\begin{code}
grade[31] = 5
\end{code}
Or to raise the student's grade by one letter:
\begin{code}
grade[31] += 1
\end{code}

What you're \textit{not} allowed to do here is make a numerical
adjustment that would put the value out of bounds in relation to the
set of string values. For example, if we tried \texttt{grade[31] = 6}
we'd get an error. 

On the other hand, you \textit{can} implicitly extend the set of
string values. This wouldn't make sense for the letter grades example
but it might for, say, city names. Returning to the example in
section~\ref{sec:reading} suppose we try
%
\begin{code}
dataset addobs 1
year[6] = 2019
city[6] = "Naples?"
\end{code}
%
This will work OK: we're implicitly adding another member to the
string table for \texttt{city}; the associated numeric code will be
the next available integer.\footnote{Admittedly there is a downside to
  this feature: one may inadvertently add a new string value by
  mistyping a string that's already present.}
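
One way of guarding against such accidents is to check a string
against the existing string table before assigning it. The following
is a sketch only: \verb|has_strval()| is a hypothetical helper of our
own, not a gretl built-in, and the example continues from the
\texttt{city} series above with its added sixth observation.
%
\begin{code}
# has_strval: hypothetical helper, returns 1 if v is already
# among the string values of s, otherwise 0
function scalar has_strval (series s, string v)
  strings S = strvals(s)
  scalar n = nelem(S)
  scalar found = 0
  loop i=1..n --quiet
    if S[i] == v
      found = 1
      break
    endif
  endloop
  return found
end function

# assign only if the string is already in the table
if has_strval(city, "Berlin")
  city[6] = "Berlin"
else
  print "Unknown city: not assigned"
endif
\end{code}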

\subsection{Assignment to an entire series}

This is disallowed: you can't execute an assignment of any sort with
the name of a string-valued series \textit{per se} on the left-hand
side. Put differently, you cannot overwrite an entire string-valued
series at once. This may be debatable, but it's much the easiest way
of ensuring that we never end up with a broken mapping. If anyone can
come up with a really good reason for wanting to do this we might
reconsider.

Besides assigning an out-of-bounds numerical value to a particular
observation, this sort of assignment is in fact the only operation
that is banned for string-valued series.

\subsection{Missing values}

We support one exception to the general rule that the mapping between
strings and numeric codes for a string-valued series must never be
broken: you can mark particular observations as missing. This is done
in the usual way, e.g.,
\begin{code}
grade[31] = NA
\end{code}
Note, however, that on importing a string series from a delimited text
file any non-blank strings (including ``NA'') will be interpreted as
valid values; any missing values in such a file should therefore be
represented by blank cells.

\subsection{Copying a string-valued series}

If you make a copy of a string-valued series, as in
\begin{code}
series foo = city
\end{code}
the string values are \textit{not} copied over: you get a purely
numerical series holding the codes of the original series. But if you
want a full copy with the string values that can easily be arranged:
\begin{code}
series citycopy = city
stringify(citycopy, strvals(city))
\end{code}

\subsection{String-valued series in other contexts}

String-valued series can be used on the right-hand side of assignment
statements at will, and in that context their numerical values are
taken. For example,
%
\begin{code}
series y = sqrt(city)
\end{code}
%
will elicit no complaint and generate a numerical series 1, 1.41421,
\dots{}. It's up to the user to judge whether this sort of thing
makes any sense.

Similarly, it's up to the user to decide if it makes sense to use a
string-valued series ``as is'' in a regression model, whether as
regressand or regressor---again, the numerical values of the series
are taken. Often this will not make sense, but sometimes it may: the
numerical values may by design form an ordinal, or even a cardinal,
scale (as in the ``grade'' example in section~\ref{sec:stringify}).

More likely, one would want to use \cmd{dummify} on a string-valued
series before using it in statistical modeling. In that context
gretl's series labels are suitably informative. For example, suppose
we have a series \texttt{race} with numerical values 1, 2 and 3 and
associated strings ``White'', ``Black'' and ``Other''. Then the hansl
code
\begin{code}
list D = dummify(race)
labels
\end{code}
will show these labels:
\begin{code}
Drace_2: dummy for race = 'Black'
Drace_3: dummy for race = 'Other'
\end{code}

Given such a series you can use string values in a sample restriction,
as in
\begin{code}
smpl race == "Black" --restrict
\end{code}
(although \texttt{race == 2} would also be acceptable).

There may be other contexts that we haven't yet thought of where it
would be good to have string values displayed and/or accepted on
input; suggestions are welcome.

\section{String-valued series and functions}

User-defined hansl functions can deal with string-valued series,
although there are a few points to note.

If you supply such a series as an argument to a hansl function its
string values will be accessible within the function. One can test
whether a given series \texttt{arg} is string-valued as follows:
\begin{code}
if nelem(strvals(arg)) > 0
  # yes
else
  # no
endif
\end{code}

Now suppose one wanted to put something like the code that generated
the \texttt{grade} series in section~\ref{sec:stringify} into a
function. That can be done, but \textit{not} in the form of a function
that directly returns the desired series---that is, something like
\begin{code}
function series letter_grade (series x)
  series grade
  # define grade based on x and stringify it, as shown above
  return grade
end function
\end{code}
%
Unfortunately the above will \emph{not} work: the caller will get the
\texttt{grade} series OK but it won't be string-valued. At first sight
this may seem to be a bug but it's defensible as a consequence of the
way series work in gretl.

The point is that series have, so to speak, two grades of
existence. They can exist as fully-fledged members of a dataset, or
they can have a fleeting existence as simply anonymous arrays of
numbers that are of the same length as dataset series. Consider the
statement
\begin{code}
series rootx1 = sqrt(x+1)
\end{code}
On the right-hand side we have the ``series'' \texttt{x+1}, which is
called into existence as part of a calculation but has no name and
cannot have string values. Similarly, consider
\begin{code}
series grade = letter_grade(x)
\end{code}
The return value from \verb|letter_grade()| is likewise an anonymous
array,\footnote{A proper named series, with string values, existed
  while the function was executing but it ceased to exist as soon as
  the function was finished.} incapable of holding string values
\textit{until} it gets assigned to the named series
\texttt{grade}. The solution is to define \texttt{grade} as a series,
at the level of the caller, before calling \verb|letter_grade()|, as
in
%
\begin{code}
function void letter_grade (series x, series *grade)
  # define grade based on x and stringify it
  # this version will work!
end function

# caller
...
series grade
letter_grade(x, &grade)
\end{code}
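
For concreteness, here is one way the body of the pointer-based
function might be filled in. This is a sketch that simply reuses the
grading code from section~\ref{sec:stringify}; it assumes the caller
has a series \texttt{x} of percentage scores and that \texttt{grade}
is not already string-valued when the function is called.
%
\begin{code}
function void letter_grade (series x, series *grade)
  grade = 1          # F, the least value
  grade += x >= 60   # D
  grade += x >= 70   # C
  grade += x >= 80   # B
  grade += x >= 90   # A
  stringify(grade, strsplit("F D C B A"))
end function

# caller: declare the series first, then pass it in pointer form
series grade
letter_grade(x, &grade)
print grade --byobs
\end{code}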

As you'll see from the account above, we don't offer any very fancy
facilities for string-valued series. We'll read them from suitable
sources and we'll create them natively via \cmd{stringify}---and
we'll try to ensure that they retain their integrity---but we don't,
for example, take the specification of a string-valued series as a
regressor as an implicit request to include the dummification of its
distinct values. Besides laziness, this reflects the fact that in
gretl a string-valued series \textit{may} be usable ``as is'',
depending on how it's defined; you can use \cmd{dummify} if you
need it.

\section{Other import formats}
\label{sec:other-imports}

In section~\ref{sec:reading} we illustrated the reading of
string-valued series with reference to a delimited text data
file. Gretl can also handle several other sources of string-valued
data, including the spreadsheet formats \texttt{xls}, \texttt{xlsx},
\texttt{gnumeric} and \texttt{ods} and (to a degree) the formats of
\textsf{Stata}, \textsf{SAS} and \textsf{SPSS}.

\subsection{Stata files}

Stata supports two relevant sorts of variables: (1) those that are of
``string type'' and (2) variables of one or other numeric type that
have ``value labels'' defined. Neither of these is exactly equivalent
to what we call a ``string-valued series'' in gretl.

Stata variables of string type have no numeric representation; their
values are literally strings, and that's all. Stata's numeric
variables with value labels do not have to be integer-valued and their
least value does not have to be 1; however, you can't define a label
for a value that is not an integer. Thus in Stata you can have a
series that comprises both integer and non-integer values, but only
the integer values can be labeled.\footnote{Verified in Stata 12.}

This means that on import to gretl we can readily handle variables of
string type from Stata's \texttt{dta} files. We give them a 1-based
numeric encoding; this is arbitrary but does not conflict with any
information in the \texttt{dta} file. On the other hand, in general
we're not able to handle Stata's numeric variables with value labels;
currently we report the value labels to the user but do not attempt to
store them in the gretl dataset. We could check such variables and
import them as string-valued series if they satisfy the criteria
stated in section~\ref{sec:stringify} but we don't at present.

\subsection{SAS and SPSS files}

Gretl is able to read and preserve string values associated with
variables from SAS ``export'' (\texttt{xpt}) files, and also from SPSS
\texttt{sav} files. Such variables seem to be on the same pattern as
Stata variables of string type.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: