1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
|
\chapter{Character Array}
\label{cha:character-array}
\declaremodule{extension}{numarray.strings}
\index{character array}
\index{string array}
\section{Introduction}
\label{sec:chararray-intro}
\code{numarray}, like \code{Numeric}, has support for arrays of character data
(provided by the \code{numarray.strings} module) in addition to arrays of
numbers. The support for character arrays in \code{Numeric} is relatively
limited, restricted to arrays of single characters. In contrast,
\code{numarray} supports arrays of fixed length strings. As an additional
enhancement, the \code{numarray} design supports interleaving arrays of
characters with arrays of numbers, with both occupying the same memory buffer.
This provides basic infrastructure for building the arrays of heterogenous
records as provided by \code{numarray.records} (see chapter
\ref{cha:record-array}). Currently, neither \code{Numeric} nor \code{numarray}
provides support for unicode.
Each character array is a \index{CharArray} \code{CharArray} object in the
\code{numarray.strings} module. The easiest way to construct a character array
is to use the \code{numarray.strings.array()} function. For example:
\begin{verbatim}
>>> import numarray.strings as str
>>> s = str.array(['Smith', 'Johnson', 'Williams', 'Miller'])
>>> print s
['Smith', 'Johnson', 'Williams', 'Miller']
>>> s.itemsize()
8
\end{verbatim}
In this example, this string array has 4 elements. The maximum string length
is automatically determined from the data. In this case, the created array will
support fixed length strings of 8 characters (since the longest name is 8
characters long).
The character array is just like an array in numarray, except that now each
element is conceptually a Python string rather than a number. We can do the
usual indexing and slicing:
\begin{verbatim}
>>> print s[0]
'Smith'
>>> print s[:2]
['Smith', 'Johnson']
>>> s[:2] = 'changed'
>>> print s
['changed', 'changed', 'Williams', 'Miller']
\end{verbatim}
\section{Character array stripping, padding, and truncation}
\label{sec:chararray-clip-pad-truncate}
CharArrays are designed to store fixed length strings of visible ASCII text.
You may have noticed that although a \code{CharArray} stores fixed length
strings, it displays variable length strings. This is a result of the
stripping and padding policies of the CharArray class.
When an element of a \code{CharArray} is fetched trailing whitespace is
stripped off. The sole exception to this rule is that a single whitespace is
never stripped down to the empty string. \code{numarray.strings} defines
whitespace as an ASCII space, formfeed, newline, carriage return, tab, or
vertical tab.
When a string is assigned to a \code{CharArray}, the string is considered
terminated by the first of any NULL characters it contains and is padded with
spaces to the full length of the \code{CharArray} itemsize. Thus, the memory
image of a \code{CharArray} element does not include anything at or after the
first NULL in an assigned string; instead, there are spaces, and no terminating
NULL character at all.
When a string which is longer than the \code{itemsize()} is assigned to a
\code{CharArray}, it is silently truncated.
The \code{RawCharArray} baseclass of \code{CharArray} implements transparent
\code{strip()} and \code{pad()} methods, enabling the storage and retrieval of
arbitrary ASCII values within array elements. For \code{RawCharArray}, all
array elements are identical in percieved length. Alternate
stripping and padding policies can be implemented by subclassing
\code{CharArray} or \code{RawCharArray}.
\section{Character array functions}
\label{sec:chararray-func}
\begin{funcdesc}{array}{buffer=None, itemsize=None, shape=None, byteoffset=0,
bytestride=None, kind=CharArray}
\label{func:str.array}
The function \code{array} is, for most practical purposes, all a user needs
to know to construct a character array.
The first argument, \code{buffer}, may be any one of the following:
(1) \code{None} (default). The constructor will allocate a writeable memory
buffer which will be uninitialized. The user must assign valid data before
trying to read the contents or before writing the character array to a disk
file.
(2) a Python string containing binary data. For example:
\begin{verbatim}
>>> print str.array('abcdefg'*10, itemsize=10)
['abcdefgabc', 'defgabcdef', 'gabcdefgab', 'cdefgabcde', 'fgabcdefga',
'bcdefgabcd', 'efgabcdefg']
\end{verbatim}
(3) a Python file object for an open file. The data will be copied from
the file, starting at the current position of the read pointer.
(4) a character array. This results in a deep copy of the input character
array; any other arguments to \code{array()} will be silently ignored.
\begin{verbatim}
>>> print str.array(s)
['abcdefgabc', 'defgabcdef', 'gabcdefgab', 'cdefgabcde', 'fgabcdefga',
'bcdefgabcd', 'efgabcdefg']
\end{verbatim}
(5) a nested sequence of strings. The sequence nesting implies the
shape of the string array unless shape is specified.
\begin{verbatim}
>>> print str.array([['Smith', 'Johnson'], ['Williams', 'Miller']])
[['Smith', 'Johnson'],
['Williams', 'Miller']]
\end{verbatim}
\code{itemsize} can be used to increase or decrease the fixed size of an
array element relative to the natural itemsize implied by any literal data
specified by the \code{buffer} parameter.
\begin{verbatim}
>>> print str.array([['Smith', 'Johnson'], ['Williams', 'Miller']],
itemsize=2)
[['Sm', 'Jo'],
['Wi', 'Mi']])
>>> print str.array([['Smith', 'Johnson'], ['Williams', 'Miller']],
itemsize=20)
[['Smith', 'Johnson'],
['Williams', 'Miller']]
\end{verbatim}
\code{shape} is the shape of the character array. It can be an integer, in
which case it is equivalent to the number of \var{rows} in a table. It can
also be a tuple implying the character array is an N-D array with fixed
length strings as its elements. \code{shape} should be consistent with
the number of elements implied by the data buffer and itemsize.
\code{byteoffset} indicates an offset, specified in bytes, from the start
of the array buffer to where the array data actually begins.
\code{byteoffset} enables the character array to be offset from the
beginning of a table record. This is mainly useful for implementing
record arrays.
\code{bytestride} indicates the separation, specified in bytes, between
successive elements in the last dimension of the character array.
\code{bytestride} is used in the implementation of record arrays to space
character array elements with the size of the total record rather than the
size of a single string.
\code{kind} is used to specify the class of the created array, and should be
\code{RawCharArray}, \code{CharArray}, or a subclass of either.
\end{funcdesc}
\begin{funcdesc}{num2char}{n, format, itemsize=32}
\label{func:str.num2char}
\code{num2char} formats the numarray \code{n} using the Python string format
\code{format} and stores the result in a character array with the specified
\code{itemsize}
\begin{verbatim}
>>> num2char(num.arange(0.0,5), '%2.2f')
CharArray(['0.00', '1.00', '2.00', '3.00', '4.00'])
\end{verbatim}
\end{funcdesc}
\section{Character array methods}
\label{sec:recarray-methods}
CharArray object has these public methods:
\begin{methoddesc}[RawCharArray]{tolist}{}
\code{tolist()} returns a nested list of strings corresponding to all the
elements in the array.
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{copy}{}
\code{copy()} returns a deep copy of the character array.
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{raw}{}
\code{raw()} returns the corresponding \code{RawCharArray} view.
\begin{verbatim}
>>> c=str.array(["this","that","another"])
>>> c.raw()
RawCharArray(['this ', 'that ', 'another'])
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}{resized}{n, fill=' '}
\code{resized(n)} returns a copy of the array, resized so that each element
is of length \code{n} characters. Extra characters are filled with value
\code{fill}. Caution: do not confuse this method with \code{resize()} which
changes the number of elements rather than the size of each element.
\begin{verbatim}
>>> c = str.array(["this","that","another"])
>>> c.itemsize()
7
>>> d = c.resized(20)
>>> print d
['this', 'that', 'another']
>>> d.itemsize()
20
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{concatenate}{other}
\code{concatenate(other)} returns a new array which corresponds to the
element by element concatenation of \code{other} to \code{self}. The
addition operator is also overloaded to perform concatenation.
\begin{verbatim}
>>> print map(str, range(3)) + array(["this","that","another one"])
['0this', '1that', '2another one']
>>> print "prefix with trailing whitespace " + array(["."])
['prefix with trailing whitespace .']
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{sort}{}
\code{sort} modifies the \code{CharArray} inplace so that its elements are in
sorted order. \code{sort} only works for 1D character arrays. Like the
\code{sort()} for the Python list, \code{CharArray.sort()} returns nothing.
\begin{verbatim}
>>> a=str.array(["other","this","that","another"])
>>> a.sort()
>>> print a
['another', 'other', 'that', 'this']
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{argsort}{}
\code{argsort} returns a numarray corresponding to the permutation which will
put the character array \code{self} into sorted order. \code{argsort} only
works for 1D character arrays.
\begin{verbatim}
>>> a=str.array(["other","that","this","another"])
>>> a.argsort()
array([3, 0, 1, 2])
>>> print a[ a.argsort ]
['another', 'other', 'that', 'this']
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{amap}{f}
\code{amap} applies the function \code{f} to every element of \code{self} and
returns the nested list of the results. The function \code{f} should operate
on a single string and may return any Python value.
\end{methoddesc}
\begin{verbatim}
>>> c = str.array(['this','that','another'])
>>> print c.amap(lambda x: x[-2:])
['is', 'at', 'er']
\end{verbatim}
\begin{methoddesc}[RawCharArray]{match}{pattern, flags=0}
\code{match} uses Python regular expression matching over all elements of a
character array and returns a tuple of numarrays corresponding to the indices
of \code{self} where the pattern matches. \code{flags} are passed directly to
the Python pattern matcher defined in the \code{re} module of the standard
library.
\begin{verbatim}
>>> a=str.array([["wo","what"],["wen","erewh"]])
>>> print a.match("wh[aebd]")
(array([0]), array([1]))
>>> print a[ a.match("wh[aebd]") ]
['what']
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{search}{pattern,flags=0}
\code{search} uses Python regular expression searching over all elements of a
character array and returns a tuple of numarrays corresponding to the indices
of \code{self} where the pattern was found. \code{flags} are passed directly
to the Python pattern \code{search} method defined in the \code{re} module of
the standard library. \code{flags} should be an or'ed combination (use the
$\vert$ operator) of the following \code{re} variables: \code{IGNORECASE},
\code{LOCALE}, \code{MULTILINE}, \code{DOTALL}, \code{VERBOSE}. See the
\code{re} module documentation for more details.
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{sub}{pattern,replacement,flags=0,count=0}
\code{sub} performs Python regular expression pattern substitution
to all elements of a character array. \code{flags} and \code{count} work
as they do for \code{re.sub()}.
\begin{verbatim}
>>> a=str.array([["who","what"],["when","where"]])
>>> print a.sub("wh", "ph")
[['pho', 'phat'],
['phen', 'phere']])
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{grep}{pattern, flags=0}
\code{grep} is intended to be used interactively to search a \code{CharArray}
for the array of strings which match the given \code{pattern}.
\code{pattern} should be a Python regular expression (see the \code{re}
module in the Python standard library, which can be as simple as a string
constant as shown below.
\begin{verbatim}
>>> a=str.array([["who","what"],["when","where"]])
>>> print a.grep("whe")
['when', 'where']
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{eval}{}
\code{eval} executes the Python eval function on each element of a character
array and returns the resulting numarray. \code{eval} is intended for use
converting character arrays to the corresponding numeric arrays. An
exception is raised if any string element fails to evaluate.
\begin{verbatim}
>>> print str.array([["1","2"],["3","4."]]).eval()
[[1., 2.],
[3., 4.]]
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{maxLen}{}
\code{maxLen} returns the minimum element length required to store the
stripped elements of the array \code{self}.
\begin{verbatim}
>>> print str.array(["this","there"], itemsize=20).maxLen()
5
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{truncated}{}
\code{truncated} returns an array corresponding to \code{self} resized
so that it uses a minimum amount of storage.
\begin{verbatim}
>>> a = str.array(["this ","that"])
>>> print a.itemsize()
6
>>> print a.truncated().itemsize()
4
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{count}{s}
\code{count} counts the occurences of string \code{s} in array \code{self}.
\begin{verbatim}
>>> print array(["this","that","another","this"]).count("this")
2
\end{verbatim}
\end{methoddesc}
\begin{methoddesc}[RawCharArray]{info}{}
This will display key attributes of the character array.
\end{methoddesc}
%% Local Variables:
%% mode: LaTeX
%% mode: auto-fill
%% fill-column: 79
%% indent-tabs-mode: nil
%% ispell-dictionary: "american"
%% reftex-fref-is-default: nil
%% TeX-auto-save: t
%% TeX-command-default: "pdfeLaTeX"
%% TeX-master: "numarray"
%% TeX-parse-self: t
%% End:
|