<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>NoSQL: Data formats</TITLE>
<LINK HREF="NoSQL-3.html" REL=next>
<LINK HREF="NoSQL-1.html" REL=previous>
<LINK HREF="NoSQL.html#toc2" REL=contents>
</HEAD>
<BODY BGCOLOR="#fff0e0">
<A HREF="NoSQL-3.html">Next</A>
<A HREF="NoSQL-1.html">Previous</A>
<A HREF="NoSQL.html#toc2">Contents</A>
<HR>
<H2><A NAME="s2">2. Data formats</A> </H2>
<H2><A NAME="sec-dataformats"></A> <A NAME="ss2.1">2.1 NoSQL table (relation) structure.</A>
</H2>
<P>A table (or <EM>relation</EM>) is an ordinary ASCII file, with
some additional rules that make it possible to use it as a database
table. The file has records (rows) and fields (columns).
The relation, or table, structure is achieved by separating
the columns with ASCII TAB characters and terminating the
rows with ASCII NEWLINE characters. That is, each row of data
in a file contains the data values (the data fields), separated
by TAB characters, and is terminated with a NEWLINE character.
Therefore a fundamental rule is that data values must NOT
contain TAB characters.
<P>The first section of the file, called the header, contains the
file structure information used by the operators. The rest of the
file, called the body, contains the actual data values. A file of
data, so structured, is said to be a 'table'.
<P>The header consists of exactly two lines that contain the structure
information: the column name row and the <EM>dashline</EM>.
The fields in the column name row contain the names of each column,
and are separated from each other by a single TAB character.
The dashline is a set of dashed lines, one set for each column,
separated by single TAB characters. The dashline signals the start
of the actual data rows and its sole purpose is to make the
header visually easy to find.
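<P>To make the layout concrete, here is a minimal Python sketch that
splits a table into its column names and data rows. It is only an
illustration, not part of NoSQL: the function name <EM>parse_table</EM>
is made up for this example, and the data is taken from the first rows
of the SAMPLE table used in this section.
<BLOCKQUOTE><CODE>
<PRE>
```python
def parse_table(text):
    """Split a table into (column_names, rows); every value stays a string."""
    lines = text.split("\n")
    # The file ends with a NEWLINE, so drop the empty piece after the last one.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) in (0, 1):
        raise ValueError("a table needs a column name row and a dashline")
    names = lines[0].split("\t")
    dashes = lines[1].split("\t")
    # The dashline must hold one non-empty run of dashes per column.
    bad = any(d == "" or d.strip("-") != "" for d in dashes)
    if bad or len(dashes) != len(names):
        raise ValueError("malformed dashline")
    # Everything after the two header lines is the body.
    rows = [line.split("\t") for line in lines[2:]]
    return names, rows


sample = (
    "NAME\tCOUNT\tTYP\tAMT\n"
    "----\t-----\t---\t---\n"
    "Bush\t44\tA\t133\n"
    "Hansen\t44\tA\t23\n"
)
names, rows = parse_table(sample)
print(names)
print(rows[0])
```
</PRE>
</CODE></BLOCKQUOTE>
<P>Note that the sketch relies only on TABs and NEWLINEs, which is
why data values must never contain a physical TAB.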
<P>The column names are case sensitive, i.e. 'COUNT' is different
from 'Count'. The guideline for characters that may be used in
column names is that alphabetic, numeric, and the underscore (_)
are good choices. Numeric-only column names are not allowed.
No rows, except the dashline, should contain only dashes and TABs.
<P>The TAB character must
never be used in column names, nor should spaces or
UNIX I/O redirection characters (<,>,|) be used.
To be on the safe side, column names should always start with a
letter and contain only upper and lower case letters, numbers and
the underscore (_).
The following names are reserved by the awk programming
language, and should not be used as column names:
<P><EM>BEGIN, END, break, continue, else, exit, exp, for, getline, gsub,
if, in, index, int, length, log, next, print, printf, split, sprintf,
sqrt, sub, substr, while</EM>, and possibly others. Refer to
the mawk(1) man page. Furthermore, the '_nosql_'
prefix is reserved for NoSQL internal use, and should never be used
at the beginning of column names.
<P>For instance, suppose you have a
table that maps names to nicknames; its two columns could be
called <EM>Name</EM> and <EM>Nickname</EM>. Some NoSQL operators
create new columns that have the same names as pre-existing
table columns, with lower-case letters prepended to them.
This is why you <EM>really</EM> should stick to these rules.
<P>Not abiding by these naming rules may still work, but there
may be unexpected results.
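<P>The conservative naming rules above can be checked mechanically.
The following Python sketch is one possible way to do it; the helper
<EM>is_safe_column_name</EM> is hypothetical and is not a NoSQL
operator, and the reserved-word list is only the one given above,
which may be incomplete.
<BLOCKQUOTE><CODE>
<PRE>
```python
import re

# Names listed above as reserved by awk (there may be others).
AWK_RESERVED = {
    "BEGIN", "END", "break", "continue", "else", "exit", "exp", "for",
    "getline", "gsub", "if", "in", "index", "int", "length", "log", "next",
    "print", "printf", "split", "sprintf", "sqrt", "sub", "substr", "while",
}

# Conservative rule: start with a letter, then letters, digits or underscores.
SAFE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_safe_column_name(name):
    """Return True if name follows the conservative naming rules."""
    if SAFE_NAME.fullmatch(name) is None:
        return False
    if name in AWK_RESERVED:
        return False
    # Already excluded by the pattern above, but kept to document the rule.
    return not name.startswith("_nosql_")


for candidate in ["Name", "Nickname", "123", "my col", "print", "_nosql_x"]:
    print(candidate, is_safe_column_name(candidate))
```
</PRE>
</CODE></BLOCKQUOTE>
<P>Because the pattern requires a leading letter, numeric-only names
and names containing spaces, TABs or redirection characters are
rejected as a side effect.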
<P>A sample table (named SAMPLE) that will be used in later
examples is shown in Table 1. The picture in Table 1 is for
illustrative purposes; what the file would actually look like
is shown in Table 2, where a TAB character is represented by
'<T>' and a NEWLINE character is represented by '<N>'.
<P>
<BLOCKQUOTE><CODE>
<PRE>
                Table 1
             table (SAMPLE)

NAME    COUNT   TYP     AMT
----    -----   ---     ---
Bush    44      A       133
Hansen  44      A       23
Jones   77      X       77
Perry   77      B       244
Hart    77      D       1111
Holmes  65      D       1111

Table 2
table (SAMPLE) actual content
NAME<T>COUNT<T>TYP<T>AMT<N>
----<T>-----<T>---<T>---<N>
Bush<T>44<T>A<T>133<N>
Hansen<T>44<T>A<T>23<N>
Jones<T>77<T>X<T>77<N>
Perry<T>77<T>B<T>244<N>
Hart<T>77<T>D<T>1111<N>
Holmes<T>65<T>D<T>1111<N>
</PRE>
</CODE></BLOCKQUOTE>
<P>It is important to note that only actual data is stored in the
data fields, with no leading or trailing space characters. This
fact can (and usually does) have a major effect on the size of
the resulting datafiles (tables) compared to data stored in
"fixed field width" systems. The datafiles in NoSQL are almost
always smaller, sometimes dramatically smaller.
<P>A table can also be represented in a different format, called
<EM>'list format'</EM>. The <EM>list</EM> format of the above
SAMPLE table is:
<P>
<BLOCKQUOTE><CODE>
<PRE>

NAME    Bush
COUNT   44
TYP     A
AMT     133

NAME    Hansen
COUNT   44
TYP     A
AMT     23

NAME    Jones
COUNT   77
TYP     X
AMT     77

NAME    Perry
COUNT   77
TYP     B
AMT     244

NAME    Hart
COUNT   77
TYP     D
AMT     1111

NAME    Holmes
COUNT   65
TYP     D
AMT     1111

</PRE>
</CODE></BLOCKQUOTE>
<P>The actual contents of a table in 'list' format, showing newlines
and TABs, are:
<P>
<BLOCKQUOTE><CODE>
<PRE>
<N>
NAME<T>Bush<N>
COUNT<T>44<N>
TYP<T>A<N>
AMT<T>133<N>
<N>
NAME<T>Hansen<N>
COUNT<T>44<N>
TYP<T>A<N>
AMT<T>23<N>
<N>
NAME<T>Jones<N>
COUNT<T>77<N>
TYP<T>X<N>
AMT<T>77<N>
<N>
NAME<T>Perry<N>
COUNT<T>77<N>
TYP<T>B<N>
AMT<T>244<N>
<N>
NAME<T>Hart<N>
COUNT<T>77<N>
TYP<T>D<N>
AMT<T>1111<N>
<N>
NAME<T>Holmes<N>
COUNT<T>65<N>
TYP<T>D<N>
AMT<T>1111<N>
<N>
</PRE>
</CODE></BLOCKQUOTE>
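<P>The relationship between the two formats can be sketched in a few
lines of Python. This is only an illustration of the layout shown
above, not the code of any NoSQL operator; the function name
<EM>table_to_list</EM> is an assumption of this example, and escapes
inside data values are passed through untouched.
<BLOCKQUOTE><CODE>
<PRE>
```python
def table_to_list(names, rows):
    """Render rows in 'list' format: one name/value line per field."""
    out = ["\n"]                      # the list starts with an empty line
    for row in rows:
        for name, value in zip(names, row):
            out.append(name + "\t" + value + "\n")
        out.append("\n")              # an empty line closes each record
    return "".join(out)


names = ["NAME", "COUNT", "TYP", "AMT"]
rows = [["Bush", "44", "A", "133"], ["Hansen", "44", "A", "23"]]
print(table_to_list(names, rows), end="")
```
</PRE>
</CODE></BLOCKQUOTE>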
<P>Long lines, i.e. lines that are too long to fit in the width of
the screen, may be folded over multiple rows in the 'list'
format, provided that each continuation row starts with one or
more spaces (blanks, not TABs). Each field (column) name must
be separated from its associated data by exactly one
TAB character. The data part may contain physical TABs and
newlines, which the 'listtotable' operator turns into '\t' and '\n'
escapes respectively when the list is
converted into a table. For example:
<P>
<BLOCKQUOTE><CODE>
<PRE>
COMMENTS This is a very looong comment, that I want to fold over
         multiple lines.
</PRE>
</CODE></BLOCKQUOTE>
<P>and the actual content is:
<P>
<BLOCKQUOTE><CODE>
<PRE>
<N>
COMMENTS<T>This is a very looong comment, that I want to fold over<N>
<T>multiple lines.<N>
<N>
</PRE>
</CODE></BLOCKQUOTE>
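<P>As a rough illustration of the conversion just described, the
Python sketch below turns a list into a table. It is <EM>not</EM> the
'listtotable' operator, and it rests on several assumptions of mine:
a continuation row is any row that starts with a space or a TAB, the
folded text is kept verbatim and then escaped, the column order comes
from the first record, and each dash run is as long as its column name.
<BLOCKQUOTE><CODE>
<PRE>
```python
def escape(value):
    """Replace physical TABs and newlines with two-character escapes."""
    return value.replace("\t", "\\t").replace("\n", "\\n")


def list_to_table(text):
    """Convert 'list' format text into table text (illustrative only)."""
    records = []
    current = []                      # [name, value] pairs of one record
    for line in text.split("\n"):
        if line == "":                # an empty line ends the record
            if current:
                records.append(current)
                current = []
        elif line[0] in " \t":        # assumed: continuation of last field
            if not current:
                raise ValueError("continuation row without a field")
            current[-1][1] += "\n" + line
        else:                         # name, one TAB, then the data part
            name, _, value = line.partition("\t")
            current.append([name, value])
    if current:
        records.append(current)
    if not records:
        raise ValueError("no records found")
    names = [name for name, _ in records[0]]
    out = ["\t".join(names), "\t".join("-" * len(n) for n in names)]
    for record in records:
        values = dict(record)
        out.append("\t".join(escape(values.get(n, "")) for n in names))
    return "\n".join(out) + "\n"


folded = "\nCOMMENTS\tThis is long,\n\tsecond line.\n\n"
print(list_to_table(folded), end="")
```
</PRE>
</CODE></BLOCKQUOTE>
<P>With this reading, the folded COMMENTS value ends up on a single
table row, with the line break and the leading TAB stored as escapes.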
<P>As we will see, there are NoSQL operators that convert back and
forth between 'table' and 'list' formats.
<P>It is suggested, though not required, that table file names be
given the filename extension '.rdb', to make them recognizable
right away.
<H2><A NAME="sec-datatypes"></A> <A NAME="ss2.2">2.2 NoSQL and Data-Types.</A>
</H2>
<P>Unlike most other database systems, NoSQL knows nothing
about data-types. Everything is just a string, that occurs
between one TAB character and the next one. This was done on purpose,
of course, as NoSQL tables can be accessed in a number of ways,
even directly with a text editor. NoSQL has no way of enforcing
any data-typing that we may possibly establish, so why bother
about types at all? This model goes well with the
plethora of text utilities that come with most Unices, and with Linux
in particular, and is a very natural way of representing
data, more on the human-level than other conventions. The drawback
is that it is up to the application to enforce datatypes if necessary.
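<P>What "enforcing datatypes in the application" may look like is
sketched below in Python. The helper <EM>typed_rows</EM> and the
choice of which columns are integers are assumptions made for this
example; NoSQL itself sees only strings.
<BLOCKQUOTE><CODE>
<PRE>
```python
def typed_rows(names, rows, types):
    """Yield one dict per row, converting the columns named in types."""
    for row in rows:
        record = dict(zip(names, row))
        for column, convert in types.items():
            # A value that does not fit (e.g. int("abc")) raises ValueError.
            record[column] = convert(record[column])
        yield record


names = ["NAME", "COUNT", "TYP", "AMT"]
rows = [["Bush", "44", "A", "133"], ["Hansen", "44", "A", "23"]]
records = list(typed_rows(names, rows, {"COUNT": int, "AMT": int}))
total = sum(r["AMT"] for r in records)
print(total)   # 156
```
</PRE>
</CODE></BLOCKQUOTE>
<P>Columns not mentioned in the mapping, such as NAME and TYP here,
simply remain strings.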
<P>As I have already pointed out, NoSQL should be seen just as
a simple <EM>data-dictionary toolkit</EM>.
Its main purpose is to attach names to slices of an
otherwise flat-data file. Having a dictionary means
that you can reference individual pieces of data by name
rather than by their physical position in the file,
thus attaining a basic level of information abstraction.
<P>A table column can contain anything except physical tabs and
newlines. The data itself can be anything that is considered
to be text according to the local character-set (mine is iso-8859-1,
or Latin1). A field can even contain an entire text-encoded
file (a <EM>BLOB</EM>). Common encodings are <EM>uuencode</EM>,
<EM>base64</EM> and <EM>quoted-printable</EM>.
Large fields may of course break AWK or the other utilities,
but that must be seen as a limitation in those programs or
in the operating system, not something pertaining to the paradigm.
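<P>As an example of the text-encoding idea, the Python sketch below
stores arbitrary bytes in a field using <EM>base64</EM>. The function
names are made up for this illustration; the point is only that the
encoded text contains no physical TABs or newlines, so it fits in a
single field.
<BLOCKQUOTE><CODE>
<PRE>
```python
import base64


def encode_blob(data):
    """Encode bytes as one line of base64 text, safe to store in a field."""
    # b64encode inserts no line breaks, unlike some other encoders.
    return base64.b64encode(data).decode("ascii")


def decode_blob(field):
    """Recover the original bytes from a base64 field."""
    return base64.b64decode(field)


payload = b"line one\nline two\twith a tab\x00\xff"
field = encode_blob(payload)
row = "\t".join(["logo", field]) + "\n"    # a two-column data row
print(decode_blob(field) == payload)
```
</PRE>
</CODE></BLOCKQUOTE>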
<P>A valid NoSQL table must always contain the header. Keeping
the header in a separate file is possible but strongly discouraged.
<P>Table editing/writing/locking/unlocking/versioning should not be
seen as core NoSQL features, but simply add-on facilities.
In real applications the locking policy may become quite
complicated and should be provided by the
application program itself, according to its needs. The same
is true for modifying/versioning a table and ensuring
overall database consistency.
<P>The structure of a NoSQL table is record-oriented, so that it
can easily be acted upon with the wealth of existing Unix
utilities, which are mostly record-oriented. This does not mean
that a table cannot map a more complicated structure, like an XML
document or any other hierarchical tree-like structure.
Such <EM>higher-order</EM> dictionaries will not pertain to the
paradigm though, but rather to the application that uses the table.
<H2><A NAME="ss2.3">2.3 Notes on similar database packages.</A>
</H2>
<P>Besides NoSQL and RDB there are other UNIX DBMSs, both commercial
and free, that are based on ASCII tables. A commercial
implementation is /rdb, by
<A HREF="http://www.rsw.com">Revolutionary Software</A>,
while among the free ones there are
<A HREF="http://cfa-www.harvard.edu/~john/starbase/starbase.html">Starbase</A>, developed at the Harvard
Smithsonian Astrophysical Observatory, and Gunnar Stefansson's
<EM>reldb</EM>, a collection of interesting tools
available at sites that
carry archives of the <EM>comp.sources.unix</EM> Usenet newsgroup.
<P>The ASCII table format of those database engines is very
close to that of NoSQL, therefore data can easily be converted
back and forth between them and NoSQL.
<HR>
<A HREF="NoSQL-3.html">Next</A>
<A HREF="NoSQL-1.html">Previous</A>
<A HREF="NoSQL.html#toc2">Contents</A>
</BODY>
</HTML>