File: NoSQL-5.html

package info (click to toggle)
nosql 3.1-4
links: PTS
area: main
in suites: woody
size: 1,448 kB
ctags: 267
sloc: cpp: 1,028; ansic: 915; awk: 732; perl: 502; tcl: 292; sh: 289; makefile: 44
file content (233 lines) | stat: -rw-r--r-- 9,080 bytes
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
 <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
 <TITLE>NoSQL: Generating or modifing NoSQL tables</TITLE>
 <LINK HREF="NoSQL-6.html" REL=next>
 <LINK HREF="NoSQL-4.html" REL=previous>
 <LINK HREF="NoSQL.html#toc5" REL=contents>
</HEAD>
<BODY BGCOLOR="#fff0e0">
<A HREF="NoSQL-6.html">Next</A>
<A HREF="NoSQL-4.html">Previous</A>
<A HREF="NoSQL.html#toc5">Contents</A>
<HR>
<H2><A NAME="s5">5. Generating or modifing NoSQL tables</A>  </H2>

<H2><A NAME="ss5.1">5.1 Generating new tables</A>
    </H2>

<P>Any editor may be used to construct or modify a table,
since it is a regular UNIX file, and this 'direct editing'
method is occasionally used, especially for small amounts
of data.  However, avoid using an editor that destroys
TAB characters.
<P>To generate a new table the best plan (and usually the
safest one) is to first generate a template file, then
convert it to table format and add the rows of data.
Any convenient editor may be used to generate a template
file.  To convert it to a table the command 'maketable'
may be used, which will produce an empty table.
Next use the operator 'edittable' to enter rows of data.
<P>A typical template file is shown below:
<P>
<BLOCKQUOTE><CODE>
<PRE>

      # These are lines of table documentation. They can be of any length,
      # and any number of such lines may exist.
      # Each line must start correctly, e.g with "#", surrounded by any
      # number of spaces and/or TABs.

      Name      Name of item
      Type      Type: 1,2,3,7,9,A,B,X
      Count     Number of items
      K         Constant modifier
      SS7       Special status for type 7
      Size      In Kilobytes

    
</PRE>
</CODE></BLOCKQUOTE>
<P>The above template file
contains the necessary elements to describe a table of six
columns: table documentation (the comment lines that
each start with a sharp sign '#'), column name
("Name", "Type", "Count", ...), and column documentation for
each column (the text at the end of each  column line).
<P>To build the final table header, use the command:
<P>
<BLOCKQUOTE><CODE>
<PRE>
      maketable &lt; table.tpl
    
</PRE>
</CODE></BLOCKQUOTE>
<P>where <CODE>table.tpl</CODE> is the template file described above.
The command will produce the correct table header to STDOUT,
(that can be redirected to a file as usual) :
<P>
<BLOCKQUOTE><CODE>
<PRE>
                Name    Type    Count   K   SS7 Size
                ----    ----    -----   -   --- ----
    
</PRE>
</CODE></BLOCKQUOTE>
<H2><A NAME="ss5.2">5.2 Modifying existing tables</A>
    </H2>

<P>I will now explain how NoSQL tables can be modified, both manually and
with automated programs.
<P>Basically there are two ways to manually modify an existing
table: either direct editing with a TAB-conscious text editor, like
<EM>vi(1)</EM>, or by using the 'edittable' command. The latter
method is recommended, especially in a multi-user environment
where multiple concurrent editing sessions may occur against the
same table file.
<P>The 'edittable' utility can be used to add new rows, change
or delete existing rows of data in a table. The optional '--list'
switch tells 'edittable' to convert the table to 'list' format
for editing, which makes working with wide tables
much more comfortable. The editor called by 'edittable' defaults
to <EM>vi</EM>, or to whatever
is set in the environment variable EDITOR.
<P>After editing, the table is automatically checked for validity
by 'edittable', using 'istable'. If structure errors are
detected,
the program prompts you for what to do (re-edit, quit, etc.).
<H2><A NAME="sec-bigtables"></A> <A NAME="ss5.3">5.3 Big tables </A>
    </H2>

<P>One thing to point out here is that the only way to update
a table, both manually and with a program, is to capture the
SDTOUT stream of an operator and write the results back to a
table file, paying attention to not overwriting the original input
file while it is being read (familiarity with the Unix shell is
expected from the user). For instance, you have better not do:
<P>
<BLOCKQUOTE><CODE>
<PRE>
      operator &lt; table > table
    
</PRE>
</CODE></BLOCKQUOTE>
<P>or you will end up with 'table' being zero'ed! The correct procedure
is:
<P>
<BLOCKQUOTE><CODE>
<PRE>
      operator &lt; table > table.tmp
      mv table.tmp table
    
</PRE>
</CODE></BLOCKQUOTE>
<P>At the moment NoSQL does not provide a record-level update facility.
This may be inefficient if you need to
do frequent changes to a large table, for instance
with a CGI program behind a Web server, as the whole table needs to
be rewitten upon each change.
But don't be disappointed by this. There are ways to circumvent
this apparently major limitation of the NoSQL paradigm.
<P>The obvious "trick" is to try and keep your tables small. Silly
a suggestion as it may sound, before laughing at it please
have a look at section 2.9 of the included
<A HREF="4gl.ps">4gl.ps</A> or
<A HREF="4gl.txt">4gl.txt</A>
paper. Remember that NoSQL works <EM>with</EM> UNIX, not
<EM>in addition</EM> to it. The underlying UNIX file system,
that most commercial databases tend to disregard, if
used creatively can provide an extremely powerful way of
pre-organizing your tables (relations) efficiently and keep
them small (where small means roughly within a few hundred
kilobytes in size).
<P>If you really <EM>must</EM> do frequent modifications to a big indexed
table, then you can still do it efficiently by applying your
changes to a separate file rather than to the actual table, and
merging the changes back into the big table (as well as re-build
its indices) only every now and again, with a batch job that
can be run in the background, overnight or when the system activity
is low. The following example will try to explain this better.
<P>Suppose we have the large indexed table <CODE>bigtable.rdb</CODE> and
we need to insert/change/delete one or more records. What we can
do is:
<P>
<UL>
<LI>Create a <EM>table journaling buffer</EM>, say
<CODE>bigtable.tjb</CODE>, with exactly the same header as
<CODE>bigtable</CODE>, but containing only those records that we want
to insert/update into or remove from <CODE>bigtable.rdb</CODE>.
The entries in <CODE>bigtable.tjb</CODE> will have to be in a format
suitable for the 'update' operator.</LI>
<LI>Whenever we fetch data from <CODE>bigtable.rdb</CODE> we will have
to do it in three logical steps. The first step is to
use 'search' on
<CODE>bigtable</CODE> to take advantage of the indices. This will
produce an intermediate output that will then have to be merged
into <CODE>bigtable.tjb</CODE>, and the final output will undergo the
original query statement again.</LI>
<LI>Any updates to <CODE>bigtable.rdb</CODE> will rather be done
to <CODE>bigtable.tjb</CODE>, with the sintax described in the
documentation of the 'update' operator. Unlike <CODE>bigtable.rdb</CODE>,
where records must be kept sorted on the primary key field (i.e.
on the leftmost table column), the records in
<CODE>bigtable.rdb</CODE> do not need to be sorted in any particular
order.</LI>
</UL>
<P>This may seem complicated, but it isn't much, really.
Say <CODE>bigtable.rdb</CODE>
contains two columns, SSN (for Social Security No.) and SURNAME, where
the unique
table key is on SSN, and we have built a secondary index on the SURNAME
field. If we want to extract all the rows which SURNAME field is
equal to "Smith", the query necessary to take advantage of a fast
search method on <CODE>bigtable.rdb</CODE> and to account for the
presence of <CODE>bigtable.tjb</CODE> is:
<P>
<BLOCKQUOTE><CODE>
<PRE>
      printf 'SURNAME\n-------\nSmith\n' |
        search --index bigtable.rdb.x.SURNAME | update bigtable.tjb |
        row 'SURNAME=="Smith"'
    
</PRE>
</CODE></BLOCKQUOTE>
<P>As you can see, the trick is:
<P>
<UL>
<LI>Perform an indexed 'search' on <CODE>bigtable.rdb</CODE> to quickly
obtain a much smaller (possibly empty) subset of its data.</LI>
<LI>Merge the first output with <CODE>bigtable.tjb</CODE>
<EM>on-the-fly</EM> during the query.</LI>
<LI>Do a sequential post-query on the final output.</LI>
</UL>
<P>As shown, these logical steps can be performed in one single
line of UNIX shell code!
<H2><A NAME="ss5.4">5.4 Concatenating tables</A>
    </H2>

<P>The need to concatenate tables comes up every
so often and although it is simple to do it may not
be obvious.  The UNIX 'cat' command can not be used
as it would result in duplicating the header and
thus make an invalid table.  And of course, only
tables with the same header, i.e. tables that are said
to be ``union-compatible'', should be concatenated,
otherwise an invalid table would result (in
this case it could be a gross inconsistency if the number
of columns were different). If we have two tables,
TABA and TABB, then to concatenate TABB onto the end of
TABA we use the command:
<P>
<BLOCKQUOTE><CODE>
<PRE>
      union TABA TABB
    
</PRE>
</CODE></BLOCKQUOTE>
<HR>
<A HREF="NoSQL-6.html">Next</A>
<A HREF="NoSQL-4.html">Previous</A>
<A HREF="NoSQL.html#toc5">Contents</A>
</BODY>
</HTML>