
LOOKUP(1) LOOKUP(1)
April 22nd, 1994

NAME
lookup - interactive file search and display

SYNOPSIS
lookup [ args ] [ file ... ]

DESCRIPTION
Lookup allows quick interactive searching of text files. It
supports ASCII, JIS-ROMAN, and Japanese EUC packed-format
text, and has an integrated romaji→kana converter.

THIS MANUAL
Lookup is flexible enough for a variety of applications. This
manual will, however, focus on the application of searching Jim
Breen's edict (Japanese-English dictionary) and kanjidic
(kanji database). Being familiar with the content and format
of these files would be helpful. See the INFO section near the
end of this manual for information on how to obtain these
files and their documentation.

OVERVIEW OF MAJOR FEATURES
The following just mentions some major features to whet your
appetite to actually read the whole manual (-:

Romaji-to-Kana Converter
    Lookup can convert romaji to kana for you, even "on the fly"
    as you type.

Fuzzy Searching
    Searches can be a bit "vague" or "fuzzy", so that you'll be
    able to find "東京" even if you try to search for "ときょ"
    (the proper yomikata being "とうきょう").

Regular Expressions
    Uses the powerful and expressive regular expression for
    searching. One can easily specify complex searches that say,
    in effect, "I want lines that look like such-and-such, but
    not like this-and-that, but that also have this particular
    characteristic...."

Wildcard "Glob" Patterns
    Optionally, can use well-known filename wildcard patterns
    instead of full-fledged regular expressions.

Filters
    You can have lookup not list certain lines that would
    otherwise match your search, yet can optionally save them
    for quick review. For example, you could have all name-only
    entries from edict filtered from normal output.

Automatic Modifications
    Similarly, you can do a standard search-and-replace on lines
    just before they print, perhaps to remove information you
    don't care to see on most searches. For example, if you're
    generally not interested in kanjidic's info on Chinese
    readings, you can have them removed from lines before
    printing.

Smart Word-Preference Mode
    You can have lookup list only entries with whole words that
    match your search (as opposed to an embedded match, such as
    finding "the" inside "them"), but if no whole-word matches
    exist, it will go ahead and list any entry that matches the
    search.

Handy Features
    Other handy features include a dynamically settable and
    parameterized prompt, automatic highlighting of that part of
    the line that matches your search, an output pager,
    readline-like input with horizontal scrolling for long input
    lines, a ".lookup" startup file, automated programmability,
    and much more. Read on!

REGULAR EXPRESSIONS
Lookup makes liberal use of regular expressions (or regex for
short) in controlling various aspects of the searches. If you
are not familiar with the important concepts of regexes, read
the tutorial appendix of this manual before continuing.

JAPANESE CHARACTER ENCODING METHODS
Internally, lookup works with Japanese packed-format EUC, and
all files loaded must be encoded similarly. If you have files
encoded in JIS or Shift-JIS, you must first convert them to
EUC before loading (see the INFO section for programs that can
do this).

Interactive input and output encoding, however, may be
selected via the -jis, -sjis, and -euc invocation flags
(default is -euc), or by various commands to the program
(described later).

Make sure to use the encoding appropriate for your system. If
you're using kterm under the X Window System, you can use
lookup's -jis flag to match kterm's default JIS encoding. Or,
you might use kterm's "-km euc" startup option (or menu
selection) to put kterm into EUC mode. Also, I have found
kterm's scrollbar ("-sb -sl 500") to be quite useful.

With many "English" fonts in Japan, the character that normally
prints as a backslash (the half-width version of \) in The
States appears as a yen symbol (the half-width version of ¥).
How it will appear on your system is a function of what font
you use and what output encoding method you choose, which may
be different from the font and method that was used to print
this manual (both of which may be different from what's
printed on your keyboard's appropriate key). Make sure to
keep this in mind while reading.

STARTUP
Let's assume that your copy of edict is in ~/lib/edict. You
can start the program simply with

    lookup ~/lib/edict

You'll note that lookup spends some time building an index
before the default "lookup> " prompt appears.

Lookup gains much of its search speed by constructing an index
of the file(s) to be searched. Since building the index can
itself be time consuming, you can have lookup write the built
index to a file that can be quickly loaded the next time you
run the program. Index files will be given a ".jin" (Jeffrey's
Index) ending.

Let's build the indices for edict and kanjidic now:

    lookup -write ~/lib/edict ~/lib/kanjidic

This will create the index files

    ~/lib/edict.jin
    ~/lib/kanjidic.jin

and exit.

You can now re-start lookup, automatically using the
pre-computed index files, as:

    lookup ~/lib/edict ~/lib/kanjidic

You should then be presented with the prompt without having to
wait for the index to be constructed (but see the section on
Operating System concerns for possible reasons for delay).

INPUT
There are basically two types of input: searches and commands.
Commands do such things as tell lookup to load more files or
set flags. Searches report lines of a file that match some
search specifier (where the lines to search for are specified
by one or more regular expressions).

The input syntax may at first seem odd, but it has been
designed to be powerful and concise. A bit of time invested to
learn it well will pay off greatly when you need it.

BRIEF EXAMPLE
Assuming you've started lookup with edict and kanjidic as
noted above, let's try a few searches. In these examples, the

    "search [edict]> "

is the prompt. Note that the space after the '>' is part of
the prompt.

Given the input:

    search [edict]> tranquil

lookup will report all lines with the string "tranquil" in
them. There are currently about a dozen such lines, two of
which look like:

    安らか [やすらか] /peaceful (an)/tranquil/calm/restful/
    安らぎ [やすらぎ] /peace/tranquility/

Notice that lines with "tranquil" and "tranquility" matched?
This is because "tranquil" was embedded in the word
"tranquility". You could restrict the search to only the word
"tranquil" by prepending the special "start of word" symbol
'<' and appending the special "end of word" symbol '>' to the
regex, as in:

    search [edict]> <tranquil>

This is the regular expression that says "the beginning of a
word, followed by a 't', 'r', ..., 'l', which is at the end of
a word." The current version of edict has just three matching
entries.
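
If you'd like to see the embedded-versus-whole-word distinction outside of lookup, the following Python fragment is a minimal sketch of it. It assumes that lookup's '<' and '>' behave roughly like the \b word-boundary assertion of Python's re module, and the two sample lines are just the English glosses from the entries above, not real edict data.

```python
import re

# English glosses copied from the two sample entries (illustrative only).
lines = [
    "/peaceful (an)/tranquil/calm/restful/",
    "/peace/tranquility/",
]

def search(pattern, lines):
    """Return the lines matching pattern, ignoring case as lookup does by default."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if rx.search(line)]

# Plain substring-style search: "tranquil" is also found inside "tranquility".
embedded = search(r"tranquil", lines)

# Assumption: \b stands in for lookup's '<' and '>' word markers.
whole_word = search(r"\btranquil\b", lines)

print(len(embedded), len(whole_word))
```

The first search reports both lines, the second only the line where "tranquil" stands as a word of its own.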

Let's try another:

    search [edict]> fukushima

This is a search for the "English" fukushima -- ways to search
for kana or kanji will be explored later. Note that among the
several lines selected and printed are:

    副島 [ふくしま] /Fukushima (pn,pl)/
    木曽福島 [きそふくしま] /Kisofukushima (pl)/

By default, searches are done in a case-insensitive manner --
'F' and 'f' are treated the same by lookup, at least so far as
the matching goes. This is called case folding.

Let's give a command to turn this option off, so that 'f' and
'F' won't be considered the same. Here's an odd point about
lookup's input syntax: the default setting is that all command
lines must begin with a space. The space is the (default)
command-introduction character and tells the input parser to
expect a command rather than a search regular expression. It is
a common mistake at first to forget the leading space when
issuing a command. Be careful.

Try the command " fold" to report the current status of case
folding. Notice that as soon as you type the space, the prompt
changes to

    "lookup command> "

as a reminder that now you're typing a command rather than a
search specification.

    lookup command> fold

The reply should be "file #0's case folding is on".

You can actually turn it off with " fold off". Now try the
search for "fukushima" again. Notice that this time the
entries with "Fukushima" aren't listed? Now try the search
string "Fukushima" and see that the entries with "fukushima"
aren't listed.

Case folding is usually very convenient (it also makes
corresponding katakana and hiragana match the same), so don't
forget to turn it back on:

    lookup command> fold on

JAPANESE INPUT
Lookup has an automatic romaji→kana converter. A leading '/'
indicates that romaji is to follow. Try typing "/tokyo" and
you'll see it convert to "/ときょ" as you type. When you hit
return, lookup will list all lines that have a "ときょ"
somewhere in them. Well, sort of. Look carefully at the lines
which match. Among them (if you had case folding back on)
you'll see:

    キリスト教 [キリストきょう] /Christianity/
    東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
    凸鏡 [とっきょう] /convex lens/

The first one has "ときょ" in it (as "トきょ", where the
katakana "ト" matches in a case-insensitive manner to the
hiragana "と"), but you might consider the others unexpected,
since they don't have "ときょ" in them. They're close
("とうきょ" and "とっきょ"), but not exact. This is the result
of lookup's "fuzzification". Try the command " fuzz" (again,
don't forget the command-introduction space). You'll see that
fuzzification is turned on. Turn it off with " fuzz off" and
try "/tokyo" (which will convert as you type) again. This time
you only get the lines which have "ときょ" exactly (well, case
folding is still on, so it might match katakana as well).

In a fuzzy search, length of vowels is ignored -- "と" is
considered the same as "とう", for example. Also, the presence
or absence of any "っ" character is ignored, and the pairs
じ ぢ, ず づ, え ゑ, and お を are considered identical in a
fuzzy search.
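
To get a feel for these rules, here is a rough Python sketch. Note that it is not how lookup does it: lookup expands the search pattern into a larger regex, whereas this sketch simply reduces both strings to a common "pronunciation key" and compares the keys. The function names and the trick of reading a kana's vowel from its Unicode name are inventions for this illustration, and only hiragana input is considered.

```python
import unicodedata

# Pairs the manual says are treated as identical in a fuzzy search.
SAME_SOUND = {"ぢ": "じ", "づ": "ず", "ゑ": "え", "を": "お"}
VOWEL_KANA = {"あ": "A", "い": "I", "う": "U", "え": "E", "お": "O"}

def vowel_of(kana):
    """Final vowel of a kana, read off its Unicode name
    (for example 'HIRAGANA LETTER TO' gives 'O'); '' if there is none."""
    last = unicodedata.name(kana, "")[-1:]
    return last if last and last in "AIUEO" else ""

def fuzz_key(kana):
    """Reduce a hiragana string to a key that ignores vowel length and small tsu."""
    out = []
    prev_vowel = ""
    for ch in kana:
        ch = SAME_SOUND.get(ch, ch)
        if ch in "っー":
            continue                      # small tsu and the long-vowel mark are ignored
        v = VOWEL_KANA.get(ch)
        if v and prev_vowel and (v == prev_vowel or (v == "U" and prev_vowel == "O")):
            continue                      # a vowel kana that only lengthens the previous sound
        out.append(ch)
        prev_vowel = vowel_of(ch)
    return "".join(out)

print(fuzz_key("とうきょう"), fuzz_key("とっきょう"), fuzz_key("ときょ"))
```

All three readings from the example reduce to the same key, which is why a fuzzy search for "ときょ" also turns up "とうきょう" and "とっきょう".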

It might be convenient to consider a fuzzy search to be a
"pronunciation search". Special note: fuzzification will not
be performed if a regular expression "*", "+", or "?" modifies
a non-ASCII character. This is not an issue when input
patterns are filename-like wildcard patterns (discussed
below).

In addition to kana fuzziness, there's one special case for
kanji when fuzziness is on. The kanji repeater mark "々" will
be recognized such that "時々" and "時時" will match each
other.

Turn fuzzification back on (" fuzz on"), and search for all
whole words which sound like "tokyo". That search would be
specified as:

    search [edict]> /<tokyo>

(again, the "tokyo" will be converted to "ときょ" as you
type). My copy of edict has the three lines

    東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
    特許 [とっきょ] /special permission/patent/
    凸鏡 [とっきょう] /convex lens/

This kind of whole-word romaji-to-kana search is so common
that there's a special shortcut. Instead of typing "/<tokyo>",
you can type "[tokyo]". The leading '[' means "start romaji"
and "start of word". Were you to type "<tokyo>" instead
(without a leading '/' or '[' to indicate romaji-to-kana
conversion), you would get all lines with the English whole
word "tokyo" in them. That would be a reasonable request as
well, but not what we want at the moment.

Besides the kana conversion, you can use any cut-and-paste
that your windowing system might provide to get Japanese text
onto the search line. Cut "ときょ" from somewhere and paste it
onto the search line. When hitting enter to run the search,
you'll notice that it is done without fuzzification (even if
the fuzzification flag was "on"). That's because there's no
leading '/'. Not only does a leading '/' indicate that you
want the romaji-to-kana conversion, but also that you want it
done fuzzily.

So, if you'd like fuzzy cut-and-paste, just type a leading '/'
before pasting (or go back and prepend one after pasting).

These examples have all been pretty simple, but you can use
all the power that regexes have to offer. As a slightly more
complex example, the search "<gr[ea]y>" would look for all
lines with the words "grey" or "gray" in them. Since the '['
isn't the first character of the line, it doesn't mean what
was mentioned above (start-of-word romaji). In this case, it's
just the regular-expression "class" indicator.

If you feel more comfortable using filename-like "*.txt"
wildcard patterns, you can use the "wildcard on" command to
have patterns be considered this way.

This has been a quick introduction to the basics of lookup.
It can be very powerful and much more complex. Below is a
detailed description of its various parts and features.

READLINE INPUT
The actual keystrokes are read by a readline-ish package that
is pretty standard. In addition to just typing away, the
following keystrokes are available:

    ^B / ^F     move left/right one character on the line
    ^A / ^E     move to the start/end of the line
    ^H / ^G     delete one character to the left/right of the cursor
    ^U / ^K     delete all characters to the left/right of the cursor
    ^P / ^N     previous/next lines on the history list
    ^L or ^R    redraw the line
    ^D          delete char under the cursor, or EOF if line is empty
    ^space      force romaji conversion (^@ on some systems)

If automatic romaji-to-kana conversion is turned on (as it is
by default), there are certain situations where the conversion
will be done, as we saw above. Lower-case romaji will be
converted to hiragana, while upper-case romaji will be
converted to katakana. This usually won't matter, though, as
case folding will treat hiragana and katakana the same in the
searches.

Exactly when the automatic conversion is done is intended to
be rather intuitive once the basic idea is learned. However,
at any time, one can use control-space to convert the ASCII to
the left of the cursor to kana. This can be particularly
useful when needing to enter kana on a command line (where
auto conversion is never done; see below).

ROMAJI FLAVOR
Most flavors of romaji are recognized. Special or non-obvious
items are mentioned below. Lowercase is converted to hiragana,
uppercase to katakana.

Long vowels can be entered by repeating the vowel, or with
'-' or '^'.

In situations where an "n" could be vague, as in "na" being な
or んあ, use a single quote to force ん. Therefore,
「kenichi」→けにち while 「ken'ichi」→けんいち.

The romaji has been richly extended with many non-standard
combinations such as ふぁ or ちぇ, which are represented in
intuitive ways: 「fa」→ふぁ, 「che」→ちぇ, etc.

Various other mappings of interest:

    wo →を     we→ゑ      wi→ゐ
    VA →ヴァ   VI→ヴィ    VU→ヴ      VE→ヴェ    VO→ヴォ
    di →ぢ     dzi→ぢ     dya→ぢゃ   dyu→ぢゅ   dyo→ぢょ
    du →づ     tzu→づ     dzu→づ

    (the following kana are all smaller versions of the regular kana)
    xa →ぁ     xi→ぃ      xu→ぅ      xe→ぇ      xo→ぉ
    xu →ぅ     xtu→っ     xwa→ゎ     xka→ヵ     xke→ヶ
    xya→ゃ     xyu→ゅ     xyo→ょ
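
The conversion rules above can be pictured with a toy converter. The Python below is only a sketch: the table is a tiny hypothetical subset (just enough syllables for the examples in this manual), the longest-match strategy is an assumption about how such a converter might work, and none of the names come from lookup itself.

```python
# Tiny romaji table: a made-up subset, just enough for the manual's examples.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "chi": "ち", "to": "と",
    "kyo": "きょ", "fa": "ふぁ", "che": "ちぇ",
    "n'": "ん",   # a single quote forces the syllabic n
}

def to_katakana(hira):
    """Shift hiragana to katakana (the two Unicode blocks are 0x60 apart)."""
    return "".join(chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in hira)

def romaji_to_kana(text):
    """Greedy longest-match conversion; upper-case romaji gives katakana."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            kana = ROMAJI.get(chunk.lower())
            if kana and len(chunk) == size:
                out.append(to_katakana(kana) if chunk.isupper() else kana)
                i += size
                break
        else:
            # No syllable matched: a lone n becomes the syllabic n,
            # anything else is passed through unchanged.
            if text[i].lower() == "n":
                out.append("ン" if text[i].isupper() else "ん")
            else:
                out.append(text[i])
            i += 1
    return "".join(out)

print(romaji_to_kana("kenichi"), romaji_to_kana("ken'ichi"), romaji_to_kana("TOKYO"))
```

With this table, "kenichi" comes out as けにち because "ni" is taken as one syllable, while the quote in "ken'ichi" splits it into ん followed by い.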

INPUT SYNTAX
Any input line beginning with a space (or whichever character
is set as the command-introduction character) is processed as
a command to lookup rather than a search spec. Automatic kana
conversion is never done on these lines (but forced conversion
with control-space may be done at any time).

Other lines are taken as search regular expressions, with the
following special cases:

?   A line consisting of a single question mark will report the
    current command-introduction character (the default is a
    space, but can be changed with the "cmdchar" command).

=   If a line begins with '=', the line (without the '=') is
    taken as a search regular expression, and no automatic (or
    internal -- see below) kana conversion is done anywhere on
    the line (although again, conversion can always be forced
    with control-space). This can be used to initiate a search
    where the beginning of the regex is the command-introduction
    character, or in certain situations where automatic kana
    conversion is temporarily not desired.

/   A line beginning with '/' indicates romaji input for the
    whole line. If automatic kana conversion is turned on, the
    conversion will be done in real time, as the romaji is
    typed. Otherwise it will be done internally once the line
    is entered. Regardless, the presence of the leading '/'
    indicates that any kana (either converted or cut-and-pasted
    in) should be "fuzzified" if fuzzification is turned on.

    As an addition to the above, if the line doesn't begin with
    '=' or the command-introduction character (and automatic
    conversion is turned on), '/' anywhere on the line
    initiates automatic conversion for the following word.

[   A line beginning with '[' is taken to be romaji (just as
    with a line beginning with '/'), and the converted romaji
    is subject to fuzzification (if turned on). However, if '['
    is used rather than '/', an implied '<' "beginning of word"
    is prepended to the resulting kana regex. Also, any ending
    ']' on such a line is converted to the "ending of word"
    specifier '>' in the resulting regex.

In addition to the above, lines may have certain prefixes and
suffixes to control aspects of the search or command:

!   Various flags can be toggled for the duration of a
    particular search by prepending a "!!" sequence to the
    input line. Sequences are shown below, along with commands
    related to each:

        !F! ... Filtration is toggled for this line (filter)
        !M! ... Modification is toggled for this line (modify)
        !w! ... Word-preference mode is toggled for this line (word)
        !c! ... Case folding is toggled for this line (fold)
        !f! ... Fuzzification is toggled for this line (fuzz)
        !W! ... Wildcard-pattern mode is toggled for this line (wildcard)
        !r! ... Raw. Force fuzzification off for this line
        !h! ... Highlighting is toggled for this line (highlight)
        !t! ... Tagging is toggled for this line (tag)
        !d! ... Displaying is on for this line (display)

    The letters can be combined, as in "!cf!".

    The final '!' can be omitted if the first character after
    the sequence is not an ASCII letter.

    If no letters are given ("!!"), "!f!" is the default.

    These last two points can be conveniently combined in the
    common case of "!/romaji", which would be the same as
    "!f!/romaji".

    The special sequence "!?" lists the above, and also
    indicates which are currently turned on.

    Note that the letters accepted in a "!!" sequence are many
    of the indicators shown by the "files" command.

+   A '+' prepended to anything above will cause the final
    search regex to be printed. This can be useful to see when
    and what kind of fuzzification and/or internal kana
    conversion is happening. Consider:

        search [edict]> +/わかる
        a match is "わ[ぁあー]*っ?か[ぁあー]*る[ぅうおぉー]*"

    Due to the leading '/', the kana is fuzzified, which
    explains the somewhat complex resulting regex. For
    comparison, note:

        search [edict]> +わかる
        a match is "わかる"
        search [edict]> +!/わかる
        a match is "わかる"

    As the '+' shows, these are not fuzzified. The first one
    has no leading '/' or '[' to induce fuzzification, while
    the second has the '!' line prefix (which is the default
    version of "!f!"), which toggles fuzzification mode to
    "off" for that line.

,   The default for all searches and most commands is to work
    with the first file loaded (edict in these examples). One
    can change this default (see the "select" command) or, by
    appending a comma+digit sequence at the end of an input
    line, force that line to work with another previously
    loaded file. An appended ",1" works with the first extra
    file loaded (in these examples, kanjidic). An appended ",2"
    works with the second extra file loaded, etc. An appended
    ",0" works with the original first file (and can be useful
    if the default file has been changed via the "select"
    command).

    The following sequence shows a common usage:

        search [edict]> [ときょと]
        東京都 [とうきょうと] /Tokyo Metropolitan area/

    cutting and pasting the 都 from above, and adding a ",1" to
    search kanjidic:

        search [edict]> 都,1
        都 4554 N4769 S11 ..... ト ツ みやこ {metropolis} {capital}
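
For those who think better in code, here is a small Python sketch of how the "!...!" prefix and the comma+digit suffix could be peeled off an input line. It is a guess at the rules as worded above, not lookup's actual parser: the function name and return shape are made up, a missing final '!' before an ASCII letter is simply rejected, and the '+' prefix and other leading characters are left alone.

```python
import re

TOGGLE_LETTERS = "FMwcfWrhtd"   # the letters listed for the '!' prefix

def parse_line(line):
    """Split a search line into (toggles, body, slot).

    toggles: set of flag letters toggled for this line
    body:    the remaining search text
    slot:    file number from a trailing ',digit', or None
    """
    toggles = set()
    # "!?" is the special "list the flags" sequence, so leave it alone here.
    m = None if line.startswith("!?") else re.match(r"!([%s]*)(!?)" % TOGGLE_LETTERS, line)
    if m:
        letters, closing = m.groups()
        rest = line[m.end():]
        if not closing and rest[:1].isascii() and rest[:1].isalpha():
            raise ValueError("a closing '!' is needed before an ASCII letter")
        toggles = set(letters or "f")    # no letters given means '!f!'
        line = rest
    slot = None
    m = re.search(r",(\d)$", line)
    if m:
        slot = int(m.group(1))
        line = line[:m.start()]
    return toggles, line, slot

print(parse_line("!cf!tokyo"))
print(parse_line("!/tokyo"))
print(parse_line("都,1"))
```

The second call shows the shorthand from the text: "!/tokyo" toggles only fuzzification, exactly as "!f!/tokyo" would.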

FILENAME-LIKE WILDCARD MATCHING
When wildcard-pattern mode is selected, patterns are
considered as extended "*.txt"-like patterns. This is often
more convenient for users not familiar with regular
expressions. To have this mode selected by default, put

    default wildcard on

into your ".lookup" file (see "STARTUP FILE" below).

When wildcard mode is on, only "*", "?", "+", and "." are
affected. See the entry for the "wildcard" command below for
details.

Other features, such as multiple-pattern searches (described
below) and the other regular-expression metacharacters, remain
available.

MULTIPLE-PATTERN SEARCHES
You can put multiple patterns in a single search specifier.
For example, consider

    search [edict]> china||japan

The first part ("china") will select all lines that have
"china" in them. Then, from among those lines, the second part
will select lines that have "japan" in them. The "||" is not
part of any pattern -- it is lookup's "pipe" mechanism.

The above example is very different from the single pattern
"china|japan", which would select any line that had either
"china" or "japan". With "china||japan", you get lines that
have "china" and then also have "japan" as well.

Note that it is also different from the regular expression
"china.*japan" (or the wildcard pattern "china*japan"), which
would select lines having "china, then maybe some stuff, then
japan". But consider the case when "japan" comes on the line
before "china". Just for your comparison, the multiple-pattern
specifier "china||japan" is pretty much the same as the single
regular expression "china.*japan|japan.*china".

If you use "|!|" instead of "||", it will mean "...and then
lines not matching...".

Consider a way to find all lines of kanjidic that do have a
Halpern number, but don't have a Nelson number:

    search [edict]> <H\d+>|!|<N\d+>

If you then wanted to restrict the listing to those that also
had a "jinmeiyou" marking (kanjidic's "G9" field) and had a
reading of あき, you could make it:

    search [edict]> <H\d+>|!|<N\d+>||<G9>||<あき>

A prepended '+' would explain:

    a match is "<H\d+>"
    and not "<N\d+>"
    and "<G9>"
    and "<あき>"

The "|!|" and "||" can be used to make up to ten separate
regular expressions in any one search specification.

Again, it is important to stress that "||" does not mean "or"
(as it does in a C program, or as '|' does within a regular
expression). You might find it convenient to read "||" as "and
also", while reading "|!|" as "but not".

It is also important to stress that any whitespace around the
"||" and "|!|" constructs is not ignored, but kept as part of
the regex on either side.
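
The pipe idea is easy to mimic. The Python sketch below is an illustration only, not lookup's implementation: the function name is made up, the patterns are ordinary Python regexes (so \b is used where lookup would use '<' and '>'), case folding is left out, and the sample lines are invented rather than taken from edict or kanjidic.

```python
import re

def multi_search(lines, spec):
    """Apply a 'pat||pat|!|pat' style specifier to a list of lines.

    '||'  keeps the lines that also match the next pattern,
    '|!|' keeps the lines that do not match the next pattern.
    Whitespace around the separators stays part of the patterns.
    """
    parts = re.split(r"(\|!\||\|\|)", spec)
    selected = [line for line in lines if re.search(parts[0], line)]
    for op, pattern in zip(parts[1::2], parts[2::2]):
        rx = re.compile(pattern)
        if op == "||":
            selected = [line for line in selected if rx.search(line)]
        else:
            selected = [line for line in selected if not rx.search(line)]
    return selected

# Invented sample lines, for illustration only.
sample = ["china and japan", "japan then china", "china only", "japan only"]
print(multi_search(sample, "china||japan"))
print(multi_search(sample, "china|!|japan"))
```

Note that "japan then china" survives "china||japan" even though "japan" comes first on the line, which is exactly where the specifier differs from the single regex "china.*japan".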

COMBINATION SLOTS
Each file, when loaded, is assigned to a "slot" via which
subsequent references to the file are then made. The slot may
then be searched, have filters and flags set, etc.

A special kind of slot, called a "combination slot", rather
than representing a single file, can represent multiple
previously loaded slots. Searches against a combination slot
(or "combo slot" for short) search all those previously loaded
slots associated with it (called "component slots"). Combo
slots are set up with the combine command.

A combo slot has no filter or modify spec, but can have a
local prompt and flags just like normal file slots. The
flags, however, have special meanings with combo slots. Most
combo-slot flags act as a mask against the component-slot
flags; when acted upon as a member of the combo, a component
slot's flag will be disabled if the corresponding combo-slot
flag is disabled.

Exceptions to this are the autokana, fuzz, and tag flags.

The autokana and fuzz flags govern a combo slot exactly the
same as a regular file slot. When a slot is searched as a
component of a combination slot, the component slot's fuzz
(and autokana) flags, or lack thereof, are ignored.

The tag flag is quite different altogether; see the tag
command for complete information.

Consider the following output from the files command:

    ┏━━┳━━━━━━━━┯━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ┃ 0┃F wcfh d│a I ┃ 2762k┃/usr/jfriedl/lib/edict
    ┃ 1┃FM cf  d│a I ┃  705k┃/usr/jfriedl/lib/kanjidic
    ┃ 2┃F  cfh@d│a   ┃    1k┃/usr/jfriedl/lib/local.words
    ┃*3┃FM cfhtd│a   ┃ combo┃kotoba (#2, #0)
    ┗━━┻━━━━━━━━┷━━━━┻━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━

See the discussion of the files command below for a basic
explanation of the output.

As can be seen, slot #3 is a combination slot with the name
"kotoba" with component slots two and zero. When a search is
initiated on this slot, first slot #2 "local.words" will be
searched, then slot #0 "edict". Because the combo slot's
filter flag is on, the component slots' filter flag will
remain on during the search. The combo slot's word flag is
off, however, so slot #0's word flag will be forced off during
the search.

See the combine command for information about creating combo
slots.
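
The masking rule can be written down in a few lines. The Python below is a sketch under several assumptions: flags are modelled as sets of the one-letter indicators from the files listing, the letter 'a' is taken to stand for autokana, the tag flag is left out entirely (the tag command has its own rules), and the function name is made up.

```python
MASKED = set("FMwchd")   # flags for which the combo slot acts as a mask
FROM_COMBO = set("fa")   # fuzz and autokana: the combo slot's own setting is used

def effective_flags(component, combo):
    """Flags in effect while `component` is searched as part of `combo`."""
    masked = component & combo & MASKED   # on only if on in both slots
    own = combo & FROM_COMBO              # the component's fuzz/autokana are ignored
    return masked | own

# Flag letters read off the sample files listing (slot #0 and combo slot #3).
edict = set("Fwcfhda")
kotoba = set("FMcfhtda")

print(sorted(effective_flags(edict, kotoba)))
```

For slot #0 this leaves the filter flag ('F') on and drops the word flag ('w'), matching the description of the example.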

PAGER
Lookup has a built-in pager (a la more). Upon filling a
screen with text, the string

    --MORE [space,return,c,q]--

is shown. A space will allow another screen of text; a return
will allow one more line. A 'c' will allow output text to
continue unpaged until the next command. A 'q' will flush
output of the current command.

If supported by the OS, lookup's idea of the screen size is
automatically set upon startup and window resize. Lookup must
know the width of the screen both for horizontal input-line
scrolling and for knowing when a long line wraps on the
screen.

The pager parameters can be set manually with the "pager"
command.

COMMANDS
Any line intended to be a command must begin with the
command-introduction character (the default is a space, but it
can be set via the "cmdchar" command). However, that character
is not part of the command itself and won't be shown in the
following list of commands.

There are a number of commands that work with the selected
file or selected slot (both meaning the same thing). The
selected file is the one indicated by an appended comma+digit,
as mentioned above. If no such indication is given, the
default selected file is used (usually the first file loaded,
but this can be changed with the "select" command).

Some commands accept a boolean argument, such as to turn a
flag on or off. In all such cases, a "1" or "on" means to turn
the flag on, while a "0" or "off" is used to turn it off.
Some flags are per-file ("fuzz", "fold", etc.), and a command
to set such a flag normally sets the flag for the selected
file only. However, the default value inherited by
subsequently loaded files can be set by prepending "default"
to the command. This is particularly useful in the startup
file before any files are loaded (see the section STARTUP
FILE).

Items separated by '|' are mutually exclusive possibilities
(i.e. a boolean argument is "1|on|0|off").

Items shown in brackets ('[' and ']') are optional. All
commands that accept a boolean argument to set a flag or mode
do so optionally -- with no argument the command will report
the current status of the mode or flag.

Any command that allows an argument in quotes (such as load,
etc.) allows the use of single or double quotes.

The commands:
[default] autokana [_b_o_o_l_e_a_n]
Automatic romaji /c_a kana conversion for the _s_e_l_e_c_t_e_d _f_i_l_e
is turned on or off (default is on). However,
if,i`Edefault,i'Eis specified, the value to be inherited as the
default by subsequently-loaded files is set (or reported).
Can be temporarily disabled by a prepended,iAE=,i,C,as
described in the INPUT SYNTAX section.
clear|cls
Attempts to clear the screen. If you're using a kterm it'll
just output the appropriate tty control sequence. Otherwise
it'll try to run the,i`Eclear,i'Ecommand.
cmdchar ['_o_n_e_-_b_y_t_e_-_c_h_a_r']
The default command-introduction character is a space, but
it may be changed via this command. The single quotes sur-
rounding the character are required. If no argument is
given, the current value is printed.
An input line consisting of a single question mark will
also print the current value (useful for when you don't
know the current value).
Woe to the one that sets the command-introduction character
to one of the other special input-line characters, such
as,iAE+,i,C,,iAE/,i,C, etc.
combine ["name"] [ _n_u_m += ] _s_l_o_t_n_u_m ...
Creates or adds file slots to a combination slot (see the
COMBINATION SLOTS section for general information). Note
that,i`Ecombo,i'Emay be used as the command as well.
Assuming for this example that slots 0-2 are loaded with
the files _c_u_r_l_y, _m_o_e, and _l_a_r_r_y, we can create a combina-
tion slot that will reference all three:
combo "three stooges" 2, 0, 1
The command will report
creating combo slot #3 (three stooges): 2 0 1
14
LOOKUP(1) LOOKUP(1)
The _n_a_m_e is optional, and will appear in the _f_i_l_e_s list,
and also maybe be used to specify the slot as an argument
to the _s_e_l_e_c_t command.
A search via the newly created combo slot would search in
the order specified on the _c_o_m_b_o command line: first _l_a_r_r_y,
then _c_u_r_l_y, and finally _m_o_e.
If you later load another file (say, _j_e_f_f_r_e_y to slot #4),
you can then add it to the previously made combo:
combo 3 += 4
(the,i`E+=,i'Ewording comes from the C programming language
where it means,i`Eadd on to,i'E). Adding to a combination
always adds slots to the end of the list.
You can take the opportunity of adding the slot to also
change the name, if you like:
combo "four stooges" 3 += 4
The reply would be
adding to combo slot #3(four stooges): 4
A file slot can be a component of any particular combo slot
only once. When reporting the created or added slot num-
bers, the number will appear in parentheses if it had
already been a member of the list.
Furthermore, only file slots can be component members of
combo slots. Attempting to combine combo slot X to combo
slot Y will result in having X's component file slots
(rather than the combo slot itself) added to Y.
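The membership rules above (order preserved, each file slot at most once, combo members expanded into their file slots) can be sketched in Python. This is only an illustrative model, not lookup's implementation; the function name add_to_combo and the dictionary layout are assumptions made for the example.

```python
def add_to_combo(combo, new_members, combos):
    """Append slots to `combo` (a list of file slot numbers).

    `combos` maps combo slot numbers to their member lists (hypothetical
    layout). Returns the report list, with a number in parentheses when
    that slot was already a member.
    """
    report = []
    for slot in new_members:
        # A combo slot contributes its component file slots, not itself.
        for file_slot in list(combos.get(slot, [slot])):
            if file_slot in combo:
                report.append(f"({file_slot})")   # already present: no change
            else:
                combo.append(file_slot)           # always added at the end
                report.append(str(file_slot))
    return report


combos = {3: []}
print("creating combo slot #3:", *add_to_combo(combos[3], [2, 0, 1], combos))
print("adding to combo slot #3:", *add_to_combo(combos[3], [4], combos))
```

The search order of the combo is simply the order of the resulting list.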
command debug [boolean]
Sets the internal command parser debugging flag on or off
(default is off).
debug [boolean]
Sets the internal general-debugging flag on or off (default
is off).
describe specifier
This command will tell you how a character (or each charac-
ter in a string) is encoded in the various encoding meth-
ods:
lookup command> describe "気"
"気" as EUC is 0xb5a4 (181 164; 265 \244)
as JIS is 0x3524 ( 53 36; 65 \044 "5$")
as KUTEN is 2104 ( 0x1504; 25 \004)
as S-JIS is 0x8b43 (139 67; 213 \103)
The quotes surrounding the character or string to describe
are optional. You can also give a regular ASCII character
and have the double-width version of the character
described: giving "A", for example, would
describe "Ａ". Specifier can also be a four-digit kuten
value, in which case the character with that kuten will be
described.
If a four-digit specifier has a hex digit in it, or if it
is preceded by "0x", the value is taken as a JIS code. You
can precede the value by "jis", "sjis", "euc",
or "kuten" to force interpretation to the requested code.
Finally, specifier can be a string of stripped JIS (JIS without
the kanji-in and kanji-out codes, or with the codes but
without the escape characters in them). For example,
"F|K\" would describe the two characters 日 and 本.
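The relationship between the four codes can be checked with Python's standard codecs. The sketch below is not lookup's code; the function names describe and from_stripped_jis are invented for the example, and it handles only two-byte JIS X 0208 characters. It relies on the facts that EUC is the JIS code with the high bit of each byte set, and that kuten is the JIS code with 0x20 subtracted from each byte.

```python
def describe(ch):
    """Return the EUC, JIS, kuten, and Shift-JIS codes of one character."""
    euc = ch.encode("euc_jp")
    # Only plain two-byte JIS X 0208 characters are handled in this sketch.
    if len(euc) != 2 or euc[0] < 0xA1:
        raise ValueError("expected a single JIS X 0208 character")
    jis = bytes(b & 0x7F for b in euc)            # strip the high bits
    kuten = "%02d%02d" % (jis[0] - 0x20, jis[1] - 0x20)
    sjis = ch.encode("shift_jis")
    return {"euc": euc.hex(), "jis": jis.hex(),
            "kuten": kuten, "sjis": sjis.hex()}


def from_stripped_jis(text):
    """Decode stripped JIS: two 7-bit bytes per character, no escapes."""
    raw = text.encode("ascii")
    return bytes(b | 0x80 for b in raw).decode("euc_jp")


print(describe("気"))
print(from_stripped_jis("F|K\\"))
```

For the character whose EUC code is 0xb5a4 this gives JIS 0x3524 and kuten 2104, and the stripped-JIS string F|K\ decodes to the two characters of the JIS codes 0x467c and 0x4b5c.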
encoding [euc|sjis|jis]
The same as the -euc, -jis, and -sjis command-line options,
sets the encoding method for interactive input and output
(or reports the current status). Finer control over the out-
put encoding is available with the output encoding com-
mand. A separate encoding for input can be set with the
input encoding command.
files [ - | long ]
Lists what files are loaded in what slots, and some status
information about them, as with:
+----+-----------+-----+-------+------------------------------
|  0 | F wcf h d | a I | 2762k | /usr/jfriedl/lib/edict
|  1 | FM cf   d | a I |  705k | /usr/jfriedl/lib/kanjidic
|  2 | F  cfWh@d | a   |    1k | /usr/jfriedl/lib/local.words
| *3 | FM cf htd | a   | combo | kotoba (#2, #0)
|  4 |    cf   d | a   |  205k | /usr/dict/words
+----+-----------+-----+-------+------------------------------
The first section is the slot number, with a "*" beside the
default slot (as set by the select command).
The second section shows per-slot flags and status. Letters
are shown if the flag is on, omitted if off. In the list
below, related commands are given for each item:
F ... if there is a filter {but '#' if disabled} (filter)
M ... if there is a modify spec {but '%' if disabled} (modify)
w ... if word-preference mode is turned on (word)
c ... if case folding is turned on (fold)
f ... if fuzzification is turned on (fuzz)
W ... if wildcard-pattern mode is turned on (wildcard)
h ... if highlighting is turned on (highlight)
t ... if there is a tag {but '@' if disabled} (tag)
d ... if found lines should be displayed (display)
---------------------------------
a ... if autokana is turned on (autokana)
P ... if there is a file-specific local prompt (prompt)
I ... if the file is loaded with a precomputed index (load)
d ... if the display flag is on (display)
Note that the letters in the upper section directly corre-
spond to the "!!" sequence characters described in the
INPUT SYNTAX section.
If there is a digit at the end of the flag section, it
indicates that only #/10 of the file is actually loaded
into memory (as opposed to the file having been completely
loaded). Unloaded files will be loaded while lookup is
idle, or when first used.
If the slot is a combination slot (as slot #3 is in the
example above), that is noted in the third section, and the
combination name and component slot numbers are noted in
the fourth. Also, for combination slots (which have no fil-
ter or modify specifications, only the flags), F and/or M
are shown if the corresponding mode is allowed during
searches via the combo slot. See the tag command for info
about t with respect to combination slots.
If an argument (either "-" or "long" will work) is given to
the command, a short message about what the flags mean is
also printed.
filter ["label"] [!] /regex/[i]
Sets the filter for the selected slot (which must contain a
file and not a combination). If a filter is set and active
for a file, any line matching the given regex is filtered
from the output (if the '!' is put before the regex, any
line not matching the regex is filtered). The label,
which isn't required, merely acts as documentation in vari-
ous diagnostics.
As an example, consider that edict lines often
have "(pn)" on them to indicate that the given English is a
place name. Often these place names can be a bother, so it
would be nice to elide them from the output unless specifi-
cally requested. Consider the example:
lookup command> filter "name" /(pn)/
search [edict]> [きの]
機能 [きのう] /function/faculty/
帰納 [きのう] /inductive/
昨日 [きのう] /yesterday/
≪3 "name" lines filtered≫
In the example, '/' characters are used to delimit the
start and stop of the regex (as is common with many pro-
grams). However, any character can be used. A final 'i',
if present, indicates that the regex should be applied in a
case-insensitive manner.
The filter, once set, can be enabled or disabled with the
other form of the "filter" command (described below). It
can also be temporarily turned off (or, if disabled, tem-
porarily turned on) by the "!F!" line prefix.
Filtered lines can optionally be saved and then displayed
if you so desire. See the "saved list size" and "show"
commands.
Note that if you have saving enabled and only one line
would be filtered, it is simply printed at the end (rather
than print a one line message about how one line was fil-
tered).
By the way, a better "name" filter for edict would be:
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
as it would filter all entries that had only one English
section, that section being a name. It is also an example
of using something other than '/' to delimit a regex, as
it makes things a bit easier to read.
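To see what the name-only filter above selects, here is a rough Python equivalent. It is a sketch under assumptions: lookup's < and > word-boundary markers are both written as \b in Python's re syntax, the helper name apply_filter is invented, and the sample lines are made up rather than taken from edict.

```python
import re

# lookup's "<" and ">" (start and end of word) both become \b in Python.
NAME_ONLY = re.compile(r"^[^/]+/[^/]*\bp[ln]\b[^/]*/$")


def apply_filter(lines, pattern, negate=False):
    """Split lines into (shown, filtered).

    A line is filtered when it matches the pattern, or, with negate=True
    (the '!' form of the command), when it does not match.
    """
    shown, filtered = [], []
    for line in lines:
        hit = pattern.search(line) is not None
        (filtered if hit != negate else shown).append(line)
    return shown, filtered


sample = [
    "foo [ふう] /Footown (pn)/",        # one English section, a name
    "bar [ばあ] /Bartown (pn)/pub/",    # two English sections
    "baz [ばず] /ordinary word/",
]
shown, filtered = apply_filter(sample, NAME_ONLY)
print(len(shown), "shown,", len(filtered), "filtered")
```

Only the first sample line is filtered: the second has another English section after the name, so the pattern, which allows exactly one section, does not match it.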
filter [boolean]
Enables or disables the filter for the selected slot. If
no argument is given, displays the current filter and sta-
tus.
[default] fold [boolean]
The selected slot's case folding is turned on or off
(default is on), or reported if no argument given. How-
ever, if "default" is specified, the value to be inherited
as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the "!c!" line prefix.
[default] fuzz [boolean]
The selected slot's fuzzification is turned on or off
(default is on), or reported if no argument given. How-
ever, if "default" is specified, the value to be inherited
as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the "!f!" line prefix.
help [regex]
Without an argument gives a short help list. With an argu-
ment, lists only commands whose help string is picked up by
the given regex.
[default] highlight [boolean]
Sets matched-string highlighting on or off for the selected
slot (default off), or reports the current status if no
argument is given. However, if "default" is specified, the
value to be inherited as the default by subsequently-loaded
files is set (or reported).
If on, shows in bold or reverse video (see below) that part
of the line which was matched by the search regex. If mul-
tiple regexes were given, that part matched by the first
regex is shown.
Note that a regex might match a portion of a line which is
later removed by a modify parameter. In this case, no high-
lighting is done.
Can be temporarily toggled by the "!h!" line prefix.
highlight style [bold | inverse | standout | <___>]
Sets the style of highlighting for when highlighting is
done. Inverse (inverse video) and standout are the same.
The default is bold. You can also give an HTML tag, such
as "<BOLD>" and items will be wrapped by <BOLD>...</BOLD>.
This would be particularly useful when the output is going
to a CGI, as when lookup has been built in a server config-
uration.
Note that the highlighting is effected by using raw
VT100/xterm control sequences. This isn't very
nice if your terminal doesn't understand them. Sorry.
if {expression} command...
If the evaluated expression is non-zero, the command will
be executed.
Note that {} rather than () surround the expression.
Expression may be comprised of numbers, operators, paren-
theses, etc. In addition to the normal +, -, *, and /,
there are:
x && y ...
!x ... 'not': yields 1 if x is zero, 0 if non-zero.
x & y ... 'and': yields 1 if both x and y are non-zero, 0 otherwise.
x | y ... 'or': yields 1 if x or y (or both) is non-zero, 0 otherwise.
There may also be the special tokens true and false which
are 1 and 0 respectively.
There are also checked, matched, printed, nonword, and fil-
tered which correspond to the values printed by the stats
command.
An example use might be the following kind of thing in a
computer-generated script:
!d!expect this line
if {!printed} msg Oops! couldn't find "expect this line"
input encoding [ euc | sjis ]
Used to set (or report) what encoding to use when 8-bit
bytes are found in the interactive input (all flavors of
JIS are always recognized). Also see the encoding and out-
put encoding commands.
limit [value]
Sets the number of lines to print during any search before
aborting (or reports the current number if no value given).
Default is 100.
Output limiting is disabled if set to zero.
log [ to [+] file ]
Begins logging the program output to file (the Japanese
encoding method being the same as for screen output).
If "+" is given, the log is appended to any text that might
have previously been in file, in which case a leading
dashed line is inserted into the file.
If no arguments are given, reports the current logging sta-
tus.
log - | off
If only "-" or off is given, any currently-opened log file
is closed.
load [-now|-whenneeded] "filename"
Loads the named file to the next available slot. If a pre-
computed index is found (as "filename.jin") it is loaded as
well. Otherwise, an index is generated internally.
The file to be loaded (and the index, if loaded) will be
loaded during idle times. This allows a startup file to
list many files to be loaded, but not have to wait for each
of them to load in turn. Using the "-now" flag causes the
load to happen immediately, while using the "-whenneeded"
option (can be shortened to "-wn") causes the load
to happen only when the slot is first accessed.
Invoke lookup as
% lookup -writeindex filename
to generate and write an index file, which will then be
automatically used in the future.
If the file has already been loaded, the file is not re-
read, but the previously-read file is shared. The new slot
will, however, have its own separate flags, prompt, filter,
etc.
modify /regex/replace/[ig]
Sets the modify parameter for the selected file. If a file
has a modify parameter associated with it, each line
selected during a search will have that part of the line
which matches regex (if any) replaced by the replace
string before being printed.
Like the filter command, the delimiter need not be '/';
any non-space character is fine. If a final 'i' is given,
the regex is applied in a case-insensitive manner. If a
final 'g' is given, the replacement is done to all matches
in the line, not just the first part that might match
regex.
The replace string may have embedded "\1", etc. in it to refer
to parts of the matched text (see the tutorial on regular
expressions).
The modify parameter, once set, may be enabled or disabled
with the other form of the modify command (described
below). It may also be temporarily toggled via
the "!m!" line prefix.
A silly example for the ultra-nationalist might be:
modify /<Japan>/Dainippon Teikoku/g
So that a line such as
日銀 [にちぎん] /Bank of Japan/
would come out as
日銀 [にちぎん] /Bank of Dainippon Teikoku/
As a real example of the modify command with kanjidic, con-
sider that it is likely that one is not interested in all
the various fields each entry has. The following can be
used to remove the info on the U, N, Q, M, E, B, C, and Y
fields from the output:
modify /( [UNQMECBY]\S+)+//g,1
It's sort of complex, but works. Note that here the
replace part is empty, meaning to just remove those
parts which matched. The result of a search for 日
would normally print
日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 ＼
MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}
but with the above modify spec, appears more simply as
日 467c S4 G1 H3027 F1 P3-3-1 ニチ ジツ ひ -び -か {day}
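The effect of the two modify specifications above can be reproduced with Python's re.sub. This is only an approximation for illustration: the helper name modify is invented, lookup's <Japan> word markers are written as \bJapan\b, and the "g" and "i" flags are mapped onto re.sub's count and re.IGNORECASE.

```python
import re


def modify(line, regex, replacement, flags=""):
    """Apply one search-and-replace in the spirit of the modify command."""
    count = 0 if "g" in flags else 1          # 0 means "replace all matches"
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.sub(regex, replacement, line, count=count, flags=re_flags)


entry = ("日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 "
         "MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}")

# Remove every run of fields starting with U, N, Q, M, E, C, B, or Y.
print(modify(entry, r"( [UNQMECBY]\S+)+", "", "g"))

# Whole-word replacement, as in the "Bank of Japan" example.
print(modify("日銀 [にちぎん] /Bank of Japan/",
             r"\bJapan\b", "Dainippon Teikoku", "g"))
```

Without the "g" flag only the first run of matching fields (U65e5 through B73) would be removed, and the Q, M, E, and Y fields would remain.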
modify [boolean]
Enables or disables the modify parameter for the selected
file, or reports the current status if no argument is given.
msg string
The given string is printed.
Most likely used in a script as the target command of an if
command.
output encoding [ euc | sjis | jis...]
Used to set exactly what kind of encoding should be used
for program output (also see the input encoding command).
Used when the encoding command is not detailed enough for
one's needs.
If no argument is given, reports the current output encod-
ing. Otherwise, arguments can usually be any reasonable
dash-separated combination of:
euc
Selects EUC for the output encoding.
sjis
Selects Shift-JIS for the output encoding.
jis[78|83|90][-ascii|-roman]
Selects JIS for the output encoding. If no year (78,
83, or 90) given, 78 is used. Can optionally specify
that "English" should be encoded as regular ASCII (the
default when JIS selected) or as JIS-ROMAN.
212
Indicates that JIS X0212-1990 should be supported
(ignored for Shift-JIS output).
no212
Indicates that JIS X0212-1990 should not be sup-
ported (default setting). This places JIS X0212-1990
characters under the domain of disp, nodisp, code, or
mark (described below).
hwk
Indicates that half width kana should be left as-is
(default setting).
nohwk
Indicates that half width kana should be stripped from
the output. (not yet implemented).
foldhwk
Indicates that half width kana should be folded to
their full-width counterparts. (not yet implemented).
disp
Indicates that non-displayable characters (such as JIS
X0212-1990 while the output encoding method is Shift-
JIS) should be passed along anyway (most likely
resulting in screen garbage).
nodisp
Indicates that non-displayable characters should be
quietly stripped from the output.
code
Indicates that non-displayable characters should be
printed as their octal codes (default setting).
mark
Indicates that non-displayable characters should be
printed as "★".
Of course, not all options make sense in all combina-
tions, or at all times. When the current (or new) output
encoding is reported, a complete and exact specifier rep-
resenting the selected output encoding is shown. An example
might be "jis78-ascii-no212-hwk-code".
pager [ boolean | size ]
Turns on or off an output pager, sets its idea of the
screen size, or reports the current status.
Size can be a single number indicating the number of lines
to be printed between "MORE?" prompts (usually a few lines
less than the total screen height, the default being 20
lines). It can also be two numbers in the form "#x#" where
the first number is the width (in half-width characters;
default 80) and the second is the lines-per-page as above.
If the pager is on, every page of output will result in
a "MORE?" prompt, at which there are four possible
responses. A space will allow one more full page to print.
A return will allow one more line. A 'c' (for "con-
tinue") will allow the rest of the output (for the current
command) to proceed without pause, while
a 'q' (for "quit") will flush the output for the current
command.
If supported by the OS, the pager size parameters are set
appropriately from the window size upon startup or window
resize.
The default pager status is "off".
[local] prompt "string"
Sets the prompt string. If "local" is indicated, sets the
prompt string for the selected slot only. Otherwise, sets
the global default prompt string.
Prompt strings may have the special %-sequences shown
below, with related commands given in parentheses:
%N           ... the default slot's file or combo name
%n           ... like %N, but any leading path is not shown if a filename
%#           ... the default slot's number
%S           ... the "command-introduction" character (cmdchar)
%0           ... the running program's name
%F='string'  ... string shown if filtering enabled (filter)
%M='string'  ... string shown if modification enabled (modify)
%w='string'  ... string shown if word mode on (word)
%c='string'  ... string shown if case folding on (fold)
%f='string'  ... string shown if fuzzification on (fuzz)
%W='string'  ... string shown if wildcard-pattern mode on (wildcard)
%d='string'  ... string shown if displaying on (display)
%C='string'  ... string shown if currently entering a command
%l='string'  ... string shown if logging is on (log)
%L           ... the name of the current output log, if any (log)
For the tests (%f, etc), you can put '!' just after
the '%' to reverse the sense of the test (e.g. %!f="no
fuzz"). The reverse of %F is if a filter is installed but
disabled (i.e. string will never be shown if there is no
filter for the default file). The modify %M works compara-
bly.
Also, you can use an alternative form for the items that
take an argument string. Replacing the quotes with paren-
theses will treat _s_t_r_i_n_g as a recursive prompt specifier.
For example, the specifier
%C='command'%!C(%f='fuzzy 'search:)
would result in a "command" prompt if entering a command,
while it would result in either a "fuzzy search:" or
a "search:" prompt if not entering a command. The paren-
thesized constructs may be nested.
Note that the letters of the test constructs are the same
as the letters for the "!!" sequences described in INPUT
SYNTAX.
An example of a nice prompt command might be:
prompt "%C(%0 command)%!C(%w'*'%!f'raw '%n)> "
With this prompt specification, the prompt would normally
appear as "filename> " but when fuzzification is turned off
as "raw filename> ". And if word-preference mode is on,
the whole thing has a "*" prepended. However if a command
is being entered, the prompt would then become
"name command", where name was the program's name (system
dependent, but most likely "lookup").
The default prompt format string is
"%C(%0 command)%!C(search [%n])> ".
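The prompt format can be modelled with a small recursive expander. The Python sketch below is an illustration only and makes several assumptions: the function name expand and its two dictionaries are invented, the "=" before a quoted string is treated as optional (the examples above use both forms), and the special rule for the reverse of %F and %M is not modelled. Malformed specifiers are not handled.

```python
VALUE_KEYS = "Nn#S0L"   # sequences that insert a value rather than test a flag


def expand(spec, values, flags):
    """Expand a prompt specifier.

    values: maps value keys such as "n" or "0" to strings (hypothetical).
    flags:  maps test letters such as "C", "f", "w" to booleans.
    """
    out, i = [], 0
    while i < len(spec):
        if spec[i] != "%":
            out.append(spec[i])
            i += 1
            continue
        i += 1
        negate = spec[i] == "!"
        if negate:
            i += 1
        key = spec[i]
        i += 1
        if key in VALUE_KEYS:
            out.append(str(values.get(key, "")))
            continue
        if spec[i] == "=":                    # optional in this sketch
            i += 1
        if spec[i] == "(":                    # recursive prompt specifier
            depth, j = 1, i + 1
            while depth:
                depth += {"(": 1, ")": -1}.get(spec[j], 0)
                j += 1
            body = expand(spec[i + 1:j - 1], values, flags)
            i = j
        else:                                 # literal 'string'
            j = spec.index("'", i + 1)
            body = spec[i + 1:j]
            i = j + 1
        if bool(flags.get(key)) != negate:
            out.append(body)
    return "".join(out)


default = "%C(%0 command)%!C(search [%n])> "
names = {"0": "lookup", "n": "edict"}
print(repr(expand(default, names, {"C": True})))
print(repr(expand(default, names, {"C": False})))
```

With the default format this yields "lookup command> " while a command is being entered and "search [edict]> " otherwise, matching the prompts seen in the examples elsewhere in this manual.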
regex debug [boolean]
Sets the internal regex debugging flag (turn on if you want
billions of lines of stuff spewed to your screen).
saved list size [value]
During a search, lines that match might be elided from the
output due to filters or word-preference mode. This com-
mand sets the number of such lines to remember during any
one search, such that they may be later displayed (before
the next search) by the show command.
The default is 100.
select [ num | name | . ]
If num is given, sets the default slot to that slot number.
If name is given, sets the default slot to the first slot
found with a file (or combination) loaded with that name.
The incantation "select ." merely sets the default slot to
itself, which can be useful in script files where you want
to indicate that any subsequent flag changes should work
with whatever file was the default at the time the script
was sourced.
If no argument is given, simply reports the current default
slot (also see the files command).
In command files loaded via the source command, or as the
startup file, commands dealing with per-slot items (flags,
local prompt, filters, etc.) work with the file or slot
last selected. The last such selected slot remains
selected once the load is complete.
Interactively, the default slot will become the selected
slot for subsequent searches and commands that aren't aug-
mented with an appended ",#" (as described in the INPUT
SYNTAX section).
show
Shows any lines elided from the previous search (either due
to a filter or word-preference mode).
Will apply any modifications (see the "modify" command) if
modifications are enabled for the file. You can use
the "!m!" line prefix as well with this command (in this
case, put the "!m!" before the command-indicator charac-
ter).
The length of the list is controlled by the "saved list
size" command.
source "filename"
Commands are read from filename and executed.
In the file, all lines beginning with "#" are ignored as
comments (note that comments must appear on a line by them-
selves, as "#" is a reasonable character to have within
commands).
Lines whose first non-blank character
is "=", "!", or "+" are considered searches, while all
other non-blank lines are considered lookup commands.
Therefore, there is no need for lines to begin with the
command-introduction character. However, leading whitespace
is always OK.
For search lines, take care that any trailing whitespace is
deleted if undesired, as trailing whitespace (like all non-
leading whitespace) is kept as part of the regular expres-
sion.
Within a command file, commands that modify per-file flags
and such always work with the most-recently loaded (or
selected) file. Therefore, something along the lines of
load "my.word.list"
set word on
load "my.kanji.list"
set word off
set local prompt "enter kanji> "
would work as might make intuitive sense.
Since a script file must have a load or select before any
per-slot flag is set, one can use "select ." to facilitate
command scripts that are to work with "the current slot".
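The line-classification rules for command files can be summarized in a few lines of Python. This is a sketch, not lookup's parser; the function name classify is invented, and it assumes that a comment's "#" may be preceded by whitespace, which the text above does not state explicitly.

```python
def classify(line):
    """Classify one line of a command file as described for source."""
    stripped = line.lstrip()          # leading whitespace is always OK
    if not stripped:
        return "blank"
    if stripped.startswith("#"):      # assumption: whitespace may precede '#'
        return "comment"
    if stripped[0] in "=!+":
        return "search"               # trailing whitespace stays in the regex
    return "command"


for text in ["## a comment", "  verbose off", "!d!expect this line", ""]:
    print(repr(text), "->", classify(text))
```

Note that a "#" later in a line is not a comment marker, which is why a command such as a filter using "#" as its regex delimiter is still classified as a command.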
spinner [value]
Set the value of the spinner (a silly little feature). If
set to a non-zero value, will cause a spinner to spin while
a file is being checked, one increment per value lines in
the file actually checked against the search specifier.
Default is off (i.e. zero).
stats
Shows information about how many lines of the text file
were checked against the last search specifier, and how
many lines matched and were printed.
tag [boolean] ["string"]
Enable, disable, or set the tag for the selected slot.
If the slot is not a combination slot, a tag string may be
set (the quotes are required).
If a tag string is set and enabled for a file, the string
is prepended to each matching output line printed.
Unlike the filter and modify commands which automatically
enable the function when a parameter is set, a tag is not
automatically enabled when set. It can be enabled while
being set by giving "on" in the same command, as in
tag on "string", or could be enabled subsequently via
just "tag on".
If the selected slot is a combination slot,
only the enable/disable status may be changed (on by
default). No tag string may be set.
The reason for the special treatment lies in the special
nature of how tags work in conjunction with combination
files.
During a search when the selected slot is a combination
slot, each file which is a member of the combination has
its per-file flags disabled if their corresponding flag is
disabled in the original combination slot. This allows the
combination slot's flags to act as a "mask" to blot out
each component file's per-file flags.
The tag flag, however, is special in that the component
file's tag flag is turned on if the combination slot's tag
flag is turned on (and, of course, the component file has a
tag string registered).
The intended use of this is that one might set a (disabled)
tag to a file, yet direct searches against that file will
have no prepended tag. However, if the file is searched as
part of a combination slot (and the combination slot's tag
flag is on), the tag will be prepended, allowing one to
easily understand from which file an output line comes.
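The masking behaviour described above, including the special case for the tag flag, can be written out as a short Python sketch. The function name effective_flags and the flag names used as dictionary keys are made up for illustration; this is a model of the described rules, not lookup's code.

```python
def effective_flags(file_flags, combo_flags, has_tag_string):
    """Flags in force for one component file during a combo-slot search."""
    # Ordinary flags: the combo slot's flag acts as a mask on the file's flag.
    effective = {name: on and combo_flags[name]
                 for name, on in file_flags.items() if name != "tag"}
    # The tag flag is turned ON by the combo slot, whatever the file's own
    # setting, provided the file has a tag string registered.
    effective["tag"] = combo_flags["tag"] and has_tag_string
    return effective


file_flags = {"filter": True, "word": True, "tag": False}
combo_flags = {"filter": False, "word": True, "tag": True}
print(effective_flags(file_flags, combo_flags, has_tag_string=True))
```

Here the file's filter is masked off by the combo slot, while its tag, disabled for direct searches, is switched on for the combined search.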
verbose [boolean]
Sets verbose mode on or off, or reports the current status
(default on). Many commands reply with a confirmation if
verbose mode is turned on.
version
Reports the current version of the program.
[default] wildcard [boolean]
The selected slot's patterns are considered wildcard pat-
terns if turned on, regular expressions if turned off. The
current status is reported if no argument given. However,
if "default" is specified, the pattern-type to be inherited
as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the "!W!" line prefix.
When wildcard patterns are selected, the changed metachar-
acters are: "*" means "any stuff", "?" means "any one
character", while "+" and "." become unspecial. Other regex
items such as "|", "(", "[", etc. are unchanged.
What "*" and "?" will actually match depends upon the sta-
tus of word-mode, as well as on the pattern itself. If
word-mode is on, or if the pattern begins with the start-
of-word "<" or "[", only non-spaces will be matched. Other-
wise, any character will be matched.
In summary, when wildcard mode is on, the input pattern is
affected in the following ways:
* is changed to the regular expression .* or \S*
? is changed to the regular expression . or \S
+ is changed to the regular expression \+
. is changed to the regular expression \.
Because filename patterns are often called "filename
globs", the command "glob" can be used in place of "wild-
card".
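The wildcard translation can be sketched as a character-by-character rewrite. The Python below is a simplified illustration, not lookup's code: the function name glob_to_regex is invented, \S is assumed as the spelling of "any non-space", and characters inside a [...] class are not treated specially.

```python
def glob_to_regex(pattern, word_mode=False):
    """Rewrite a wildcard pattern as a regular expression."""
    # Only non-spaces are matched in word mode, or when the pattern
    # begins with the start-of-word "<" or with "[".
    nonspace = word_mode or pattern[:1] in ("<", "[")
    any_char = r"\S" if nonspace else "."
    table = {
        "*": any_char + "*",   # "any stuff"
        "?": any_char,         # "any one character"
        "+": r"\+",            # becomes unspecial
        ".": r"\.",            # becomes unspecial
    }
    return "".join(table.get(ch, ch) for ch in pattern)


print(glob_to_regex("gr?y*"))
print(glob_to_regex("<gr?y*"))
print(glob_to_regex("a.b+c"))
```

Other regex items such as "|" and "(" pass through unchanged, as the description above requires.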
[default] word|wordpreference [boolean]
The selected file's word-preference mode is turned on or
off (default is off), or the current setting is reported if no
argument is specified. However, if "default" is specified,
the value to be inherited as the default by subsequently-
loaded files is set (or reported).
In word-preference mode, entries are searched for as if the
search regex had a leading '<' and a trailing '>',
resulting in a list of entries with a whole-word match of
the regex. However, if there are none, but there are non-
word entries, the non-word entries are shown (the "saved
list" is used for this -- see that command). This makes it
an "if there are whole words like this, show me, otherwise
show me whatever you've got" mode.
If there are both word and non-word entries, the non-word
entries are remembered in the saved list (rather than any
possible filtered entries being remembered there).
One caveat: if a search matches a line in more than one
place, and the first is not a whole-word, while one of the
others is, the line will be considered non-whole
word. For example, the search 「japan」 with word-preference
mode on will not list an entry such as "/Japanese/language
in Japan/", as the first "Japan" is part of "Japanese" and
not a whole word. If you really need just whole-word
entries, use the '<' and '>' yourself.
The mode may be temporarily toggled via the "!w!" line pre-
fix.
The rules defining what lines are filtered, remembered,
discarded, and shown for each permutation of search are
rather complex, but the end result is rather intuitive.
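A simplified Python model of word-preference mode, including the first-match caveat, might look like the following. It is a sketch under assumptions: the function name is invented, a "word" character is taken to be anything str.isalnum accepts, case folding is assumed on, and filters and the saved-list size limit are ignored.

```python
import re


def word_preference_search(regex, lines):
    """Return (shown, saved) for a search in word-preference mode."""
    pattern = re.compile(regex, re.IGNORECASE)   # case folding assumed on
    words, nonwords = [], []
    for line in lines:
        match = pattern.search(line)             # only the FIRST match counts
        if match is None:
            continue
        before = line[match.start() - 1] if match.start() > 0 else " "
        after = line[match.end()] if match.end() < len(line) else " "
        whole_word = not (before.isalnum() or after.isalnum())
        (words if whole_word else nonwords).append(line)
    if words:
        return words, nonwords    # non-word entries go to the saved list
    return nonwords, []           # nothing better: show what there is


entries = ["x /Japanese/language in Japan/", "y /Japan/"]
shown, saved = word_preference_search("japan", entries)
print(shown, saved)
```

The first entry lands in the saved list even though it contains the whole word "Japan", because its first match is inside "Japanese".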
quit | leave | bye | exit
Exits the program.
STARTUP FILE
If the file "~/.lookup" is present, commands are read from it
during lookup startup.
The file is read in the same way as the source command reads
files (see that entry for more information on file format,
etc.).
However, if there had been files loaded via command-line argu-
ments, commands within the startup file to load files (and
their associated commands such as to set per-file flags) are
ignored.
Similarly, any use of the command-line flags -euc, -jis, or
-sjis will disable in the startup file the commands dealing
with setting the input and/or output encodings.
The special treatment mentioned in the above two paragraphs
only applies to commands within the startup file itself, and
does not apply to commands in command-files that might be
sourced from within the startup file.
The following is a reasonable example of a startup file:
## turn verbose mode off during startup file processing
verbose off
prompt "%C([%#]%0)%!C(%w'*'%!f'raw '%n)> "
spinner 200
pager on
## The filter for edict will hit for entries that
## have only one English part, and that English part
## having a pl or pn designation.
load ~/lib/edict
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
highlight on
word on
## The filter for kanjidic will hit for entries without a
## frequency-of-use number. The modify spec will remove
## fields with the named initial code (U,N,Q,M,E, and Y)
load ~/lib/kanjidic
filter "uncommon" !/<F\d+>/
modify /( [UNQMEY]\S+)+//g
## Use the same filter for my local word file,
## but turn off by default.
load ~/lib/local.words
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
filter off
highlight on
word on
## Want a tag for my local words, but only when
## accessed via the combo below
tag off "》"
combine "words" 2 0
select words
## turn verbosity back on for interactive use.
verbose on
COMMAND-LINE ARGUMENTS
With the use of a startup file, command-line arguments are
rarely needed. In practical use, they are only needed to cre-
ate an index file, as in:
lookup -write textfile
Any command line arguments that aren't flags are taken to be
files which are loaded in turn during startup. In this case,
any "load", "filter", etc. commands in the startup file are
ignored.
The following flags are supported:
-help
Reports a short help message and exits.
-write
Creates index files for the named files and exits. No
startup file is read.
-euc
Sets the input and output encoding method to EUC (currently
the default). Exactly the same as the "encoding euc"
command.
-jis
Sets the input and output encoding method to JIS. Exactly
the same as the “encoding jis” command.
-sjis
Sets the input and output encoding method to Shift-JIS.
Exactly the same as the “encoding sjis” command.
-v -version
Prints the version string and exits.
-norc
Indicates that the startup file should not be read.
-rc _f_i_l_e
The named file is used as the startup file, rather than the
default “~/.lookup”. It is an error for the file not to
exist.
-percent num
When an index is built, letters that appear on more than
num percent (default 50) of the lines are omitted from the
index. The reasoning is that a search for such a letter will
have to check most of the lines in the file anyway, so
indexing it gives a diminishing return, and leaving it out
saves the large amount of index-file space needed to
represent that information.
Smaller indexes can be made by using a smaller number.
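
The idea behind -percent can be sketched in a few lines of Python. This is only an illustration of the selection rule described above, not lookup's actual index format or code; the function name letters_to_index and the sample lines are made up for the example.

```python
from collections import Counter

def letters_to_index(lines, percent=50):
    """Return the set of characters worth indexing.

    A character is skipped when it appears on more than `percent`
    percent of the lines, mirroring the idea behind -percent.
    """
    line_counts = Counter()
    for line in lines:
        line_counts.update(set(line))   # count each char once per line
    limit = len(lines) * percent / 100
    return {ch for ch, n in line_counts.items() if n <= limit}

lines = ["abc", "abd", "aef", "agh"]
kept = letters_to_index(lines, percent=50)     # 'a' is on every line
kept_small = letters_to_index(lines, percent=25)
```

Here 'a' is dropped at 50 percent because it appears on every line, and lowering the threshold to 25 percent also drops 'b', giving a smaller index.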
-noindex
Indicates that any files loaded via the command line should
not be loaded with any precomputed index, but recalculated
on the fly.
-verbose
Has metric tons of stats spewed whenever an index is
created.
-port ###
For the (undocumented) server configuration only, tells
which port to listen on.
OPERATING SYSTEM CONSIDERATIONS
I/O primitives and behaviors vary with the operating system.
On my operating system, I can “read” a file by mapping it into
memory, which is a pretty much instant procedure regardless of
the size of the file. When I later access that memory, the
appropriate sections of the file are automatically read into
memory by the operating system as needed.
This results in lookup starting up and presenting a prompt
very quickly, but causes the first few searches that need to
check a lot of lines in the file to go more slowly (as lots of
the file will need to be read in). However, once the bulk of
the file is in, searches will go very fast. The win here is
that the rather long file-load times are amortized over the
first few (or few dozen, depending upon the situation)
searches rather than always faced right at command startup
time.
On the other hand, on an operating system without the mapping
ability, lookup would start up very slowly as all the files
and indexes are read into memory, but would then search
quickly from the beginning, all the file already having been
read.
To get around the slow startup, particularly when many files
are loaded, lookup uses lazy loading if it can: a file is not
actually read into memory at the time the load command is
given. Rather, it will be read when first actually accessed.
Furthermore, files are loaded while lookup is idle, such as
when waiting for user input. See the files command for more
information.
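
The memory-mapping behavior described above can be tried out with Python's standard mmap module. This is a minimal sketch of the general technique, not lookup's implementation; the helper name count_matching_lines and the throwaway demo file are assumptions made for the example.

```python
import mmap
import os
import tempfile

def count_matching_lines(path: str, needle: bytes) -> int:
    """Count lines containing `needle`, reading the file via mmap."""
    with open(path, "rb") as f:
        # Mapping is cheap; pages are read from disk only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            for line in iter(mapped.readline, b""):
                if needle in line:
                    count += 1
            return count

# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"gray sky\ngrey sky\nblue sky\n")
    demo_path = tmp.name

gray_lines = count_matching_lines(demo_path, b"gr")
os.remove(demo_path)
```

The call to mmap.mmap returns almost immediately whatever the file size; the cost of reading is paid as the loop touches each part of the file, which is the amortization described above.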
REGULAR EXPRESSIONS, A BRIEF TUTORIAL
Regular expressions (“regex” for short) are a “code” used to
indicate what kind of text you're looking for. They're how
one searches for things in the editors “vi”, “stevie”,
“mifes”, etc., or with the grep commands. There are
differences among the various regex flavors in use -- I'll
describe the flavor used by lookup here. Also, in order to be
clear for the common case, I might tell a few lies, but
nothing too heinous.
The regex 「a」 means “any line with an ‘a’ in it.” Simple
enough.
The regex 「ab」 means “any line with an ‘a’ immediately
followed by a ‘b’”. So the line
    I am feeling flabby
would “match” the regex 「ab」 because, indeed, there's
an “ab” on that line. But it wouldn't match the line
    this line has no a followed _immediately_ by a b
because, well, what the line says is true.
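
The same two lines can be checked with Python's re module, which should agree with lookup for a plain literal like this one (an assumption about the two flavors, though a safe one for simple letters). re.search looks for the pattern anywhere in the line, which corresponds to the “any line with ...” reading above.

```python
import re

pattern = re.compile("ab")

# "flabby" contains an 'a' immediately followed by a 'b'.
matches_flabby = pattern.search("I am feeling flabby") is not None

# Here every 'a' is followed by something other than 'b'.
matches_other = pattern.search(
    "this line has no a followed _immediately_ by a b") is not None
```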
In most cases, letters and numbers in a regex just mean that
you're looking for those letters and numbers in the order
given. However, there are some special characters used within
a regex.
A simple example would be a period. Rather than indicate that
you're looking for a period, it means “any character”. So
the silly regex 「.」 would mean “any line that has any
character on it.” Well, maybe not so silly... you can use it
to find non-blank lines.
But more commonly it's used as part of a larger regex.
Consider the regex 「gray」. It wouldn't match the line
    The sky was grey and cloudy.
because of the different spelling (grey vs. gray). But the
regex 「gr.y」 asks for “any line with a ‘g’, ‘r’, some
character, and then a ‘y’”. So this would
get “grey” and “gray”.
A special construct somewhat similar
to ‘.’ would be the character class. A character class
starts with a ‘[’ and ends with a ‘]’, and will match any
character given in between. An example might be
    gr[ea]y
which would match lines with a ‘g’, ‘r’, an ‘e’ _or_
an ‘a’, and then a ‘y’. Inside a character class you can
list as many characters as you want to.
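
A short Python sketch shows the difference between the period and a character class. Python's re module is used here as a stand-in for lookup's regex flavor; the two are assumed to behave alike for these basic constructs.

```python
import re

line = "The sky was grey and cloudy."

literal = re.search("gray", line)         # no match: spelling differs
any_char = re.search("gr.y", line)        # '.' accepts the 'e'
char_class = re.search("gr[ea]y", line)   # only 'e' or 'a' allowed

# '.' is looser than the class: it also accepts, say, "groy".
dot_groy = re.search("gr.y", "groy")
class_groy = re.search("gr[ea]y", "groy")
```

The class is the better choice when only specific spellings should be found, since the period lets any character through.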
For example the simple regex 「x[0123456789]y」 would match any
line with a digit sandwiched between an ‘x’ and a ‘y’.
The order of the characters within the character class doesn't
really matter... 「[5134067289]」 would be the same
as 「[0123456789]」.
But as a short cut, you could put 「[0-9]」 instead
of 「[0123456789]」. So the character class 「[a-z]」 would
match any lower-case letter, while the character
class 「[a-zA-Z0-9]」 would match any letter or digit.
The character ‘-’ is special within a character class, but
only if it's not the first thing. Another character that's
special in a character class is ‘^’, if it _is_ the first
thing. It “inverts” the class so that it will match any
character _not_ listed. The class 「[^a-zA-Z0-9]」 would match
any line with a space, punctuation mark, or other character
that is not a letter or digit on it.
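
Ranges, a leading ‘-’, and an inverted class can be exercised with a small Python helper. The helper name hits is made up for this sketch, and Python's re is assumed to treat these character-class features the same way lookup does.

```python
import re

def hits(regex: str, line: str) -> bool:
    """True if the regex matches anywhere in the line."""
    return re.search(regex, line) is not None

digit_range = hits("x[0-9]y", "ax7yb")         # range shorthand
digit_list = hits("x[0123456789]y", "ax7yb")   # same class, spelled out
leading_dash = hits("[-0-9]", "a-b")           # '-' first: a literal dash
only_alnum = hits("[^a-zA-Z0-9]", "abc123")    # nothing outside the class
has_other = hits("[^a-zA-Z0-9]", "abc 123")    # the space is "not listed"
```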
There are some special short-hand sequences for some common
character classes. The sequence 「\d」 means “digit”, and is
the same as 「[0-9]」. 「\w」 means “word element” and is the
same as 「[0-9a-zA-Z_]」. 「\s」 means “space-type thing” and is
the same as 「[ \t]」 (「\t」 means tab).
You can also use 「\D」, 「\W」, and 「\S」 to mean things _not_ a
digit, word element, or space-type thing.
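
The short-hand classes exist in Python's re module as well, which makes for an easy experiment. Note one assumption: Python's \d and \w are Unicode-aware by default, so the re.ASCII flag is passed to get the ASCII-only meanings given above. Python's \s also covers a few more whitespace characters than space and tab.

```python
import re

line = "room_42 is\tfree"

digits = re.findall(r"\d+", line, re.ASCII)     # runs of [0-9]
words = re.findall(r"\w+", line, re.ASCII)      # runs of [0-9a-zA-Z_]
spaces = re.findall(r"\s", line)                # space-type characters
non_words = re.findall(r"\W", line, re.ASCII)   # anything not \w
```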
Another special character would be ‘?’. This means “maybe
one of whatever was just before it, but none is fine too”. In
the regex 「bikes? for rent」, the “whatever” would be the ‘s’,
so this would match lines with either “bikes for
rent” or “bike for rent”.
Parentheses are also special, and can group things together.
In the regex
    big (fat harry)? deal
the “whatever” for the ‘?’ would be “fat harry”. But be
careful to pay attention to details... this regex would match
    I don't see what the big fat harry deal is!
but _not_
    I don't see what the big deal is!
That's because if you take away the “whatever” of the ‘?’,
you end up with
    big  deal
Notice that there are _two_ spaces between the words, so the
regex demands two spaces there, while the second line has
only one. The regex to get either line above would be
    big (fat harry )?deal
or
    big( fat harry)? deal
Do you see how they're essentially the same?
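
The space pitfall is easy to reproduce. The following Python sketch runs the three variants against both example lines; the helper name hits is made up, and Python's re is assumed to match lookup's handling of ‘?’ and parentheses.

```python
import re

def hits(regex: str, line: str) -> bool:
    """True if the regex matches anywhere in the line."""
    return re.search(regex, line) is not None

with_harry = "I don't see what the big fat harry deal is!"
without = "I don't see what the big deal is!"

naive = "big (fat harry)? deal"     # leaves two spaces when group is absent
fixed_a = "big (fat harry )?deal"   # trailing space moved into the group
fixed_b = "big( fat harry)? deal"   # leading space moved into the group

# Map each regex to (matches with_harry, matches without).
results = {rx: (hits(rx, with_harry), hits(rx, without))
           for rx in (naive, fixed_a, fixed_b)}

bike = hits("bikes? for rent", "bike for rent")
bikes = hits("bikes? for rent", "bikes for rent")
```

Only the naive version fails on the second line; both corrected versions accept either line.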
Similar to ‘?’ is ‘*’, which means “any number, including
none, of whatever's right in front”. It more or less means
that whatever is tagged with ‘*’ is allowed, but not
required, so something like
    I (really )*hate peas
would match “I hate peas”, “I really hate peas!”, “I really
really hate peas”, etc.
Similar to both ‘?’ and ‘*’ is ‘+’, which means “at least
one of whatever just in front, but more is fine too”. The
regex 「mis+pelling」 would match “mispelling”,
“misspelling”, “missspelling”, etc. Actually, it's just the
same as 「miss*pelling」 but simpler to type. The
regex 「ss*」 means “an ‘s’, followed by zero or more ‘s’”,
while 「s+」 means “one or more ‘s’”. The two are really the
same.
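
The claim that 「mis+pelling」 and 「miss*pelling」 are equivalent can be checked mechanically. This Python sketch assumes ‘*’ and ‘+’ behave in Python's re as they do in lookup; the word list is chosen for the example.

```python
import re

peas = re.compile(r"I (really )*hate peas")
star_hits = [bool(peas.search(s)) for s in (
    "I hate peas",
    "I really hate peas!",
    "I really really hate peas",
)]

plus = re.compile("mis+pelling")    # at least one 's'
star = re.compile("miss*pelling")   # one 's', then zero or more 's'
words = ["mipelling", "mispelling", "misspelling", "missspelling"]
plus_hits = [bool(plus.search(w)) for w in words]
star_equiv = [bool(star.search(w)) for w in words]
```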
The special character ‘|’ means “or”. Unlike ‘+’, ‘*’,
and ‘?’, which act on the thing _immediately_ before,
the ‘|’ is more “global”.
    give me (this|that) one
would match lines that had “give me this one” or “give me that
one” in them.
You can even combine more than two:
    give me (this|that|the other) one
How about:
    [Ii]t is a (nice |sunny |bright |clear )*day
Here, the “whatever” immediately before the ‘*’ is
    (nice |sunny |bright |clear )
So this regex would match all the following lines:
    _It is a day_.
    I think _it is a nice day_.
    _It is a clear sunny day_ today.
    If _it is a clear sunny nice sunny sunny sunny bright day_ then....
Notice how the 「[Ii]t」 matches either “It” or “it”?
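
To see exactly which part of each line the regex picks out, the following Python sketch returns the matched text. It assumes Python's re handles ‘|’ and grouping as lookup does; the helper name matched_text and the fourth sample line are additions for the example.

```python
import re

day = re.compile(r"[Ii]t is a (nice |sunny |bright |clear )*day")

def matched_text(line: str):
    """Return the text the regex matched, or None if there is no match."""
    m = day.search(line)
    return m.group(0) if m else None

lines = [
    "It is a day.",
    "I think it is a nice day.",
    "It is a clear sunny day today.",
    "It is a rainy day.",   # "rainy " is not among the alternatives
]
matched = [matched_text(s) for s in lines]
```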
Note that the above regex would also match
    fru_it is a day_
because it indeed fulfills all requirements of the regex, even
though the “it” is really part of the word “fruit”. To
address concerns like this, which are common, there
are ‘<’ and ‘>’, which mean “word break”. The
regex 「<it」 would match any line with “it” _beginning a word_,
while 「it>」 would match any line with “it” _ending a word_.
And, of course, 「<it>」 would match any line with _the
word_ “it” in it.
Going back to the regex to find grey/gray, that would make
more sense, then, as
    <gr[ae]y>
which would match only the _words_ “grey” and “gray”.
Somewhat similar are ‘^’ and ‘$’, which mean “beginning of
line” and “end of line”, respectively (but not in a
character class, of course). So the regex 「^fun」 would find any
line that begins with the letters “fun”, while 「^fun>」 would
find any line that begins with the _word_ “fun”.
「^fun$」 would find any line that was exactly “fun”.
Finally, 「^\s*fun\s*$」 would match any line
that was “fun” exactly, but perhaps also had leading and/or
trailing whitespace.
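
Python's re module has no ‘<’ and ‘>’ word-break markers, so this sketch substitutes \b for both. That substitution is an assumption made for illustration; the anchors ‘^’ and ‘$’ carry over unchanged.

```python
import re

# Without a word break, "it" is found inside "fruit".
unanchored = re.search(r"[Ii]t is a day", "fruit is a day")
# \b stands in for lookup's '<' here.
word_start = re.search(r"\b[Ii]t is a day", "fruit is a day")

# \b on both sides stands in for <gr[ae]y>.
gray_words = re.findall(r"\bgr[ae]y\b", "grey gray greyhound grayish")

starts_with_word = [bool(re.search(r"^fun\b", s))
                    for s in ("fun times", "funny", "no fun")]
exactly_fun = [bool(re.search(r"^\s*fun\s*$", s))
               for s in ("fun", "  fun \t", "fun!")]
```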
That's pretty much it. There are more complex things, some of
which I'll mention in the list below, but even with these few
simple constructs one can specify very detailed and complex
patterns.
Let's summarize some of the special things in regular
expressions:
Items that are basic units:
  char       Any non-special character matches itself.
  \char      Special chars, when preceded by \, become non-special.
  .          Matches any one character (except \n).
  \n         Newline.
  \t         Tab.
  \r         Carriage return.
  \f         Formfeed.
  \d         Digit. Just a short-hand for [0-9].
  \w         Word element. Just a short-hand for [0-9a-zA-Z_].
  \s         Whitespace. Just a short-hand for [\t \n\r\f].
  \## \###   Two or three digit octal number indicating a single byte.
  [chars]    Matches a character if it's one of the characters listed.
  [^chars]   Matches a character if it's not one of the ones listed.
             The \char items above can be used within a character
             class, but not the items below.
  \D         Anything not \d.
  \W         Anything not \w.
  \S         Anything not \s.
  \a         Any ASCII character.
  \A         Any multibyte character.
  \k         Any (not half-width) katakana character (including ー).
  \K         Any character not \k (except \n).
  \h         Any hiragana character.
  \H         Any character not \h (except \n).
  (regex)    Parens make the regex one unit.
  (?:regex)  [from perl5] Grouping-only parens -- can't use for \# (below).
  \c         Any JISX0208 kanji (kuten rows 16-84).
  \C         Any character not \c (except \n).
  \#         Match whatever was matched by the #th paren from the left.
With “☆” to indicate one “unit” as above, the following may be used:
  ☆?         A ☆ allowed, but not required.
  ☆+         At least one ☆ required, but more ok.
  ☆*         Any number of ☆ ok, but none required.
There are also ways to match “situations”:
  \b         A word boundary.
  <          Same as \b.
  >          Same as \b.
  ^          Matches the beginning of the line.
  $          Matches the end of the line.
Finally, the “or” is
  reg1|reg2  Match if either reg1 or reg2 match.
Note that “\k” and the like aren't allowed in character classes, so
something such as 「[\k\h]」 to try to get all kana won't work.
Use 「(\k|\h)」 instead.
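
Two items from the summary that the tutorial did not demonstrate are the \# back-reference and the grouping-only parens. Python's re module writes them the same way (\1, \2, ... and (?:...)), so a brief sketch is possible; the Japanese-specific classes such as \k and \h have no direct equivalent in Python's re and are not covered here.

```python
import re

# \1 refers back to whatever the first capturing paren matched,
# so (\w)\1 finds a word element that is immediately repeated.
doubled = re.findall(r"(\w)\1", "bookkeeper", re.ASCII)

# (?:...) groups without capturing, so the next paren is still \1.
repeat = re.search(r"(?:very )+(\w+) \1", "a very very big big dog")
repeated_word = repeat.group(1) if repeat else None
```

Because the first group is grouping-only, the capturing paren around the word remains number 1, which is what the back-reference relies on.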
BUGS
Needs full support for half-width katakana and JIS X
0212-1990.
Non-EUC (JIS & SJIS) items not tested well.
Probably won't work on non-UNIX systems.
Screen control codes (for clear and highlight commands) are
hard-coded for ANSI/VT100/kterm.
AUTHOR
Jeffrey Friedl (jfriedl@nff.ncl.omron.co.jp)
INFO
Jim Breen's text files edict and kanjidic and their
documentation can be found in “pub/nihongo” on
ftp.cc.monash.edu.au (130.194.1.106).
Information on input and output encoding and codes can be
found in Ken Lunde's Understanding Japanese Information
Processing (日本語情報処理), published by O'Reilly and
Associates. ISBN 1-56592-043-0. There is also a Japanese edition
published by SoftBank.
A program to convert files among the various encoding methods
is Dr. Ken Lunde's jconv, which can also be found on
ftp.cc.monash.edu.au. Jconv is also useful for converting
half-width katakana (which lookup doesn't yet support well) to
full-width.
|