File: README

package info (click to toggle)
javacc 3.2%2B0-1
links: PTS
area: main
in suites: sarge
size: 2,408 kB
ctags: 1,462
sloc: java: 15,927; xml: 233; makefile: 53; sh: 25
file content (105 lines) | stat: -rw-r--r-- 4,964 bytes
parent folder | download | duplicates (5)

/*
 * Copyright © 2002 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * California 95054, U.S.A. All rights reserved.  Sun Microsystems, Inc. has
 * intellectual property rights relating to technology embodied in the product
 * that is described in this document. In particular, and without limitation,
 * these intellectual property rights may include one or more of the U.S.
 * patents listed at http://www.sun.com/patents and one or more additional
 * patents or pending patent applications in the U.S. and in other countries.
 * U.S. Government Rights - Commercial software. Government users are subject
 * to the Sun Microsystems, Inc. standard license agreement and applicable
 * provisions of the FAR and its supplements.  Use is subject to license terms.
 * Sun,  Sun Microsystems,  the Sun logo and  Java are trademarks or registered
 * trademarks of Sun Microsystems, Inc. in the U.S. and other countries.  This
 * product is covered and controlled by U.S. Export Control laws and may be
 * subject to the export or import laws in other countries.  Nuclear, missile,
 * chemical biological weapons or nuclear maritime end uses or end users,
 * whether direct or indirect, are strictly prohibited.  Export or reexport
 * to countries subject to U.S. embargo or to entities identified on U.S.
 * export exclusion lists, including, but not limited to, the denied persons
 * and specially designated nationals lists is strictly prohibited.
 */

This example illustrates the use of lexical states.

In this directory are two grammar files Digest.jj and Faq.jj.  These
generate parsers that process RMAIL files that are created by the GNU
emacs editor.  A sample RMAIL file called sampleMailFile is also
included.

Digest.jj and Faq.jj are both identical in their grammar and lexical
specifications.  They only differ in their actions.  Digest.jj takes a
mail file as standard input and produces a digest version of the
messages to standard output.  This is what we used (before we moved
over to an automatic mailing list software) to produce the weekly mail
digest of the JavaCC mailing list.  Faq.jj takes a mail file as
standard input and produces a mail FAQ in HTML format.  It produces a
file "index.html" that contains all the mail headers and links to
other HTML files called "1.html", "2.html", etc. that contain the 1st,
2nd, etc. messages.  The parser generated from Faq.jj accepts an
optional integer argument that specifies the message number from where
to begin processing.

Type the following:

	javacc Digest.jj
	javacc Faq.jj
	javac *.java
	java Digest < sampleMailFile > digestFile
	java Faq < sampleMailFile

And take a look at the output files created.


MORE DETAILS ON HOW THE GRAMMARS WORK:

The grammar specification is rather trivial.  It simply says that a
MailFile is is a sequence of MailMessage's followed by an end of file.
And that a MailMessage is a list of one or more <SUBJECT>'s, <FROM>'s,
and <DATE>'s followed by a list of zero or more <BODY>'s followed by
an <END>.

Essentially, there are five lexical tokens:

<SUBJECT>, <FROM, <DATE> : Are the strings of the Subject, From, and
Date fields of the message.

<BODY> : Is a single line of the message body.

<END> : is the end of message token.

The lexical specification starts with a set of reusable private
regular expressions EOL, TWOEOLS, and NOT_EOL.  These are defined to
be portable across different platforms where the end of line
characters are different.

In the <DEFAULT> lexical state, the token manager simply eats up
characters until it sees the beginning of a message which is marked by
<<EOL> "*** EOOH ***" <EOL>>.  See the sample mail file for details.
At this point, it switches to state <MAILHEADER>.

In the state <MAILHEADER>, two consecutive end of line's indicate the
end of the mail header and therefore the token manager goes to the
<MAILBODY> lexical state.  Also, if it sees "Subject: ", it goes to
the <MAILSUBJECT> lexical state, and similarly for "From: " and "Date: ".

The end of message is signified by the "^_" character, of "\u001f".
The state <MAILBODY> returns to the <DEFAULT> state when it sees this
character, otherwise it simply returns message body lines one by one.

The general flow chart for the lexical states is shown below.  It is
useful to make such a diagram when building complex lexical
specifications.


      <DEFAULT> ---> <MAILHEADER> --+--> <MAILSUBJECT> -->+
       ^                |    ^      |                     |
       |                |    |      |                     |
       |                |    |      +--> <MAILFROM> ----->+
       +- <MAILBODY> <--+    |      |                     |
                             |      |                     |
                             |      +--> <MAILDATE> ----->+
                             |                            |
                             |                            |
                             +----------------------------+