<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<html>
<body>
<h2>GSF Lexing</h2>
<p>
GSF requires you to provide a lexer for your language. The lexer should
implement the NetBeans Lexer API.
In addition, you have to register the lexer, as well as
color definitions. (I'd like to remove the need for this step
by having GSF do it for you.) See the
<a href="#registration">registration section</a> for details on this.
</p>
<p>
Writing a lexer using the NetBeans lexing API is pretty easy.
There is already quite a bit of documentation for the lexer itself,
so I won't repeat any of that here. However, GSF is often used to wrap
languages with existing lexers and parsers, which I'll get into next.
</p>
<h2>Wrapping Existing Lexers</h2>
<p>
If you are trying to add language support for a popular language,
chances are you already have a lexer for it - and you don't want to
write one from scratch. After all, if you're trying to support,
say, Groovy, why duplicate the Groovy compiler's lexer and risk
making mistakes, such that your IDE support doesn't correctly
handle exactly the same keywords, comment rules, etc. as the
language itself? For the Ruby support in NetBeans, I'm using the JRuby
lexer. It turns out lexing Ruby is pretty tricky - you should take
a look at their lexer!
</p>
<p>
If you are wrapping an existing lexer, there are two things you
need to worry about. One of them is easy; the other is probably hard:
<ol>
<li>
Most lexers written for these languages (Ruby, JavaScript,
Groovy, PHP, Scala, Python, etc.) were intended for use
by a parser. If you're trying to reuse a parser's lexer,
you'll run into a problem. Parsers don't care about
whitespace and comments! Typically, they'll just throw
them away and only tokenize the rest of the buffer
that is relevant for the parser. That won't do for your
IDE lexer! It must return a TokenId for ALL characters
in the buffer, and in particular, whitespace and comments
too! Thus, you have to modify your lexer to not throw
these things away, but return proper tokens for them
instead. I modified both Rhino (for JavaScript) and JRuby
(for Ruby) to do this. In both cases it involved changing
a "continue" in a for loop (where the lexer had just eaten
whitespace) to a "return whitespace/comment token", and
a little bit of futzing to make sure the parser would
correctly handle coming back from this state.
</li>
<li>
The lexer must be incremental! This means that your lexer
wrapper needs to be able to restart your wrapped lexer
at any position in the buffer (well, at any token boundary,
to be more exact) and continue lexing from there. This
is used heavily in the IDE; if you're editing a 4,000-line
JavaScript file, we don't start lexing from the top
for every character you type! The editor is pretty smart:
as soon as your token stream matches the old token
stream, it stops lexing again, which means it ends
up doing very little work for normal typing. And if you,
say, type <code>/*</code> to start a comment, it will
immediately relex the rest of the screen to reflect that
it's all a big comment now.
</li>
</ol>
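To make the first point concrete, here is a tiny, framework-free sketch in plain Java (the class and token names are invented for illustration; a real GSF lexer returns NetBeans <code>Token</code> objects via the Lexer API rather than strings). The key difference from a parser-oriented lexer is in the whitespace branch, which returns a token instead of silently continuing:

```java
import java.util.ArrayList;
import java.util.List;

// Toy editor-style lexer: unlike a parser's lexer, it never skips
// characters - whitespace gets its own token, so the token stream
// covers every character in the buffer.
public class ToyLexer {
    public enum Id { WHITESPACE, WORD, EOF }

    public static final class Token {
        public final Id id;
        public final String text;
        Token(Id id, String text) { this.id = id; this.text = text; }
    }

    private final String buffer;
    private int pos;

    public ToyLexer(String buffer) { this.buffer = buffer; }

    public Token nextToken() {
        if (pos >= buffer.length()) return new Token(Id.EOF, "");
        int start = pos;
        if (Character.isWhitespace(buffer.charAt(pos))) {
            while (pos < buffer.length() && Character.isWhitespace(buffer.charAt(pos))) pos++;
            // A parser's lexer would "continue" here and drop the text;
            // an editor lexer must return a real token instead:
            return new Token(Id.WHITESPACE, buffer.substring(start, pos));
        }
        while (pos < buffer.length() && !Character.isWhitespace(buffer.charAt(pos))) pos++;
        return new Token(Id.WORD, buffer.substring(start, pos));
    }

    // Convenience: lex the entire buffer into a list of tokens.
    public static List<Token> lexAll(String buffer) {
        ToyLexer lexer = new ToyLexer(buffer);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.id != Id.EOF; t = lexer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

Concatenating the text of all returned tokens reproduces the buffer exactly, which is the invariant the editor relies on.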
Modifying your lexer to return whitespace and comment tokens
should be pretty trivial. Adding incremental support might not be so
easy. For JRuby, this involved figuring out all the state that
is needed by the lexer, extracting it into a separate
state object (as space- and performance-efficient as possible),
and then stashing away one of these for each token generated.
(The IDE makes this part easy.)
There is also really good unit testing support for the Lexer API,
which lets you easily do token dumps as well as incremental
lexing tests, where the test harness performs random edits on your
documents and, at each step, diffs the incrementally lexed token
hierarchy against a token hierarchy obtained by lexing your entire
file from the top.
</p>
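<p>
The incremental requirement can be sketched the same way (again plain Java with invented names, not the actual NetBeans Lexer SPI): every piece of lexer state is captured in a small state object, and a lexer constructed from a saved offset/state pair at a token boundary must produce exactly the same tokens as one that lexed straight through from the top:
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy incremental lexer: the only lexing state is whether we are
// inside a /* ... */ comment, captured in a State value that can be
// stashed per token and used to restart lexing at a token boundary.
public class ToyIncrementalLexer {
    public enum Id { TEXT, COMMENT, EOF }
    public enum State { DEFAULT, IN_COMMENT }

    public static final class Token {
        public final Id id;
        public final String text;
        Token(Id id, String text) { this.id = id; this.text = text; }
    }

    private final String buffer;
    private int pos;
    private State state;

    public ToyIncrementalLexer(String buffer, int offset, State restartState) {
        this.buffer = buffer;
        this.pos = offset;
        this.state = restartState;
    }

    public State state() { return state; }  // the editor stashes this per token
    public int offset() { return pos; }

    public Token nextToken() {
        if (pos >= buffer.length()) return new Token(Id.EOF, "");
        int start = pos;
        if (state == State.DEFAULT && buffer.startsWith("/*", pos)) {
            state = State.IN_COMMENT;
            pos += 2;
        }
        if (state == State.IN_COMMENT) {
            int end = buffer.indexOf("*/", pos);
            if (end < 0) {
                pos = buffer.length();  // unterminated: IN_COMMENT survives into the next restart
            } else {
                pos = end + 2;
                state = State.DEFAULT;
            }
            return new Token(Id.COMMENT, buffer.substring(start, pos));
        }
        int open = buffer.indexOf("/*", pos);
        pos = (open < 0) ? buffer.length() : open;
        return new Token(Id.TEXT, buffer.substring(start, pos));
    }

    // Lex from a given restart point to the end of the buffer.
    public static List<Token> lexFrom(String buffer, int offset, State restartState) {
        ToyIncrementalLexer lexer = new ToyIncrementalLexer(buffer, offset, restartState);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.id != Id.EOF; t = lexer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

<p>
Restarting with <code>lexFrom(buffer, offset(), state())</code> after any token yields the same remaining token stream as lexing from offset 0, which is exactly the property the incremental lexing tests verify with random edits.
</p>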
<p>
If you want code inspiration, the RubyLexer in the
<code>ruby</code> module and the
JsLexer in the <code>javascript.editing</code> module show how
this was done for Ruby and JavaScript.
</p>
<a name="registration"></a>
<h3>Lexer Registration and Colors</h3>
<p>
In addition to providing your Lexer language from your language configuration
object (as described in the <a href="registration.html">registration document</a>),
you should probably also register the lexer language with NetBeans itself. This allows
language embedding to work more naturally, because NetBeans (not just GSF) can then
locate the lexer language for a given mime type. <b>Yes, there is a redundancy here</b>,
in that both GSF and the editor need you to register the lexer language. Either GSF
should read the information directly from the editor's location, or GSF should
automatically register the lexer language on your behalf there. I'll look into fixing
this. But for now, add the following registration in the <code>Editors/mimetype</code> folder:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
&lt;folder name="Editors"&gt;
  &lt;folder name="text"&gt;
    &lt;folder name="x-ruby"&gt;
      ...
      <b>&lt;file name="language.instance"&gt;
        &lt;attr name="instanceCreate" methodvalue="org.netbeans.modules.ruby.lexer.RubyTokenId.language"/&gt;
        &lt;attr name="instanceOf" stringvalue="org.netbeans.api.lexer.Language"/&gt;
      &lt;/file&gt;</b>
    &lt;/folder&gt;
  &lt;/folder&gt;
&lt;/folder&gt;
</pre>
Note that <code>language.instance</code> here is under the <code>Editors</code> folder,
and refers to a Lexer Language,
whereas the language configuration object, also in <code>language.instance</code> file,
is under the <code>GsfPlugins</code> folder, and refers to a GsfLanguage object.
</p>
<p>
You can also register color definitions for the
<code>TokenIds</code> that your lexer creates. Usually, you'll want to
inherit as many colors from the defaults as possible, leaving color and font
management up to the defaults supplied by the various themes.
To register colors for the default theme, use a registration like this:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
&lt;folder name="Editors"&gt;
  &lt;folder name="text"&gt;
    &lt;folder name="x-ruby"&gt;
      ...
      <b>&lt;folder name="FontsColors"&gt;
        &lt;folder name="NetBeans"&gt;
          &lt;folder name="Defaults"&gt;
            &lt;file name="coloring.xml" url="fontsColors.xml"&gt;
              &lt;attr name="SystemFileSystem.localizingBundle" stringvalue="org.netbeans.modules.ruby.Bundle"/&gt;
            &lt;/file&gt;
          &lt;/folder&gt;
        &lt;/folder&gt;
      &lt;/folder&gt;</b>
    &lt;/folder&gt;
  &lt;/folder&gt;
&lt;/folder&gt;
</pre>
Here, we are referencing two other files. First, a <code>fontsColors.xml</code> file, which supplies
a set of color definitions for our token types:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
&lt;fontcolor name="STRING_LITERAL" default="string"/&gt;
&lt;fontcolor name="DOUBLE_LITERAL" default="number"/&gt;
&lt;fontcolor name="BLOCK_COMMENT" default="comment"/&gt;
&lt;fontcolor name="DOCUMENTATION" default="comment"/&gt;
&lt;fontcolor name="LONG_LITERAL" default="number"/&gt;
&lt;fontcolor name="REGEXP_LITERAL" foreColor="9933CC"/&gt;
&lt;fontcolor name="ERROR" default="error"/&gt;
...
</pre>
Here, <code>STRING_LITERAL</code> is the enum-name of the <code>TokenId</code> corresponding
to a String literal, and so on. As you can see, in most cases we are just referring
to logical styles like <code>string</code>, <code>number</code>, and so on. In the
case of regular expressions, there isn't a built-in type for that, so we specify
a custom color. The editor plans to provide a larger set of built-in definitions
so that you shouldn't have to do this.
</p>
<p>
Second, the color registration referenced a particular <code>Bundle.properties</code> file,
where the color definitions are given display names. These are used in the Fonts &amp; Colors
options dialog, where users can click on the logical names of style definitions and
customize them. In your <code>Bundle.properties</code> file, you need something
like this:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
STRING_LITERAL=String
DOUBLE_LITERAL=Double
BLOCK_COMMENT=Block Comment
STRING_TEXT=String
QUOTED_STRING_LITERAL=Quoted String
LONG_LITERAL=Long
STRING_ESCAPE=String Escape
DOCUMENTATION=Documentation
...
</pre>
</p>
<br/>
<span style="color: #cccccc">Tor Norbye &lt;tor@netbeans.org&gt;</span>
</body>
</html>