GSF Lexing

GSF requires you to provide a lexer for your language. The lexer should implement the NetBeans Lexer API. In addition, you have to register the lexer, as well as color definitions. (I'd like to remove the need for this part by having GSF do it for you). See the registration section for details on this.

Writing a lexer using the NetBeans lexing API is pretty easy. There is already quite a bit of documentation for the lexer itself, so I won't repeat any of that here. However, GSF is often used to wrap languages with existing lexers and parsers which I'll get into next.

Wrapping Existing Lexers

If you are trying to add language support for a popular language, chances are a lexer for it already exists - and you don't want to write one from scratch. After all, if you're trying to support, say, Groovy, why duplicate the Groovy compiler's lexer and risk making mistakes such that your IDE support doesn't handle exactly the same keywords, commenting rules, etc. as the language itself? For the Ruby support in NetBeans, I'm using the JRuby lexer. It turns out lexing Ruby is pretty tricky - you should take a look at their lexer!

If you are wrapping an existing lexer there are two things you need to worry about. One of them is easy, the other one probably hard:

  1. Most lexers written for these languages (Ruby, JavaScript, Groovy, PHP, Scala, Python, etc.) were intended for use by a parser. If you're trying to reuse a parser's lexer, you'll run into a problem: parsers don't care about whitespace and comments! Typically, they just throw them away and tokenize only the parts of the buffer that are relevant to the parser. That won't do for your IDE lexer! It must return a TokenId for ALL characters in the buffer, and in particular for whitespace and comments too! Thus, you have to modify your lexer to not throw these things away, but to return proper tokens for them instead. I modified both Rhino (for JavaScript) and JRuby (for Ruby) to do this. In both cases it involved changing a "continue" in a loop (where the lexer had just eaten whitespace) into returning a whitespace or comment token, plus a little bit of futzing to make sure the lexer would correctly handle coming back from this state.
  2. The lexer must be incremental! This means that your lexer wrapper needs to be able to restart the wrapped lexer at any position in the buffer (at any token boundary, to be more exact) and continue lexing from there. This is used heavily in the IDE; if you're editing a 4,000-line JavaScript file, we don't start lexing from the top for every character you type! The editor is pretty smart: as soon as your new token stream matches the old token stream, it stops lexing, which means it ends up doing very little work for normal typing. And if you, say, type /* to start a comment, it will immediately relex the rest of the screen to reflect that it's all one big comment now.
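
The first change can be sketched with a toy, self-contained lexer (plain Java, not the real NetBeans Lexer API; all names are made up for illustration). Where a parser-oriented lexer would silently skip whitespace and "continue", an editor lexer returns a token for it, so that every character in the buffer is covered by some token:

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer illustrating the change: instead of silently skipping
// whitespace (as a parser's lexer does), emit a WHITESPACE token so
// that every character in the buffer belongs to some token.
public class ToyLexer {
    enum Kind { WHITESPACE, WORD }
    record Token(Kind kind, int start, int length) {}

    static List<Token> lex(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            if (Character.isWhitespace(text.charAt(i))) {
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
                // A parser's lexer would "continue" here; an editor lexer
                // must return a token covering the whitespace run instead.
                tokens.add(new Token(Kind.WHITESPACE, start, i - start));
            } else {
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
                tokens.add(new Token(Kind.WORD, start, i - start));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        int covered = 0;
        for (Token t : lex("def foo  bar")) covered += t.length();
        // All 12 characters are covered, including both whitespace runs.
        System.out.println(covered);
    }
}
```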

Modifying your lexer to return whitespace and comment tokens should be pretty trivial. Adding incremental support might not be so easy. For JRuby, this involved figuring out all the state needed by the lexer, extracting it into a separate state object that is as space- and time-efficient as possible, and then stashing one of these away for each token generated (the IDE makes this part easy). There is also really good unit testing support for the Lexer API: it lets you easily do token dumps, and it supports incremental lexing tests, where it performs random edits on your document and, at each step, diffs the incrementally lexed token hierarchy against a token hierarchy obtained by lexing your entire file from the top.
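
The idea behind the per-token state can be illustrated with another self-contained toy (again, not the real API; here the only inter-token state is a single "inside a block comment" flag, where a real lexer like JRuby's needs a whole state object). Capture the state at every token boundary; restarting from any boundary with the captured state then reproduces exactly the tokens a from-the-top lex would produce:

```java
import java.util.ArrayList;
import java.util.List;

// Toy incremental lexer: TEXT runs and /* ... */ comments. The state
// captured at each token boundary is just "are we inside a comment?",
// which is enough to restart lexing from that boundary.
public class IncrementalToy {
    enum Kind { TEXT, COMMENT }
    record Token(Kind kind, int start, int length, boolean inCommentAfter) {}

    static List<Token> lex(String text, int offset, boolean inComment) {
        List<Token> tokens = new ArrayList<>();
        int i = offset;
        while (i < text.length()) {
            int start = i;
            Kind kind;
            if (inComment || text.startsWith("/*", i)) {
                kind = Kind.COMMENT;
                // Consume up to and including "*/", or to EOF if unterminated.
                int close = text.indexOf("*/", inComment ? i : i + 2);
                i = (close < 0) ? text.length() : close + 2;
                inComment = (close < 0);
            } else {
                kind = Kind.TEXT;
                // Plain text up to the next comment opener (or EOF).
                int open = text.indexOf("/*", i);
                i = (open < 0) ? text.length() : open;
            }
            // Stash the lexer state as it is AFTER this token.
            tokens.add(new Token(kind, start, i - start, inComment));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "a /* b */ c /* d";
        List<Token> full = lex(text, 0, false);
        // Restart right after the first token, using its saved state; the
        // resumed lex must match the tail of the full lex.
        Token first = full.get(0);
        List<Token> resumed =
                lex(text, first.start() + first.length(), first.inCommentAfter());
        System.out.println(resumed.equals(full.subList(1, full.size())));
    }
}
```

This is exactly the property the Lexer API's random-edit unit tests check for you, just on a much smaller scale.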

If you want code inspiration, the RubyLexer in the ruby module and the JsLexer in the javascript.editing module show how this was done for Ruby and JavaScript.

Lexer Registration and Colors

In addition to providing your Lexer language from your language configuration object (as described in the registration document), you should also register the lexer language with NetBeans itself. This allows language embedding to work more naturally, because NetBeans (not just GSF) can locate the lexer language for a given mime type, which is used in language embedding scenarios. Yes, there is redundancy here: both GSF and the editor need you to register the Lexer language. Either GSF should read the information directly from the editor's location, or GSF should automatically register the lexer language on your behalf in the editor's location; I'll look into fixing this. But for now, add the following registration in the Editors/mimetype folder:

    
    <folder name="Editors">
        <folder name="text">
            <folder name="x-ruby">
                ...
                <file name="language.instance">
                    <attr name="instanceCreate" methodvalue="org.netbeans.modules.ruby.lexer.RubyTokenId.language"/>
                    <attr name="instanceOf" stringvalue="org.netbeans.api.lexer.Language"/>
                </file>
            </folder>
        </folder>
    </folder>
            
Note that the language.instance file here is under the Editors folder and refers to a Lexer Language, whereas the language configuration object, also registered in a language.instance file, is under the GsfPlugins folder and refers to a GsfLanguage object.

You can also register color definitions (as well as display names for them) for arbitrary TokenIds that your lexer creates. Usually you'll want to inherit as many colors from the defaults as possible, leaving color and font management up to the defaults supplied by the various themes. To register colors for the default theme, use a registration like this:

    
    <folder name="Editors">
        <folder name="text">
            <folder name="x-ruby">
                ...
                <folder name="FontsColors">
                    <folder name="NetBeans">
                        <folder name="Defaults">
                            <file name="coloring.xml" url="fontsColors.xml">
                                <attr name="SystemFileSystem.localizingBundle" stringvalue="org.netbeans.modules.ruby.Bundle"/>
                            </file>
                        </folder>
                    </folder>
                </folder>
            </folder>
        </folder>
    </folder>
            
Here, we are referencing two other files. First, a fontsColors.xml file, which supplies a set of color definitions for our token types:
    
    <fontcolor name="STRING_LITERAL" default="string"/>
    <fontcolor name="DOUBLE_LITERAL" default="number"/>
    <fontcolor name="BLOCK_COMMENT" default="comment"/>
    <fontcolor name="DOCUMENTATION" default="comment"/>
    <fontcolor name="LONG_LITERAL" default="number"/>
    <fontcolor name="REGEXP_LITERAL" foreColor="9933CC"/>
    <fontcolor name="ERROR" default="error"/>
    ...
            
Here, STRING_LITERAL is the enum name of the TokenId corresponding to a String literal, and so on. As you can see, in most cases we just refer to logical styles like string, number, and so on. In the case of regular expressions there is no built-in type, so we specify a custom color. The editor plans to provide a larger set of built-in definitions so that you shouldn't have to do this.

Second, the color registration referenced a particular Bundle.properties file, where the color definitions are given display names. These are used in the Fonts & Colors options dialog, where users get to click on the logical names of style definitions and customize them. In your Bundle.properties file, you need something like this:

    
STRING_LITERAL=String
DOUBLE_LITERAL=Double
BLOCK_COMMENT=Block Comment
STRING_TEXT=String
QUOTED_STRING_LITERAL=Quoted String
LONG_LITERAL=Long
STRING_ESCAPE=String Escape
DOCUMENTATION=Documentation
...            
            


Tor Norbye <tor@netbeans.org>