---
title: MakeOTFEXE Feature File Parser Notes
layout: default
---

MakeOTFEXE Feature File Parser Notes
---

Copyright 2021 Adobe. All Rights Reserved. This software is licensed as
OpenSource, under the Apache License, Version 2.0. This license is available
at: http://opensource.org/licenses/Apache-2.0.

Document version 0.2
Last updated 24 May 2021

## 1. Introduction

In 2021 the code in `makeotfexe` that parses and processes feature files was
upgraded from a pccts (Antlr 1) implementation in C to an Antlr 4
implementation in C++.  One reason for the change was to provide a more
contemporary and better documented context for implementation of future changes
to the format of feature files, including additional commands for variable
fonts. This document discusses the new system, sometimes in contrast to the
previous system, to aid those future changes.

There is a healthy amount of Antlr 4 documentation
[here](https://github.com/antlr/antlr4/blob/master/doc/index.md), and also
a [book](https://pragprog.com/titles/tpantlr2/the-definitive-antlr-4-reference/)
you can buy.

## 2. Antlr Files

The pccts-based parser had a single source file containing the lexical tokens,
the feature file grammar, and snippets of C code (mostly calls to functions
defined in `feat.c`) to process the files. In the Antlr 4 implementation the
lexer is defined primarily in `FeatLexerBase.g4` and the file grammar in
`FeatParser.g4`. Neither of these includes target-language code, so both
could be used for feature file parsing in other target languages such as Java
or Python.

The additional file `FeatLexer.g4` imports `FeatLexerBase.g4` and has a small
amount of C++ code to recognize `anon` blocks. It is this file that defines
the actual Lexer and the generated files are accordingly `FeatLexer.h` and
`FeatLexer.cpp`.

The parser is similarly implemented by `FeatParser.h` and `FeatParser.cpp` and
there is also an abstract `FeatParserVisitor` class and a
`FeatParserBaseVisitor` implementation generated by Antlr 4. `FeatParser.h` is
the most useful file to refer to; it has the naming conventions and internal
structure of each of the nodes of the parse tree.

### 2.1 Generating

All of the derived files can currently be regenerated by running `python
BuildGrammar.py` in the `hotconv` source directory. This assumes that `antlr4`
is installed, is in your path, and has a version matching the one hard-coded in
the script. The command also has a `-c` option that removes all the generated
files. (However, because the files are tracked in `git` you typically want to
include them in any updates.)

### 2.2 Antlr Runtime and Versions

The root `CMakeLists.txt` file has a line like `set(ANTLR4_TAG tags/4.9.2)`.
`hotconv/BuildGrammar.py` has a matching line `antlr_version = "4.9.2"`. These
should be updated together to ensure the runtime (which is pulled down from
the Antlr 4 git repository) matches the generated files. When you update the
version remember to clean and regenerate the grammar.

## 3. Other Files

The new files `FeatCtx.h` and `FeatCtx.cpp` correspond to the old `feat.c`.
This C++ class mostly consists of utility and adapter code that should be
recognizable to people familiar with the previous system. The new files
`FeatVisitor.h` and `FeatVisitor.cpp` correspond to the snippets of C in
the old `featgram.g`, but in contrast with `FeatCtx` the new code is quite
different.

## 4. `FeatVisitor` and the Visitor Semantic

Antlr 4 can be used in different ways but its authors recommend using the
parser to build a parse tree and then traversing that tree with code written in
the target language. Antlr can optionally produce "listeners" and "visitors"
and the `makeotfexe` code uses the latter. In effect there is one virtual
method corresponding to each of the types of node in the tree. The default
implementation for a node just calls the method for each child node passing it
the child context.  One processes the tree by replacing the default
implementation for a given node with one to do the processing.
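
The visitor arrangement described above can be sketched without the Antlr
runtime. Everything in this example (`Node`, `BlockNode`, `StatementNode`,
`Visitor`, `Collector`, `collectStatements`) is a hypothetical stand-in rather
than generated or `makeotfexe` code. It only illustrates the idea of a default
method that visits children and an override that does the processing.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Antlr-generated classes; not the real API.
struct Visitor;

struct Node {
    std::vector<std::unique_ptr<Node>> children;
    virtual ~Node() = default;
    virtual void accept(Visitor &v) = 0;  // dispatches to the matching visit method
};

struct StatementNode : Node {
    std::string text;
    explicit StatementNode(std::string t) : text(std::move(t)) {}
    void accept(Visitor &v) override;
};

struct BlockNode : Node {
    void accept(Visitor &v) override;
};

// One virtual method per node type; each default just visits the children.
struct Visitor {
    virtual ~Visitor() = default;
    void visitChildren(Node &n) {
        for (auto &child : n.children) child->accept(*this);
    }
    virtual void visitBlock(BlockNode &n) { visitChildren(n); }
    virtual void visitStatement(StatementNode &n) { visitChildren(n); }
};

void StatementNode::accept(Visitor &v) { v.visitStatement(*this); }
void BlockNode::accept(Visitor &v) { v.visitBlock(*this); }

// Processing happens by overriding the default for the node type of interest.
struct Collector : Visitor {
    std::vector<std::string> seen;
    void visitStatement(StatementNode &n) override { seen.push_back(n.text); }
};

// Builds a small tree (a block holding a statement and a nested block) and
// returns the statement texts in the order the visitor reaches them.
std::vector<std::string> collectStatements() {
    BlockNode root;
    root.children.push_back(std::make_unique<StatementNode>("sub a by b;"));
    auto inner = std::make_unique<BlockNode>();
    inner->children.push_back(std::make_unique<StatementNode>("pos a b -10;"));
    root.children.push_back(std::move(inner));

    Collector collector;
    root.accept(collector);
    return collector.seen;
}
```

Because `Collector` leaves `visitBlock()` alone, the nested block is still
traversed by the default implementation and its statement is collected too.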

Here, for example, is the grammar of a `featureBlock`:

```
featureBlock:
    FEATURE starttag=tag USE_EXTENSION? LCBRACE
    featureStatement+
    RCBRACE endtag=tag SEMI
;
```

This is the corresponding `Context` class in `FeatParser.h`:

```
  class  FeatureBlockContext : public antlr4::ParserRuleContext {
  public:
    FeatParser::TagContext *starttag = nullptr;
    FeatParser::TagContext *endtag = nullptr;
    FeatureBlockContext(antlr4::ParserRuleContext *parent, size_t invokingState);
    virtual size_t getRuleIndex() const override;
    antlr4::tree::TerminalNode *FEATURE();
    antlr4::tree::TerminalNode *LCBRACE();
    antlr4::tree::TerminalNode *RCBRACE();
    antlr4::tree::TerminalNode *SEMI();
    std::vector<TagContext *> tag();
    TagContext* tag(size_t i);
    antlr4::tree::TerminalNode *USE_EXTENSION();
    std::vector<FeatureStatementContext *> featureStatement();
    FeatureStatementContext* featureStatement(size_t i);


    virtual antlrcpp::Any accept(antlr4::tree::ParseTreeVisitor *visitor) override;

  };
```

And this is a simplified version of the present visitor method for `featureBlock`
nodes:

```
antlrcpp::Any FeatVisitor::visitFeatureBlock(FeatParser::FeatureBlockContext *ctx) {
    if ( stage == vExtract ) {
        Tag t = checkTag(ctx->starttag, ctx->endtag);
        TOK(ctx);
        fc->startFeature(t);
        if ( ctx->USE_EXTENSION() != nullptr )
            fc->flagExtension(false);
    }

    for (auto i : ctx->featureStatement())
        visitFeatureStatement(i);

    if ( stage == vExtract ) {
        TOK(ctx->endtag);
        fc->endFeature();
    }
    return nullptr;
}
```

First note the `antlr4::tree::TerminalNode` pointers in the context object.
These can be checked for `nullptr` when keywords are optional (as with
`USE_EXTENSION`), otherwise they can generally be ignored.

`checkTag` in `visitFeatureBlock()` is a utility method that verifies the start
and end tags are equal (or outputs an error) and returns the tag.
`startFeature()` is a `FeatCtx` method that prepares for new feature statements
and `endFeature()` is a corresponding method that wraps up feature processing.
In between the method calls `visitFeatureStatement()` on each child
`featureStatement` node in order, fulfilling its role as a "visitor".
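
As a rough illustration of the start/end tag check, here is a minimal sketch.
It assumes tags arrive as plain strings and that errors are collected in a
vector. The real `checkTag` takes `TagContext` pointers and reports through
`FeatCtx`, so `makeTag`, the signature, and the error text below are all
hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Tag = std::uint32_t;

// Packs up to four characters into a 32-bit value, padding with spaces,
// which is the usual in-memory form of an OpenType tag.
Tag makeTag(const std::string &s) {
    Tag t = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        unsigned char c = i < s.size() ? static_cast<unsigned char>(s[i]) : ' ';
        t = (t << 8) | c;
    }
    return t;
}

// Hypothetical simplification: records an error when the tags differ and
// returns the start tag either way so that processing can continue.
Tag checkTag(const std::string &start, const std::string &end,
             std::vector<std::string> &errors) {
    if (makeTag(start) != makeTag(end))
        errors.push_back("end tag '" + end + "' does not match start tag '" +
                         start + "'");
    return makeTag(start);
}
```

Returning the start tag even on a mismatch is a choice made for this sketch; the
text above only says that the real method outputs an error and returns the tag.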

The `stage` guards represent two stages of tree processing: `vInclude` and
`vExtract`. `vInclude` only involves opening and parsing included feature
files, and therefore the `vInclude` processing stage needs to reach each
include node without doing anything else. The parse tree is processed in
`vExtract`.
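
The effect of the stage guards can be sketched with a flattened stand-in for
the tree. Only the stage names come from the code described above. `Item`,
`runStages`, and the log strings are hypothetical, and what the real visitor
does with include nodes during `vExtract` is left out of this sketch.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Stage names are taken from the text; everything else is a stand-in.
enum Stage { vInclude, vExtract };

struct Item {
    bool isInclude;    // true for an include node, false for any other statement
    std::string text;  // file name or statement text
};

// Walks the same "tree" once per stage. Include nodes act only in vInclude,
// and all other nodes act only in vExtract.
std::vector<std::string> runStages(const std::vector<Item> &tree) {
    std::vector<std::string> log;
    for (Stage stage : {vInclude, vExtract}) {
        for (const Item &item : tree) {
            if (item.isInclude) {
                if (stage == vInclude)
                    log.push_back("parse " + item.text);
            } else if (stage == vExtract) {
                log.push_back("process " + item.text);
            }
        }
    }
    return log;
}
```

With a tree holding a statement, an include, and another statement, the log
lists the include first, because the whole `vInclude` pass finishes before any
statement is processed.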

The remaining unreferenced syntax (besides the `antlrcpp::Any` return value,
which is an unused Antlr-ism) is the `TOK()` method. This is actually
overloaded with a method and two method templates in `FeatVisitor.h` that
accept and return tree nodes or tokens. `TOK` should be called on a relevant
child node, and sometimes the current node (as in `TOK(ctx)`), before calling
out to a `FeatCtx` method. It sets the token used to report the line number
and character offset of a warning or error.
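
One way such an overload set could look is sketched below, using stand-in types
in place of `antlr4::Token`, `antlr4::ParserRuleContext`, and
`antlr4::tree::TerminalNode`. The `Reporter` class, its `current` member, and
`currentLine()` are hypothetical; consult `FeatVisitor.h` for the actual
definitions.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the Antlr runtime types.
struct Token { int line; int column; };
struct RuleContext {
    Token *start = nullptr;
    Token *getStart() const { return start; }
};
struct TerminalNode {
    Token *symbol = nullptr;
    Token *getSymbol() const { return symbol; }
};
struct TagContext : RuleContext {};

class Reporter {
 public:
    // Plain method: remember the token itself.
    Token *TOK(Token *t) { current = t; return t; }

    // Template for rule contexts: remember the context's first token.
    template <class T, typename std::enable_if<
                           std::is_base_of<RuleContext, T>::value, int>::type = 0>
    T *TOK(T *ctx) {
        if (ctx) current = ctx->getStart();
        return ctx;
    }

    // Template for terminal nodes: remember the node's token.
    template <class T, typename std::enable_if<
                           std::is_base_of<TerminalNode, T>::value, int>::type = 0>
    T *TOK(T *node) {
        if (node) current = node->getSymbol();
        return node;
    }

    // Line that a warning or error would be reported against, or -1 if unset.
    int currentLine() const { return current ? current->line : -1; }

 private:
    Token *current = nullptr;
};
```

Because each overload returns its argument unchanged, a call can be wrapped
around a node inline, for example `useTag(r.TOK(tagCtx))` with a hypothetical
`useTag`, without changing what is passed along.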

## 5. Include directives

If you need to add another context that supports an include directive just
follow the model of one of the existing contexts. Don't forget to add a new
`EOF`-including `File` node toward the bottom of `FeatParser.g4`.

## 6. Document revisions

**v0.2 [24 May 2021]: Update when feature complete**

**v0.1 [11 May 2021]: First version**