---
title: MakeOTFEXE Feature File Parser Notes
layout: default
---
MakeOTFEXE Feature File Parser Notes
---
Copyright 2021 Adobe. All Rights Reserved. This software is licensed as
OpenSource, under the Apache License, Version 2.0. This license is available
at: http://opensource.org/licenses/Apache-2.0.
Document version 0.2
Last updated 24 May 2021
## 1. Introduction
In 2021 the code in `makeotfexe` that parses and processes feature files was
upgraded from a pccts (Antlr 1) implementation in C to an Antlr 4
implementation in C++. One reason for the change was to provide a more
contemporary and better documented context for implementation of future changes
to the format of feature files, including additional commands for variable
fonts. This document discusses the new system, sometimes in contrast to the
previous system, to aid those future changes.
There is a healthy amount of Antlr 4 documentation
[here](https://github.com/antlr/antlr4/blob/master/doc/index.md), and also
a [book](https://pragprog.com/titles/tpantlr2/the-definitive-antlr-4-reference/)
you can buy.
## 2. Antlr Files
The pccts-based parser had a single source file containing the lexical tokens,
the feature file grammar, and snippets of C code (mostly calls to functions
defined in `feat.c`) to process the files. In the Antlr 4 implementation the
lexer is defined primarily in `FeatLexerBase.g4` and the file grammar in
`FeatParser.g4`. Neither of these includes target-language code, so both could
be used for feature file parsing in other target languages such as Java or
Python.
The additional file `FeatLexer.g4` imports `FeatLexerBase.g4` and has a small
amount of C++ code to recognize `anon` blocks. It is this file that defines
the actual Lexer and the generated files are accordingly `FeatLexer.h` and
`FeatLexer.cpp`.
The parser is similarly implemented by `FeatParser.h` and `FeatParser.cpp` and
there is also an abstract `FeatParserVisitor` class and a
`FeatParserBaseVisitor` implementation generated by Antlr 4. `FeatParser.h` is
the most useful file to refer to; it has the naming conventions and internal
structure of each of the nodes of the parse tree.
### 2.1 Generating
All of the derived files can currently be regenerated by running `python
BuildGrammar.py` in the `hotconv` source directory. This assumes that `antlr4`
is installed, in your path, and has a version matching the one hard-coded in
the script. The command also has a `-c` option that removes all the generated
files. (However, because the files are tracked in `git` you typically want to
include them in any updates.)
### 2.2 Antlr Runtime and Versions
The root `CMakeLists.txt` file has a line like `set(ANTLR4_TAG tags/4.9.2)`.
`hotconv/BuildGrammar.py` has a matching line `antlr_version = "4.9.2"`. These
should be updated together to ensure the runtime (which is pulled down from
the Antlr 4 git repository) matches the generated files. When you update the
version remember to clean and regenerate the grammar.
## 3. Other Files
The new files `FeatCtx.h` and `FeatCtx.cpp` correspond to the old `feat.c`.
This C++ class mostly consists of utility and adapter code that should be
recognizable to people familiar with the previous system. The new files
`FeatVisitor.h` and `FeatVisitor.cpp` correspond to the snippets of C in
the old `featgram.g`, but in contrast with `FeatCtx` the new code is quite
different.
## 4. `FeatVisitor` and the Visitor Semantic
Antlr 4 can be used in different ways but its authors recommend using the
parser to build a parse tree and then traversing that tree with code written in
the target language. Antlr can optionally produce "listeners" and "visitors"
and the `makeotfexe` code uses the latter. In effect there is one virtual
method corresponding to each of the types of node in the tree. The default
implementation for a node just calls the method for each child node passing it
the child context. One processes the tree by replacing the default
implementation for a given node with one to do the processing.
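The default-then-override behavior can be illustrated without Antlr at all. The following is a minimal sketch, not the generated code: `Node`, `BlockNode`, `StatementNode`, `Visitor`, `LoggingVisitor` and `runVisitorSketch` are hypothetical stand-ins for the generated context and visitor classes, and the feature statements are placeholder strings.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Antlr's generated classes; not the real API.
struct Visitor;

struct Node {
    std::vector<std::unique_ptr<Node>> children;
    virtual ~Node() = default;
    // Each node type dispatches to "its" visitor method.
    virtual void accept(Visitor &v) = 0;
};

struct StatementNode : Node {
    std::string text;
    explicit StatementNode(std::string t) : text(std::move(t)) {}
    void accept(Visitor &v) override;
};

struct BlockNode : Node {
    std::string tag;
    explicit BlockNode(std::string t) : tag(std::move(t)) {}
    void accept(Visitor &v) override;
};

struct Visitor {
    virtual ~Visitor() = default;
    // Default behavior: descend into each child in order.
    void visitChildren(Node &n) {
        for (auto &c : n.children) c->accept(*this);
    }
    // One virtual method per node type, each defaulting to visitChildren().
    virtual void visitBlock(BlockNode &n) { visitChildren(n); }
    virtual void visitStatement(StatementNode &n) { visitChildren(n); }
};

void StatementNode::accept(Visitor &v) { v.visitStatement(*this); }
void BlockNode::accept(Visitor &v) { v.visitBlock(*this); }

// Replaces the default implementation only where processing is needed.
struct LoggingVisitor : Visitor {
    std::vector<std::string> log;
    void visitBlock(BlockNode &n) override {
        log.push_back("start " + n.tag);
        visitChildren(n);  // still responsible for reaching the children
        log.push_back("end " + n.tag);
    }
    void visitStatement(StatementNode &n) override {
        log.push_back("stmt " + n.text);
    }
};

// Builds a tiny tree and returns the order in which nodes were processed.
std::vector<std::string> runVisitorSketch() {
    BlockNode root("liga");
    root.children.push_back(std::make_unique<StatementNode>("sub f i by f_i"));
    root.children.push_back(std::make_unique<StatementNode>("sub f l by f_l"));
    LoggingVisitor v;
    root.accept(v);
    return v.log;
}
```

Calling `runVisitorSketch()` yields the block start, the two statements, and the block end, in that order. An override that forgets to visit its children silently skips the whole subtree, which is why the real visitor methods loop over their child nodes explicitly.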
Here, for example, is the grammar of a `featureBlock`:
```
featureBlock:
    FEATURE starttag=tag USE_EXTENSION? LCBRACE
    featureStatement+
    RCBRACE endtag=tag SEMI
;
```
This is the corresponding `Context` class in `FeatParser.h`:
```
class FeatureBlockContext : public antlr4::ParserRuleContext {
 public:
    FeatParser::TagContext *starttag = nullptr;
    FeatParser::TagContext *endtag = nullptr;
    FeatureBlockContext(antlr4::ParserRuleContext *parent, size_t invokingState);
    virtual size_t getRuleIndex() const override;
    antlr4::tree::TerminalNode *FEATURE();
    antlr4::tree::TerminalNode *LCBRACE();
    antlr4::tree::TerminalNode *RCBRACE();
    antlr4::tree::TerminalNode *SEMI();
    std::vector<TagContext *> tag();
    TagContext* tag(size_t i);
    antlr4::tree::TerminalNode *USE_EXTENSION();
    std::vector<FeatureStatementContext *> featureStatement();
    FeatureStatementContext* featureStatement(size_t i);

    virtual antlrcpp::Any accept(antlr4::tree::ParseTreeVisitor *visitor) override;
};
```
And this is a simplified version of the present visitor method for `featureBlock`
nodes:
```
antlrcpp::Any FeatVisitor::visitFeatureBlock(FeatParser::FeatureBlockContext *ctx) {
    if ( stage == vExtract ) {
        Tag t = checkTag(ctx->starttag, ctx->endtag);
        TOK(ctx);
        fc->startFeature(t);
        if ( ctx->USE_EXTENSION() != nullptr )
            fc->flagExtension(false);
    }

    for (auto i : ctx->featureStatement())
        visitFeatureStatement(i);

    if ( stage == vExtract ) {
        TOK(ctx->endtag);
        fc->endFeature();
    }
    return nullptr;
}
```
First note the `antlr4::tree::TerminalNode` pointers in the context object.
These can be checked for `nullptr` when keywords are optional (as with
`USE_EXTENSION`), otherwise they can generally be ignored.
`checkTag()` in `visitFeatureBlock()` is a utility method that verifies that
the start and end tags are equal (or outputs an error) and returns the tag.
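As a rough illustration of that contract, here is a minimal sketch under simplifying assumptions: tags are plain strings rather than `TagContext` nodes, errors are collected in a vector rather than reported through `FeatCtx`, and the start tag is the one returned on a mismatch. The name `checkTagSketch` and the message wording are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical simplification of the start/end tag check. Returning the
// start tag on a mismatch is an assumption made for this sketch.
std::string checkTagSketch(const std::string &start, const std::string &end,
                           std::vector<std::string> &errors) {
    if (start != end) {
        errors.push_back("End tag '" + end +
                         "' does not match start tag '" + start + "'");
    }
    return start;
}
```

The point of the shape is that the caller always gets a tag back, so processing can continue after the error has been recorded.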
`startFeature()` is a `FeatCtx` method that prepares for new feature statements
and `endFeature()` is a corresponding method that wraps up feature processing.
In between the method calls `visitFeatureStatement()` on each child
`featureStatement` node in order, fulfilling its role as a "visitor".
The `stage` guards represent two stages of tree processing: `vInclude` and
`vExtract`. `vInclude` only involves opening and parsing included feature
files, and therefore the `vInclude` processing stage needs to reach each
include node without doing anything else. The parse tree is processed in
`vExtract`.
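The effect of the guards can be sketched with a toy tree walked once per stage. This is an assumption-laden illustration, not the `FeatVisitor` code: `Kind`, `Node`, `StagedWalker` and `runStage` are hypothetical, includes are only logged rather than opened and parsed, and what happens to included content during extraction is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified tree; not the Antlr-generated contexts.
enum class Kind { Block, Include, Statement };

struct Node {
    Kind kind;
    std::string text;
    std::vector<Node> children;
};

enum Stage { vInclude, vExtract };

struct StagedWalker {
    Stage stage = vInclude;
    std::vector<std::string> log;

    void visit(const Node &n) {
        switch (n.kind) {
        case Kind::Include:
            // Only the include stage reacts to include nodes.
            if (stage == vInclude) log.push_back("parse include " + n.text);
            break;
        case Kind::Statement:
            if (stage == vExtract) log.push_back("process " + n.text);
            break;
        case Kind::Block:
            // The traversal itself is unguarded so that the include stage
            // can still reach include nodes nested inside the block.
            if (stage == vExtract) log.push_back("start " + n.text);
            for (const auto &c : n.children) visit(c);
            if (stage == vExtract) log.push_back("end " + n.text);
            break;
        }
    }
};

// Walks the same small tree in the requested stage and returns the log.
std::vector<std::string> runStage(Stage s) {
    Node root{Kind::Block, "kern", {
        Node{Kind::Include, "common.fea", {}},
        Node{Kind::Statement, "pos A V -50", {}},
    }};
    StagedWalker w;
    w.stage = s;
    w.visit(root);
    return w.log;
}
```

Running `runStage(vInclude)` records only the include, while `runStage(vExtract)` records the block start, the statement, and the block end. The loop over children sits outside both guards, mirroring the unguarded `for` loop in `visitFeatureBlock()`.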
The remaining unreferenced syntax (besides the `antlrcpp::Any` return value,
which is an unused Antlr-ism) is the `TOK()` method. This is actually
overloaded: `FeatVisitor.h` defines one method and two method templates that
accept and return tree nodes or tokens. `TOK()` should be called on a relevant
child node, and sometimes on the current node (as in `TOK(ctx)`), before
calling out to a `FeatCtx` method. It sets the token used to report the line
number and character offset of a warning or error.
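The overloading idea can be sketched as follows. This is not the code in `FeatVisitor.h`: `Token`, `TagNode`, `BlockNode` and `Reporter` are hypothetical stand-ins, the sketch uses one template where the real header has two, and it assumes every node exposes its first token as a `start` member.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins; the real types come from the Antlr 4 runtime.
struct Token { std::size_t line; std::size_t column; };
struct TagNode { Token start; std::string text; };
struct BlockNode { Token start; TagNode endtag; };

class Reporter {
 public:
    // Plain overload: remember a token directly and hand it back.
    Token *TOK(Token *t) {
        current_ = t;
        return t;
    }

    // Template overload: remember the first token of any node type that has
    // a `start` member, and return the node so the call can be nested.
    template <typename NodeT>
    NodeT *TOK(NodeT *node) {
        current_ = &node->start;
        return node;
    }

    // Position that a warning or error raised now would be reported at.
    std::string where() const {
        if (current_ == nullptr) return "<unknown>";
        return std::to_string(current_->line) + ":" +
               std::to_string(current_->column);
    }

 private:
    const Token *current_ = nullptr;
};
```

Because each overload returns its argument, a call such as `r.TOK(&block.endtag)` can wrap an expression in place while updating the position reported by `where()`.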
## 5. Include directives
If you need to add another context that supports an include directive just
follow the model of one of the existing contexts. Don't forget to add a new
`EOF`-including `File` node toward the bottom of `FeatParser.g4`.
## 6. Document revisions
**v0.2 [24 May 2021]: Update when feature complete**
**v0.1 [11 May 2021]: First version**