File: lang-tutorial.md

Title: Language Definition v2.0 Tutorial

Guide to the GtkSourceView language definition file format

# A language definition for the C language

To describe the syntax of a language, GtkSourceView uses an XML format that
defines nested contexts to be highlighted. Each context roughly corresponds
to a portion of the syntax that has to be highlighted (e.g. keywords,
strings, comments), and can contain nested contexts (e.g. escaped
characters).

In this tutorial we will analyze a simple example to highlight a subset of
C, based on the full C language definition.

Like every well-formed XML document, the language description starts with an
XML declaration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```

After the usual preamble, the main tag is the `<language>` element:

```xml
<language id="c" name="C" version="2.0" _section="Source">
```

The attribute `id` is used in external references and defines a standard
way to refer to this language definition, while the `name` attribute is
the name presented to the user.

The `section` attribute (translatable with gettext when prefixed with `_`,
as in `_section`) names the category under which this language is grouped
when it is presented to the user. The categories currently available in
GtkSourceView are "Source", "Script", "Markup", "Scientific" and "Other".

The `version` attribute specifies the version of the XML syntax
used in your language definition file, so it should always be `2.0`.

The `<language>` element contains three sections:
`<metadata>`, `<styles>` and `<definitions>`.

```xml
<metadata>
```

The `<metadata>` element is optional and provides a collection
of properties that carry arbitrary information about the language definition
file itself. It is particularly important to specify the conventional
`mimetypes` and `globs` properties, which GtkSourceView uses to detect
automatically which syntax highlighting to use for a given file. They
contain, respectively, a semicolon-separated list of MIME types and of
filename glob patterns.

```xml
<metadata>
  <property name="mimetypes">text/x-c;text/x-csrc</property>
  <property name="globs">*.c</property>
</metadata>
```
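Because the properties are arbitrary, the same block can carry more than the two detection properties. The sketch below adds three comment-delimiter properties; their names are taken from the `c.lang` file shipped with GtkSourceView, and how (or whether) an application uses them is an assumption here, not something this tutorial specifies.

```xml
<metadata>
  <property name="mimetypes">text/x-c;text/x-csrc</property>
  <property name="globs">*.c</property>
  <!-- Optional extras: delimiters an editor could use for a
       "comment selection" feature (assumed usage). -->
  <property name="line-comment-start">//</property>
  <property name="block-comment-start">/*</property>
  <property name="block-comment-end">*/</property>
</metadata>
```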

```xml
<styles>
```

This element contains every association between the styles used in the
description and the defaults stored internally in GtkSourceView.
For each style there is a `<style>` element:

```xml
<style id="comment" name="Comment" map-to="def:comment"/>
```

This defines a `comment` style, which inherits its font
properties from the default style `def:comment`.
The `name` attribute is the name shown to the user (that string
could, for example, be used by a GUI tool to edit or create style schemes).

For each style used in the language definition there is a corresponding
`<style>` element; every style can be used in different
contexts, so they will share the same appearance.

```xml
<style id="string" name="String" map-to="def:string"/>
<style id="escaped-character" name="Escaped Character" map-to="def:special-char"/>
<style id="preprocessor" name="Preprocessor" map-to="def:preprocessor"/>
<style id="included-file" name="Included File" map-to="def:string"/>
<style id="char" name="Character" map-to="def:character"/>
<style id="keyword" name="Keyword" map-to="def:keyword"/>
<style id="type" name="Data Type" map-to="def:type"/>
```

Following the `<styles>` element there is the
`<definitions>` element, which contains the
description proper of the syntax:

```xml
<definitions>
```

Here we should define a main context, the one we enter at the beginning of
the file: to do so we use the `<context>` tag, with an
`id` equal to the `id` of the
`<language>` element:

```xml
<context id="c">
```

The element `<include>` contains the list of sub-contexts
for the current context: as we are in the main context we should put here
the top level contexts for the C language:

```xml
<include>
```

The first context defined is the one for single-line C style comments: they
start with a double slash `//` and end at the end of the line:

```xml
<context id="comment" style-ref="comment">
  <start>\/\/</start>
  <end>$</end>
</context>
```

The `<start>` element contains the regular expression that tells
the highlighting engine to enter the defined context; the engine stays in it
until the terminating regular expression contained in the `<end>` element is
found.

Those regular expressions are PCRE regular expressions in the form
`/regex/options` (see the documentation of PCRE for details). If
there are no options to be specified and you don't need to match the spaces at
the start and at the end of the regular expression, you can omit the slashes,
putting here only `regex`.

The possible options are:

- `i`: case insensitive;
- `x`: extended (spaces are ignored and it is possible to put comments
    starting with `#` and ending at the end of the line);
- `s`: the metacharacter `.` matches the `\n`.

You can set the default options using the `<default-regex-options>` tag
before the `<definitions>` element. To disable a group of options,
instead, you have to precede them with a hyphen (`-`).
[FIXME: add an example]

GtkSourceView also provides some extensions to the standard Perl-style
regular expressions:

- `\%[` and `\%]` are custom word boundaries which, in contrast with `\b`,
can be redefined with the `<keyword-char-class>` tag;

- `\%{id}` will include the regular expression defined in the
`<define-regex>` tag with the same id, useful if you have
common portions of regular expressions used in different contexts;

- `\%{subpattern@start}` can be used only inside the
`<end>` tag and will be substituted with the
string matched in the corresponding
sub-pattern (can be a number or a name if named sub-patterns are
used) in the preceding `<start>` element. For an example
see the implementation of here-documents in the `sh.lang`
language description distributed with GtkSourceView.
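As a rough illustration of the last two extensions, the fragment below defines a shared pattern with `<define-regex>` and a much simplified here-document context. The ids `identifier`, `function-call` and `here-doc` are hypothetical and are not part of the C definition built in this tutorial; a real file would also place these contexts inside the main context's `<include>`.

```xml
<definitions>
  <!-- Hypothetical shared pattern: a C-like identifier. -->
  <define-regex id="identifier">[A-Za-z_][A-Za-z0-9_]*</define-regex>

  <!-- \%{identifier} is replaced by the pattern defined above. -->
  <context id="function-call" style-ref="keyword">
    <match>\%{identifier}(?=\s*\()</match>
  </context>

  <!-- Simplified here-document: the end pattern reuses whatever
       sub-pattern 1 of the start pattern matched (the delimiter word).
       Note that a literal "less than" sign must be written as &lt; in XML. -->
  <context id="here-doc" style-ref="string">
    <start>&lt;&lt;\s*(\w+)</start>
    <end>^\%{1@start}$</end>
  </context>
</definitions>
```

The real `sh.lang` implementation handles more cases (quoting, leading tabs), so treat this only as a sketch of the mechanism.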

The next context is for C-style strings. They start and end with a double
quote but they can contain escaped double quotes, so we should make sure
we don't end the string prematurely:

```xml
<context id="string" end-at-line-end="true" style-ref="string">
```

The `end-at-line-end` attribute tells the engine that the current context
should be forced to terminate at the end of the line, even if the ending
regular expression is not found, and that an error should be displayed.

```xml
<start>"</start>
<end>"</end>
<include>
```

To implement the escape handling we include an `escape` context:

```xml
  <context id="escape" style-ref="escaped-character">
    <match>\\.</match>
  </context>
```

This is a simple context matching a single regular expression, contained in
the `<match>` element. This context will extend its parent, causing the
ending regular expression of the `"string"` context to not match the escaped
double quote.

```xml
</include>
</context>
```

C-style multi-line comments can span several lines and cannot be
escaped, but to make things more interesting we want to highlight every
internet address they contain:

```xml
<context id="comment-multiline" style-ref="comment">
  <start>\/\*</start>
  <end>\*\/</end>
  <include>
    <context id="net-address" style-ref="net-address" extend-parent="false">
```

In this case, the child should be terminated if the end of the parent is
found, so we use `false` in the `extend-parent` attribute.

```xml
      <match>http:\/\/[^\s]*</match>
    </context>
  </include>
</context>
```

For instance in the following comment the string `http://www.gnome.org*/`
matches the `net-address` context but it contains the end of the parent
context (`*/`). As `extend-parent` is false,
only `http://www.gnome.org` is
highlighted as an address and `*/` is correctly recognized as the end of
the comment.

```c
/* This is a comment http://www.gnome.org */
```
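The `net-address` context refers to a `net-address` style that does not appear among the `<style>` elements listed earlier. If you keep this version of the context, a matching declaration is needed in the `<styles>` section. A minimal sketch, assuming that the default styles include `def:net-address` (the `name` text is an arbitrary choice):

```xml
<!-- Assumption: def:net-address exists among the default styles. -->
<style id="net-address" name="Net Address" map-to="def:net-address"/>
```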

Character constants in C are delimited by single quotes (`'`) and can
contain escaped characters:

```xml
<context id="char" end-at-line-end="true" style-ref="char">
  <start>'</start>
  <end>'</end>
  <include>
    <context ref="escape"/>
```

The `ref` attribute is used when we want to reuse a previously defined
context. Here we reuse the `escape` context defined in the `string`
context, without repeating its definition.

```xml
  </include>
</context>
```

Using `ref` it is also possible to refer to contexts defined in other
languages, preceding the id of the context with the id of the containing
language, separating them with a colon:

```xml
<context ref="def:decimal"/>
<context ref="def:float"/>
```

The definitions for decimal and float constants are in an external file,
with id `def`, which is not associated with any language but contains
reusable contexts that every language definition can import.

The `def` language file contains an `in-comment` context that can contain
addresses and tags such as FIXME and TODO, so we can write a new version of
our `comment-multiline` context that uses the definitions from `def.lang`.

```xml
<context id="comment-multiline" style-ref="comment">
  <start>\/\*</start>
  <end>\*\/</end>
  <include>
    <context ref="def:in-comment"/>
  </include>
</context>
```

Keywords can be grouped in a context using a list of `<keyword>`
elements:

```xml
<context id="keywords" style-ref="keyword">
  <keyword>if</keyword>
  <keyword>else</keyword>
  <keyword>for</keyword>
  <keyword>while</keyword>
  <keyword>return</keyword>
  <keyword>break</keyword>
  <keyword>switch</keyword>
  <keyword>case</keyword>
  <keyword>default</keyword>
  <keyword>do</keyword>
  <keyword>continue</keyword>
  <keyword>goto</keyword>
  <keyword>sizeof</keyword>
</context>
```

Keywords with different meanings can be grouped in different contexts, making
it possible to highlight them differently:

```xml
<context id="types" style-ref="type">
  <keyword>char</keyword>
  <keyword>const</keyword>
  <keyword>double</keyword>
  <keyword>enum</keyword>
  <keyword>float</keyword>
  <keyword>int</keyword>
  <keyword>long</keyword>
  <keyword>short</keyword>
  <keyword>signed</keyword>
  <keyword>static</keyword>
  <keyword>struct</keyword>
  <keyword>typedef</keyword>
  <keyword>union</keyword>
  <keyword>unsigned</keyword>
  <keyword>void</keyword>
</context>
```

You can also set a prefix (or a suffix) common to every keyword using the
`<prefix>` and `<suffix>` tags:

```xml
<context id="preprocessor" style-ref="preprocessor">
  <prefix>^#</prefix>
```

If not specified, `<prefix>` and `<suffix>`
are set to, respectively, `\%[` and
`\%]`.

```xml
  <keyword>define</keyword>
  <keyword>undef</keyword>
```

Keep in mind that every keyword is a regular expression:

```xml
  <keyword>if(n?def)?</keyword>
  <keyword>else</keyword>
  <keyword>elif</keyword>
  <keyword>endif</keyword>
</context>
```
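An explicit `<prefix>` or `<suffix>` replaces the corresponding default boundary, so a custom prefix that should still start at a word boundary has to include `\%[` itself. The sketch below uses this to match a family of type names that share a common stem; the context id `glib-int-types` and the choice of names (`gint8`, `guint8`, ..., `guint64`) are illustrative assumptions, not part of the definition built in this tutorial.

```xml
<!-- Hypothetical context: matches gint8, guint8, gint16, ..., guint64. -->
<context id="glib-int-types" style-ref="type">
  <!-- Word boundary, then the common stem "gint" or "guint". -->
  <prefix>\%[gu?int</prefix>
  <!-- Same as the default suffix, written out for clarity. -->
  <suffix>\%]</suffix>
  <keyword>8</keyword>
  <keyword>16</keyword>
  <keyword>32</keyword>
  <keyword>64</keyword>
</context>
```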

In C, it is common practice to use `#if 0` to express multi-line,
nestable comments. To make things easier for the user, we want to highlight
these pseudo-comments as comments:

```xml
<context id="if0-comment" style-ref="comment">
  <start>^#if 0\b</start>
  <end>^#(endif|else|elif)\b</end>
  <include>
```

Because `#if 0` comments nest, the commented-out region can contain other
`#if`s with their corresponding `#endif`s, and the comment must not end at
the wrong `#endif`. To handle this we use a nested context that extends the
parent across every nested `#if`/`#endif` pair:

```xml
  <context id="if-in-if0">
    <start>^#if(n?def)?\b</start>
    <end>^#endif\b</end>
    <include>
```

Nested contexts can be recursive:

```xml
      <context ref="if-in-if0"/>
    </include>
  </context>
  </include>
</context>
```

Contexts defined earlier have higher priority, so if `if0-comment` is defined
after the `preprocessor` context, whose `if` keyword already matches `#if`,
it will never be matched. To make things work we should move it before the
`preprocessor` context, thus giving `if0-comment` a higher priority.

For the `#include` preprocessor directive it can be useful to highlight
the included file differently:

```xml
<context id="include" style-ref="preprocessor">
  <match>^#include (".*"|&lt;.*&gt;)</match>
  <include>
```

To do this we use grouping sub-patterns in the regular expression,
associating them with a context with the `sub-pattern` attribute:

```xml
    <context id="included-file" sub-pattern="1"
             style-ref="included-file"/>
```

In the `sub-pattern` attribute we could use:

- 0: the whole regular expression;
- 1: the first sub-pattern (a sub-expression enclosed in parentheses);
- 2: the second;
- ...
- `name`: a named sub-pattern with name `name` (see the PCRE documentation).

We can also use a `where` attribute with value `start` or `end` to
specify which regular expression the sub-pattern belongs to when the parent
context has both a `<start>` and an `<end>` element.

```xml
  </include>
</context>
```
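For a container context, the `where` attribute can be sketched as follows. The `wide-string` context and its child are hypothetical additions (they are not in the full definition at the end of this tutorial): the `L` prefix of a wide string literal is captured as sub-pattern 1 of `<start>` and given a different style from the rest of the string.

```xml
<context id="wide-string" end-at-line-end="true" style-ref="string">
  <start>(L)"</start>
  <end>"</end>
  <include>
    <!-- Sub-pattern 1 of the start pattern: the L prefix. -->
    <context id="wide-string-prefix" sub-pattern="1" where="start"
             style-ref="type"/>
    <!-- Reuse the escape context defined inside the string context. -->
    <context ref="escape"/>
  </include>
</context>
```

Reusing the `type` style for the prefix is only a convenient choice for the example; a dedicated style could be declared in `<styles>` instead.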

Having defined a good subset of the C syntax we close every tag still open:

```xml
</include>
</context>
</definitions>
</language>
```

# The full language definition

This is the full language definition for the subset of C considered
in this tutorial:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<language id="c" name="C" version="2.0" _section="Source">
  <metadata>
    <property name="mimetypes">text/x-c;text/x-csrc</property>
    <property name="globs">*.c</property>
  </metadata>
  <styles>
    <style id="comment" name="Comment" map-to="def:comment"/>
    <style id="string" name="String" map-to="def:string"/>
    <style id="escaped-character" name="Escaped Character" map-to="def:special-char"/>
    <style id="preprocessor" name="Preprocessor" map-to="def:preprocessor"/>
    <style id="included-file" name="Included File" map-to="def:string"/>
    <style id="char" name="Character" map-to="def:character"/>
    <style id="keyword" name="Keyword" map-to="def:keyword"/>
    <style id="type" name="Data Type" map-to="def:type"/>
  </styles>
  <definitions>
    <context id="c">
      <include>

        <context id="comment" style-ref="comment">
          <start>\/\/</start>
          <end>$</end>
        </context>

        <context id="string" end-at-line-end="true" style-ref="string">
          <start>"</start>
          <end>"</end>
          <include>
            <context id="escape" style-ref="escaped-character">
              <match>\\.</match>
            </context>
          </include>
        </context>

        <context id="comment-multiline" style-ref="comment">
          <start>\/\*</start>
          <end>\*\/</end>
          <include>
            <context ref="def:in-comment"/>
          </include>
        </context>

        <context id="char" end-at-line-end="true" style-ref="char">
          <start>'</start>
          <end>'</end>
          <include>
            <context ref="escape"/>
          </include>
        </context>

        <context ref="def:decimal"/>
        <context ref="def:float"/>

        <context id="keywords" style-ref="keyword">
          <keyword>if</keyword>
          <keyword>else</keyword>
          <keyword>for</keyword>
          <keyword>while</keyword>
          <keyword>return</keyword>
          <keyword>break</keyword>
          <keyword>switch</keyword>
          <keyword>case</keyword>
          <keyword>default</keyword>
          <keyword>do</keyword>
          <keyword>continue</keyword>
          <keyword>goto</keyword>
          <keyword>sizeof</keyword>
        </context>

        <context id="types" style-ref="type">
          <keyword>char</keyword>
          <keyword>const</keyword>
          <keyword>double</keyword>
          <keyword>enum</keyword>
          <keyword>float</keyword>
          <keyword>int</keyword>
          <keyword>long</keyword>
          <keyword>short</keyword>
          <keyword>signed</keyword>
          <keyword>static</keyword>
          <keyword>struct</keyword>
          <keyword>typedef</keyword>
          <keyword>union</keyword>
          <keyword>unsigned</keyword>
          <keyword>void</keyword>
        </context>

        <context id="if0-comment" style-ref="comment">
          <start>^#if 0\b</start>
          <end>^#(endif|else|elif)\b</end>
          <include>
            <context id="if-in-if0">
              <start>^#if(n?def)?\b</start>
              <end>^#endif\b</end>
              <include>
                <context ref="if-in-if0"/>
              </include>
            </context>
          </include>
        </context>

        <context id="preprocessor" style-ref="preprocessor">
          <prefix>^#</prefix>
          <keyword>define</keyword>
          <keyword>undef</keyword>
          <keyword>if(n?def)?</keyword>
          <keyword>else</keyword>
          <keyword>elif</keyword>
          <keyword>endif</keyword>
        </context>

        <context id="include" style-ref="preprocessor">
          <match>^#include (".*"|&lt;.*&gt;)</match>
          <include>
            <context id="included-file"
                     sub-pattern="1"
                     style-ref="included-file"/>
          </include>
        </context>

      </include>
    </context>
  </definitions>
</language>
```