# BibTeX tutorial
A tutorial for parsing a well-known format for bibliographic references.
---
The word [BibTeX](http://www.BibTeX.org/) stands for a tool and a file format
which are used to describe and process lists of references, mostly in
conjunction with LaTeX documents.
An example of a BibTeX entry is given below.
```BibTeX
@article{DejanovicADomain-SpecificLanguageforDefiningStaticStructureofDatabaseApplications2010,
author = "Igor Dejanovi\'{c} and Gordana Milosavljevi\'{c} and Branko Peri\v{s}i\'{c} and Maja Tumbas",
title = "A {D}omain-Specific Language for Defining Static Structure of Database Applications",
journal = "Computer Science and Information Systems",
year = "2010",
volume = "7",
pages = "409--440",
number = "3",
month = "June",
issn = "1820-0214",
doi = "10.2298/CSIS090203002D",
url = "http://www.comsis.org/ComSIS/Vol7No3/RegularPapers/paper2.htm",
type = "M23"
}
```
Each BibTeX entry starts with `@` followed by a keyword denoting the entry type
(`article` in this example). After the entry type comes the body of the
reference, enclosed in curly braces. The body consists of elements separated by commas.
The first element is the key of the entry. It should be unique.
The remaining elements are fields of the form `<field_name> = <field_value>`.
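
To make this layout concrete, the following rough sketch takes a tiny, made-up entry apart using only Python's `re` module. It only illustrates the structure and is not the approach this tutorial takes: it assumes a single entry with no nested braces or escaped quotes, and the sample entry and all variable names are invented for the example.

```python
import re

# A made-up, single-line entry used only for illustration.
entry = '@article{Key2010, title = "A Title", year = {2010}}'

# '@' + entry type, then the body inside the outermost curly braces.
match = re.fullmatch(r'\s*@(\w+)\s*\{(.*)\}\s*', entry, re.DOTALL)
entry_type, body = match.group(1), match.group(2)

# The first comma-separated element is the key; the rest are fields.
key, _, rest = body.partition(',')
key = key.strip()

# Each field is <field_name> = <field_value>, quoted or braced.
field_re = re.compile(r'([-\w]+)\s*=\s*(?:"([^"]*)"|\{([^{}]*)\})')
fields = {name: quoted or braced
          for name, quoted, braced in field_re.findall(rest)}

print(entry_type, key, fields)
```

A sketch like this breaks down quickly on real BibTeX files, which is the motivation for writing a proper grammar.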
# The grammar
Let's start with the grammar.
Create the file `bibtex.py` and import `arpeggio`.
```python
from arpeggio import *
from arpeggio import RegExMatch as _
```
Then create grammar rules:
- A BibTeX file consists of zero or more BibTeX entries.
```python
def bibfile(): return ZeroOrMore(bibentry), EOF
```
- Now we define the structure of a BibTeX entry.
```python
def bibentry(): return bibtype, "{", bibkey, ",", field, ZeroOrMore(",", field), "}"
```
- Each field is given as a field name, an equals sign (`=`), and a field value.
```python
def field(): return fieldname, "=", fieldvalue
```
- A field value can be specified inside braces or quotes.
```python
def fieldvalue(): return [fieldvalue_braces, fieldvalue_quotes]
def fieldvalue_braces(): return "{", fieldvalue_braced_content, "}"
def fieldvalue_quotes(): return '"', fieldvalue_quoted_content, '"'
```
- Now, let's define the field name, the BibTeX type, and the key. We use regular
expression matches for this (the `RegExMatch` class).
```python
def fieldname(): return _(r'[-\w]+')
def bibtype(): return _(r'@\w+')
def bibkey(): return _(r'[^\s,]+')
```
A field name is one or more hyphens or word characters (letters, digits, or underscores).
A BibTeX entry type is the `@` character followed by one or more word characters.
A BibTeX key is everything up to the first whitespace character or comma.
- A field value can be quoted or braced. Let's match the content.
```python
def fieldvalue_quoted_content(): return _(r'((\\")|[^"])*')
def fieldvalue_braced_content(): return Combine(
    ZeroOrMore(Optional(And("{"), fieldvalue_inner), fieldvalue_part))
def fieldvalue_part(): return _(r'((\\")|[^{}])+')
def fieldvalue_inner(): return "{", fieldvalue_braced_content, "}"
```
!!! note "Combine decorator"
    We use the `Combine` decorator to specify braced content. This decorator
    produces a [Terminal](../parse_trees.md#terminal-nodes) node in [the parse
    tree](../parse_trees.md).
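
To get a feel for what the regular expressions in the rules above accept, they can be tried out directly with Python's `re` module. The sketch below is independent of Arpeggio: the sample strings and variable names are made up, and matching at an explicit position is only an approximation of how a parser would apply a pattern at its current place in the input.

```python
import re

# The same patterns as in the grammar rules above.
FIELDNAME = re.compile(r'[-\w]+')
BIBTYPE = re.compile(r'@\w+')
BIBKEY = re.compile(r'[^\s,]+')
QUOTED_CONTENT = re.compile(r'((\\")|[^"])*')

# A made-up input fragment.
text = '@article{Dejanovic2010, author = "Igor"}'

# Match the entry type at the start of the input.
m = BIBTYPE.match(text, 0)
bibtype = m.group()

# Skip the opening brace, then match the key at the new position.
m = BIBKEY.match(text, m.end() + 1)
bibkey = m.group()

# Field names may contain hyphens and underscores, but not spaces.
names_ok = [bool(FIELDNAME.fullmatch(n))
            for n in ('author', 'date-added', 'e mail')]

# Quoted content stops at the first quote that is not backslash-escaped.
quoted = QUOTED_CONTENT.match(r'say \"hi\" now" rest').group()

print(bibtype, bibkey, names_ok, quoted)
```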
# The parser
To instantiate the parser we use Arpeggio's `ParserPython` class.
```python
parser = ParserPython(bibfile)
```
Now, we have our parser. Let's parse some input:
- First load some BibTeX data from a file.
```python
import codecs
import os

# Load the example file that sits next to this script.
file_name = os.path.join(os.path.dirname(__file__), 'bibtex_example.bib')
with codecs.open(file_name, "r", encoding="utf-8") as bibtexfile:
    bibtexfile_content = bibtexfile.read()
```
We use the `codecs` module to load the file with `utf-8` encoding.
`bibtexfile_content` is now a string with the content of the file.
- Parse the input string.
```python
parse_tree = parser.parse(bibtexfile_content)
```
The parse tree is produced.
# Extracting data from the parse tree
Let's suppose that we want our BibTeX file to be transformed to a list of
Python dictionaries where each field is keyed by its name and the value is
the field value cleaned up from the BibTeX cruft.
Like this:
```python
{ 'author': 'Igor Dejanović and Gordana Milosavljević and Branko Perišić and Maja Tumbas',
  'bibkey': 'DejanovicADomain-SpecificLanguageforDefiningStaticStructureofDatabaseApplications2010',
  'bibtype': '@article',
  'doi': '10.2298/CSIS090203002D',
  'issn': '1820-0214',
  'journal': 'Computer Science and Information Systems',
  'month': 'June',
  'number': '3',
  'pages': '409--440',
  'title': 'A Domain-Specific Language for Defining Static Structure of Database Applications',
  'type': 'M23',
  'url': 'http://www.comsis.org/ComSIS/Vol7No3/RegularPapers/paper2.htm',
  'volume': '7',
  'year': '2010'}
```
The key is stored under a dict key `bibkey` while the entry type is stored
under the dict key `bibtype`.
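
The cleanup itself is not shown in this tutorial. As a rough, stdlib-only sketch of what it could look like, the hypothetical helper `clean_value` below turns LaTeX accent commands such as `\'{c}` into Unicode letters, drops the remaining braces, and collapses whitespace. The accent table covers only a few commands and is an assumption made for this example; the parser in the repository may handle this differently.

```python
import re
import unicodedata

# Assumed, incomplete mapping from LaTeX accent commands to
# Unicode combining marks (acute, caron, diaeresis, grave).
ACCENTS = {"'": "\u0301", "v": "\u030c", '"': "\u0308", "`": "\u0300"}

# Matches forms like \'{c} or \v{s}: backslash, accent, braced letter.
ACCENT_RE = re.compile(
    r"\\([" + "".join(re.escape(k) for k in ACCENTS) + r"])\{(\w)\}")


def clean_value(raw):
    """Hypothetical helper: strip BibTeX/LaTeX markup from a field value."""
    # Replace each accent command with base letter + combining mark.
    value = ACCENT_RE.sub(lambda m: m.group(2) + ACCENTS[m.group(1)], raw)
    # Compose base letter and mark into a single character where possible.
    value = unicodedata.normalize("NFC", value)
    # Drop protective braces such as {D} and collapse whitespace.
    value = re.sub(r"[{}]", "", value)
    return re.sub(r"\s+", " ", value).strip()


print(clean_value(r"Branko Peri\v{s}i\'{c}"))
```

Such a helper could be applied to each field value while the data is extracted from the parse tree.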
After calling the `parse` method on the parser, our textual data is parsed
and stored in [the parse tree](../parse_trees.md). We could navigate the tree
to extract the data and build the Python list of dictionaries, but it is much
easier to use [Arpeggio's visitor support](../semantics.md).
In this case we create a `BibTeXVisitor` class with `visit_*` methods for
each grammar rule whose parse tree node we want to process.
```python
class BibTeXVisitor(PTNodeVisitor):

    def visit_bibfile(self, node, children):
        """
        Just returns list of child nodes (bibentries).
        """
        # Return only dict nodes
        return [x for x in children if type(x) is dict]

    def visit_bibentry(self, node, children):
        """
        Constructs a map where key is bibentry field name.
        Key is returned under 'bibkey' key. Type is returned under 'bibtype'.
        """
        bib_entry_map = {
            'bibtype': children[0],
            'bibkey': children[1]
        }
        for field in children[2:]:
            bib_entry_map[field[0]] = field[1]
        return bib_entry_map

    def visit_field(self, node, children):
        """
        Constructs a tuple (fieldname, fieldvalue).
        """
        field = (children[0], children[1])
        return field
```
Now, apply the visitor to the parse tree.
```python
ast = visit_parse_tree(parse_tree, BibTeXVisitor())
```
`ast` is now a Python list of dictionaries in the desired format from above.
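
Once the data is in this shape, ordinary Python is enough to work with it. The sketch below uses a small hand-written stand-in for `ast` (with a shortened, made-up key) to index entries by `bibkey` and to split the `author` field; the helper name `author_list` is hypothetical.

```python
# A hand-written stand-in for the `ast` list produced by the visitor.
ast = [
    {'bibtype': '@article',
     'bibkey': 'Dejanovic2010',
     'author': 'Igor Dejanović and Maja Tumbas',
     'title': 'A Domain-Specific Language',
     'year': '2010'},
]

# Index the entries by their (unique) key for direct lookup.
by_key = {entry['bibkey']: entry for entry in ast}


def author_list(entry):
    """Split the BibTeX 'author' field on the ' and ' separator."""
    parts = entry.get('author', '').split(' and ')
    return [name.strip() for name in parts if name.strip()]


entry = by_key['Dejanovic2010']
print(author_list(entry), entry['year'])
```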
The full source code for this example can be found in [the source
code repository](https://github.com/textX/Arpeggio/tree/master/examples/bibtex).
!!! note
    The example in the repository is actually a fully working parser with
    support for BibTeX comments and comment entries. This is out of scope
    for this tutorial. You can find the details in the source code.