1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>SWI-Prolog RDF parser</TITLE>
</HEAD>
<BODY BGCOLOR="white">
<BLOCKQUOTE>
<BLOCKQUOTE>
<BLOCKQUOTE>
<BLOCKQUOTE>
<CENTER>
<H1>SWI-Prolog RDF parser</H1>
</CENTER>
<HR>
<CENTER>
<I>Jan Wielemaker <BR>
SWI, <BR>
University of Amsterdam <BR>
The Netherlands <BR>
E-mail: <A HREF="mailto:jan@swi.psy.uva.nl">jan@swi.psy.uva.nl</A></I>
</CENTER>
<HR>
</BLOCKQUOTE>
</BLOCKQUOTE>
</BLOCKQUOTE>
</BLOCKQUOTE>
<CENTER><H3>Abstract</H3></Center>
<TABLE WIDTH="90%" ALIGN=center BORDER=2 BGCOLOR="#f0f0f0"><TR><TD>
RDF (<B>R</B>esource <B>D</B>escription <B>F</B>ormat) is a W3C standard
for expressing meta-data about web-resources. It has three
representations providing the same semantics. RDF documents are normally
transferred as XML documents using the RDF-XML syntax. This format is
very unsuitable for processing. The parser defined here converts an
RDF-XML document into the triple notation.
</TABLE>
<H1><A NAME="document-contents">Table of Contents</A></H1>
<UL>
<LI><A HREF="#sec:1"><B>1 Introduction</B></A>
<LI><A HREF="#sec:2"><B>2 Parsing RDF in Prolog</B></A>
<LI><A HREF="#sec:3"><B>3 Predicates</B></A>
<UL>
<LI><A HREF="#sec:3.1">3.1 Name spaces</A>
<LI><A HREF="#sec:3.2">3.2 Low-level access</A>
</UL>
<LI><A HREF="#sec:4"><B>4 Testing the RDF translator</B></A>
<LI><A HREF="#sec:5"><B>5 Metrics</B></A>
<LI><A HREF="#sec:6"><B>6 Installation</B></A>
<UL>
<LI><A HREF="#sec:6.1">6.1 Unix systems</A>
<LI><A HREF="#sec:6.2">6.2 Windows</A>
</UL>
</UL>
<P>
<H2><A NAME="sec:1">1 Introduction</A></H2>
<P>RDF is a promising standard for representing meta-data about
documents on the web as well as exchanging ontologies. RDF is often
associated with `semantics on the web'. It consists of a formal
data-model defined in terms of <EM>triples</EM>. In addition, a <EM>graph</EM>
model is defined for visualisation and an XML application is defined for
exchange.
<P>`Semantics on the web' is often associated with the Prolog
programming language. It is assumed that Prolog is a suitable vehicle to
reason with the data expressed in RDF models. Most of the related
web-infra structure (e.g. XML parsers, DOM implementations) are defined
in Java, Perl, C or C++.
<P>Various routes are available to the Prolog user. Low-level XML
parsing is due to its nature best done in C or C++. These languages
produce fast code. As XML/SGML are the basis of most of the other
web-related formats we will benefit most here. XML and SGML being very
stable specifications make this even more attractive.
<P>But what about RDF? RDF-XML is defined in XML, and provided with a
Prolog term representing the XML document processing it according to the
RDF syntax is quick and easy in Prolog. The alternative, getting yet
another library and language attached to the system, is getting less
attractive.
<H2><A NAME="sec:2">2 Parsing RDF in Prolog</A></H2>
<P>To demonstrate this, we realised an RDF compiler in Prolog on top of
the sgml2pl package (providing a name-space sensitive XML parser). The
transformation is realised in two passes.
<P>The first pass rewrites the XML term into a Prolog term conveying the
same information in a more friendly manner. This transformation is
defined in a high-level pattern matching language defined on top of
Prolog with properties similar to DCG (Definite Clause Grammar).
<P>The source of this translation is very close to the BNF notation used
by the <A HREF="http://www.w3.org/TR/REC-rdf-syntax/">specification</A>,
so correctness is `obvious'. Below is a part of the definition of RDF
containers. Note that XML elements are represented using a term of the
format:
<BLOCKQUOTE>
<CODE>element(Name, [AttrName = Value...], [Content ...])</CODE>
</BLOCKQUOTE>
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
memberElt(LI) ::=
\referencedItem(LI).
memberElt(LI) ::=
\inlineItem(LI).
referencedItem(LI) ::=
element(\rdf(li),
[ \resourceAttr(LI) ],
[]).
inlineItem(literal(LI)) ::=
element(\rdf(li),
[ \parseLiteral ],
LI).
inlineItem(description(description, _, _, Properties)) ::=
element(\rdf(li),
[ \parseResource ],
\propertyElts(Properties)).
inlineItem(LI) ::=
element(\rdf(li),
[],
[\rdf_object(LI)]), !. % inlined object
inlineItem(literal(LI)) ::=
element(\rdf(li),
[],
[LI]). % string value
</PRE>
</TABLE>
<P>Expression in the rule that are prefixed by the <CODE>\</CODE>
operator acts as invocation of another rule-set. The body-term is
converted into a term where all rule-references are replaced by
variables. The resulting term is matched and translation of the
arguments is achieved by calling the appropriate rule. Below is the
Prolog code for the
<B>referencedItem</B> rule:
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
referencedItem(A, element(B, [C], [])) :-
rdf(li, B),
resourceAttr(A, C).
</PRE>
</TABLE>
<P>Additional code can be added using a notation close to the Prolog DCG
notation. Here is the rule for a description, producing properties both
using <B>propAttrs</B> and <B>propertyElts</B>.
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
description(description, About, BagID, Properties) ::=
element(\rdf('Description'),
\attrs([ \?idAboutAttr(About),
\?bagIdAttr(BagID)
| \propAttrs(PropAttrs)
]),
\propertyElts(PropElts)),
{ !, append(PropAttrs, PropElts, Properties)
}.
</PRE>
</TABLE>
<H2><A NAME="sec:3">3 Predicates</A></H2>
<P>The parser is designed to operate on various environments and
therefore provides interfaces at various levels. First we describe the
top level defined in <CODE>library(rdf)</CODE>, simply parsing a PDF-XML
file into a list of triples. Please note these are <EM>not</EM> asserted
into the database because it is not necessarily the final format the
user wishes to reason with and it is not clean how the user wants to
deal with multiple RDF documents. Some options are using global URI's in
one pool, in Prolog modules or using an additional argument.
<DL>
<DT><A NAME="load_rdf/2"><STRONG>load_rdf</STRONG>(<VAR>+File, -Triples</VAR>)</A><DD>
Same as <CODE>load_rdf(File, Triples,)</CODE>.
<DT><A NAME="load_rdf/3"><STRONG>load_rdf</STRONG>(<VAR>+File, -Triples,
+Options</VAR>)</A><DD>
Read the RDF-XML file <VAR>File</VAR> and return a list of <VAR>Triples</VAR>.
<VAR>Options</VAR> defines additional processing options. Currently
defined options are:
<DL>
<DT><STRONG>base_uri</STRONG>(<VAR>BaseURI</VAR>)<DD>
If provided local identifiers and identifier-references are globalised
using this URI. If omited or the atom <CODE>[]</CODE>, local identifiers
are not tagged.
</DL>
<P>The <VAR>Triples</VAR> list is a list of <CODE>rdf(Subject,
Predicate, Object)</CODE> triples. <VAR>Subject</VAR> is either a plain
resource (an atom), or one of the terms <CODE>each(URI)</CODE> or <CODE>prefix(URI)</CODE>
with the obvious meaning. <VAR>Predicate</VAR> is either a plain atom
for explicitely non-qualified names or a term
<VAR>NameSpace</VAR><B>:</B><VAR>Name</VAR>. If <VAR>NameSpace</VAR> is
the defined RDF name space it is returned as the atom <CODE>rdf</CODE>.
Finally, <VAR>Object</VAR> is a URI, a <VAR>Predicate</VAR> or a term of
the format <CODE>literal(Value)</CODE> for literal values. <VAR>Value</VAR>
is either a plain atom or a parsed XML term (list of atoms and
elements).
</DL>
<H3><A NAME="sec:3.1">3.1 Name spaces</A></H3>
<P>XML name spaces are identified using a URI. Unfortunately various
URI's are in common use to refer to RDF. The <CODE>rdf_parser.pl</CODE>
module therefore defined the namespace as a <A NAME="idx:multifile1:1"></A><B>multifile/1</B>
predicate, that can be extended by the user. For example, to parse the <A HREF="http://www.mozilla.org/rdf/doc/inference.html">Netscape
OpenDirectory</A>
<CODE>structure.rdf</CODE> file, the following declarations are used:
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
:- multifile
rdf_parser:rdf_name_space/1.
rdf_parser:rdf_name_space('http://www.w3.org/TR/RDF/').
rdf_parser:rdf_name_space('http://directory.mozilla.org/rdf').
rdf_parser:rdf_name_space('http://dmoz.org/rdf').
</PRE>
</TABLE>
<P>The initial definition of this predicate is given below.
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
rdf_name_space('http://www.w3.org/1999/02/22-rdf-syntax-ns#').
rdf_name_space('http://www.w3.org/TR/REC-rdf-syntax').
</PRE>
</TABLE>
<H3><A NAME="sec:3.2">3.2 Low-level access</A></H3>
<P>The above defined <A NAME="idx:loadrdf23:2"></A><A HREF="#load_rdf/2">load_rdf/[2,3]</A>
is not always suitable. For example, it cannot deal with documents where
the RDF statement is embedded an XML document. It also cannot deal with
really big documents (e.g. the Netscape OpenDirectory project), without
huge amounts of memory.
<P>For really big documents, the <B>sgml2pl</B> parser can be programmed
to handle the content of a specific element (i.e. <TT><rdf:RDF></TT>)
element-by-element. The parsing primitives defined in this section can
be used to process these one-by-one.
<DL>
<DT><A NAME="xml_to_rdf/3"><STRONG>xml_to_rdf</STRONG>(<VAR>+XML,
+BaseURI, -Triples</VAR>)</A><DD>
Process an XML term produced by <A NAME="idx:loadstructure3:3"></A><B>load_structure/3</B>
using the
<CODE>dialect(xmlns)</CODE> output option. <VAR>XML</VAR> is either a
complete <TT><rdf:RDF></TT> element, a list of RDF-objects
(container or description) or a single description of container.
<DT><A NAME="process_rdf/3"><STRONG>process_rdf</STRONG>(<VAR>+File,
+BaseURI, :OnTriples</VAR>)</A><DD>
Exploits the call-back interface of <B>sgml2pl</B>, calling
<VAR>OnTriples</VAR> with the list of triples resulting from a single
top level RDF object for each RDF element in the file. This predicate
can be used to process arbitrary large RDF files as the file is
processed object-by-object. The example below simply asserts all triples
into the database:
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
assert_list([]).
assert_list([H|T]) :-
assert(H),
assert_list(T).
?- process_rdf('structure,rdf', [], assert_list).
</PRE>
</TABLE>
<P>
</DL>
<H2><A NAME="sec:4">4 Testing the RDF translator</A></H2>
<P>A test-suite and driver program are provided by <CODE>rdf_test.pl</CODE>
in the source directory. To run these tests, load this file into Prolog
in the distribution directory. The test files are in the directory
<CODE>suite</CODE> and the proper output in <CODE>suite/ok</CODE>.
Predicates provided by <CODE>rdf_test.pl</CODE>:
<DL>
<DT><A NAME="suite/1"><STRONG>suite</STRONG>(<VAR>+N</VAR>)</A><DD>
Run test <VAR>N</VAR> using the file <CODE>suite/tN.rdf</CODE> and
display the RDF source, the intermediate Prolog representation and the
resulting triples.
<DT><A NAME="passed/1"><STRONG>passed</STRONG>(<VAR>+N</VAR>)</A><DD>
Process <CODE>suite/tN.rdf</CODE> and store the resulting triples in
<CODE>suite/ok/tN.pl</CODE> for later validation by <A NAME="idx:test0:4"></A><A HREF="#test/0">test/0</A>.
<DT><A NAME="test/0"><STRONG>test</STRONG></A><DD>
Run all tests and classify the result.
</DL>
<H2><A NAME="sec:5">5 Metrics</A></H2>
<P>It took three days to write and one to document the Prolog RDF
parser. A significant part of the time was spent understanding the RDF
specification.
<P>The size of the source (including comments) is given in the table
below.
<P>
<CENTER>
<TABLE BORDER=2 FRAME=box RULES=groups>
<TR VALIGN=top><TD ALIGN=right><B>lines</B></TD><TD ALIGN=right><B>words</B></TD><TD ALIGN=right><B>bytes</B></TD><TD><B>file</B></TD><TD><B>function </B></TD></TR>
<TBODY>
<TR VALIGN=top><TD ALIGN=right>109</TD><TD ALIGN=right>255</TD><TD ALIGN=right>2663</TD><TD>rdf.pl</TD><TD>Driver
program </TD></TR>
<TR VALIGN=top><TD ALIGN=right>312</TD><TD ALIGN=right>649</TD><TD ALIGN=right>6416</TD><TD>rdf_parser.pl</TD><TD>1-st
phase parser </TD></TR>
<TR VALIGN=top><TD ALIGN=right>246</TD><TD ALIGN=right>752</TD><TD ALIGN=right>5852</TD><TD>rdf_triple.pl</TD><TD>2-nd
phase parser </TD></TR>
<TR VALIGN=top><TD ALIGN=right>126</TD><TD ALIGN=right>339</TD><TD ALIGN=right>2596</TD><TD>rewrite.pl</TD><TD>rule-compiler </TD></TR>
<TBODY>
<TR VALIGN=top><TD ALIGN=right>793</TD><TD ALIGN=right>1995</TD><TD ALIGN=right>17527</TD><TD>total</TD></TR>
</TABLE>
</CENTER>
<P>We also compared the performance using an RDF-Schema file generated
by
<A HREF="http://www.smi.stanford.edu/projects/protege/">Protege-2000</A>
and interpreted as RDF. This file contains 162 descriptions in 50
Kbytes, resulting in 599 triples. Environment: Intel Pentium-II/450 with
384 Mbytes memory running SuSE Linux 6.3.
<P>The parser described here requires 0.15 seconds excluding 0.13
seconds Prolog startup time to process this file. The <A HREF="http://www.pro-solutions.com/rdfdemo/">Pro
Solutions</A> parser (written in Perl) requires 1.5 seconds exluding
0.25 seconds startup time.
<H2><A NAME="sec:6">6 Installation</A></H2>
<H3><A NAME="sec:6.1">6.1 Unix systems</A></H3>
<P>Installation on Unix system uses the commonly found <EM>configure</EM>,
<EM>make</EM> and <EM>make install</EM> sequence. SWI-Prolog should be
installed before building this package. If SWI-Prolog is not installed
as <B>pl</B>, the environment variable <CODE>PL</CODE> must be set to
the name of the SWI-Prolog executable. Installation is now accomplished
using:
<P><P><TABLE WIDTH="90%" ALIGN=center BORDER=6 BGCOLOR="#e0e0e0"><TR><TD NOWRAP>
<PRE>
% ./configure
% make
% make install
</PRE>
</TABLE>
<P>This installs the Prolog library files in <CODE>$PLBASE/library</CODE>,
where
<CODE>$PLBASE</CODE> refers to the SWI-Prolog `home-directory'.
<H3><A NAME="sec:6.2">6.2 Windows</A></H3>
<P>Run the file <CODE>setup.pl</CODE> by double clicking it. This will
install the required files into the SWI-Prolog directory and update the
library directory.</BODY></HTML>
|