File: query.html

package info (click to toggle)
semweb 1.05%2Bdfsg-1
links: PTS, VCS
area: main
in suites: lenny
size: 3,988 kB
ctags: 2,832
sloc: cs: 14,483; makefile: 180; sh: 107; perl: 20; ansic: 7
file content (380 lines) | stat: -rw-r--r-- 17,806 bytes
parent folder | download | duplicates (3)
<html>
	<head>
		<title>SemWeb: Querying</title>
		<link rel="stylesheet" type="text/css" href="stylesheet.css"/>
	</head>
	
	<body>

<p><a href="index.html">SemWeb Documentation</a></p>
<h1>Querying: Simple Entailment and SPARQL</h1>

<p>This page provides an example for running queries against RDF data
using SemWeb.  Three query methods are supported.  The first is the
GraphMatch engine which does simple entailment, matching a simple graph
with no disjunctions or optional edges against target data.  The second method
runs SPARQL queries against any data source supported by SemWeb using
a SPARQL query engine. The third method passes SPARQL queries to a remote
SPARQL endpoint over HTTP.</p>

<h2>Getting the data</h2>

<p>We'll use an RDF description of the people in the U.S. Congress
for this example.  Download the data files at
<a href="http://www.govtrack.us/data/rdf/people.rdf.gz">http://www.govtrack.us/data/rdf/people.rdf.gz</a>,
<a href="http://www.govtrack.us/data/rdf/bills.108.rdf.gz">http://www.govtrack.us/data/rdf/bills.108.rdf.gz</a>, and
<a href="http://www.govtrack.us/data/rdf/bills.108.cosponsors.rdf.gz">http://www.govtrack.us/data/rdf/bills.108.cosponsors.rdf.gz</a>
and un-gzip them (on Windows use WinZip).</p>

<p>To simply some things, we'll put the contents of these three files into a single Notation3 file using the following command.  (You may need to adjust the path to <tt>rdfstorage.exe</tt>.  It should be in SemWeb's <tt>bin</tt> directory.)</p>

<pre class="code">
$ mono rdfstorage.exe --out n3:congress.n3 people.rdf bills.108.rdf bills.108.cosponsors.rdf
</pre>

<p><tt>rdfstorage.exe</tt> reads RDF files into
a StatementSink, either an RdfWriter or a Store.  The default is to read files in RDF/XML format (with the RdfXmlReader).  We specified the output as <tt>n3:congress.n3</tt>, which means to write the data in Notation 3 (N3) format to the file <tt>congress.n3</tt>.  The command outputs the following:</p>

<pre class="code">
people.rdf  0m5s, 106423 statements, 19041 st/sec
bills.108.rdf  0m13s, 212142 statements, 15866 st/sec
bills.108.cosponsors.rdf  0m8s, 145743 statements, 16814 st/sec
Total Time: 0m27s, 464308 statements, 16787 st/sec
</pre>

<h2>Writing the GraphMatch Query</h2>

<p>The first query method is the GraphMatch method using my own "RSquary" query format, which is actually just plain RDF (think RDF-squared query because it's an RDF query over RDF data).  A simple RSquary query is just a graph to be matched against the target data model, here in N3 format:</p>

<pre class="code">
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
@prefix bill: &lt;tag:govshare.info,2005:rdf/usbill/&gt; .

?bill rdf:type bill:SenateBill .
?bill bill:congress "108" .
?bill bill:number "1024" .

?bill bill:cosponsor ?person .
?person foaf:name ?name .
</pre>

<p>A benefit of using N3 is that it allows entity names starting with "?" which are read in by the N3Reader as Variable objects.  (The Variable class is a subclass of BNode.)  Actually, in queries
BNodes are treated as variables too.  This makes sense because a BNode in the query graph could not possibly match a BNode in the target data model since a BNode cannot appear in two documents.  So only named entities (with URIs) and literals are used to match against the target data model</p>

<p>The query above says: Find all bindings
for the variables <tt>?bill</tt>, <tt>?person</tt>, and <tt>?name</tt>
such that 1) <tt>?bill</tt> is a Senate bill identified by congress 108 and number 1024, 2) <tt>?bill</tt> has <tt>?person</tt> as one of its cosponsors,
and 3) <tt>?name</tt> is a name of <tt>?person</tt>.</p>

<p>Save the above query as <tt>congress_query.n3</tt>.</p>

<h2>Running the Query</h2>

<p>SemWeb contains a program called <tt>rdfquery.exe</tt> which runs
a query against a target data model.  To run the query execute:</p>

<pre class="code">
$ mono rdfquery.exe n3:congress.n3 < congress_query.n3
</pre>

<p><tt>rdfquery.exe</tt> reads a query from standard input (hence the redirect) and matches it against the data sources listed in arguments on the command line.  It will take a few moments to load in the 710k statements from the congress.n3 file before it outputs the results.  The output is by default in the standard SPARQL result XML format.  Here it is, below.  (Some XML comments appear at the top to tell you how the query was executed, but that is not repeated below.)</p>

<pre class="code">
&lt;sparql xmlns="http://www.w3.org/2005/sparql-results#"&gt;
  &lt;head&gt;
    &lt;variable name="bill" /&gt;
    &lt;variable name="person" /&gt;
    &lt;variable name="name" /&gt;
  &lt;/head&gt;
  &lt;results ordered="false" distinct="false"&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/C001041&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Hillary Clinton&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/C000880&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Michael Crapo&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/L000174&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Patrick Leahy&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/M001153&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Lisa Murkowski&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/M001111&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Patty Murray&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name="bill"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/108/bills/s1024&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="person"&gt;
        &lt;uri&gt;tag:govshare.info,2005:data/us/congress/people/S000148&lt;/uri&gt;
      &lt;/binding&gt;
      &lt;binding name="name"&gt;
        &lt;literal&gt;Charles Schumer&lt;/literal&gt;
      &lt;/binding&gt;
    &lt;/result&gt;
  &lt;/results&gt;
&lt;/sparql&gt;
</pre>

<p>The query took 15 seconds to execute on my machine, with a good portion of that just loading the data from the file into memory.  We could speed things up by first putting the RDF data into a SQL database and then querying the database directly.  This way, the data is not loaded into memory and queries against the database make use of indexes already present.</p>

<!--<p>To load the data into a database, I'll use Sqlite again, so see the section of the documentation on Databases if you haven't looked at that already.  To get the data into a database, I'll modify the <tt>rdfstorage.exe</tt> command at the start as follows:</p>

<pre class="code">
$ mono rdfstorage.exe --out "sqlite:rdf:Uri=file:congress.sqlite;version=3" people.rdf bills.108.rdf bills.108.cosponsors.rdf
</pre>

<p>(We could have also read the <tt>congress.n3</tt> file directly, but to read N3 files you must add the option <tt>-in n3</tt>.)  Loading the data into Sqlite is much slower than loading it into memory (MySQL would be somewhere in the middle), but the speed-up will be worth it in the end when queries are much faster.</p>-->



<h2>SPARQL Queries</h2>

<p>For executing SPARQL queries over data sources that don't support SPARQL themselves (i.e. the MemoryStore, SQLStore, etc.),
SemWeb uses the <a href="http://sourceforge.net/projects/sparql/">SPARQL query engine by Ryan Levering</a>.  The library is written in Java, but for SemWeb I convert it to a .NET assembly using <a href="http://www.ikvm.net/">IKVM</a>.</p>

<p>The advantage of SPARQL over RSquary is that it supports much more complex queries, including
optional statements, disjunctions/unions, and special filters.</p>

<p>The query above equivalently in SPARQL is:</p>

<pre class="code">
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX bill: &lt;tag:govshare.info,2005:rdf/usbill/&gt;
SELECT ?bill ?person ?name
WHERE {
   ?bill rdf:type bill:SenateBill .
   ?bill bill:congress "108" .
   ?bill bill:number "1024" .
   ?bill bill:cosponsor ?person .
   ?person foaf:name ?name .
}
</pre>

<p>Put that in <tt>congress_query.sparql</tt> and then run it with:</p>

<pre class="code">
$ mono rdfquery.exe -type sparql n3:congress.n3 < congress_query.sparql
</pre>

<p>This has the same output as above.</p>

<p>You can also use the <tt>rdfquery</tt> tool to query a remote SPARQL end point. Invoke it like this:</p>

<pre class="code">
echo "DESCRIBE &lt;tag:govshare.info,2005:data/us&gt;" | \
	mono bin/rdfquery.exe -type sparql sparql-http:http://www.govtrack.us/sparql
</pre>

<h2>Running Queries Programmatically</h2>

<p>To run a query from a program, you need to 1) create a <tt>Query</tt> object, 2) create a <tt>QueryResultSink</tt> object that will receive the results of the query, and 3) run <tt>Run</tt> on the <tt>Query</tt> object.</p>

<p>There are two types of <tt>Query</tt> objects, <tt>GraphMatch</tt> objects that perform simple entailment queries (i.e. RSquary) and <tt>SPARQL</tt> objects that perform SPARQL queries.</p>

<p>The <tt>GraphMatch</tt> classs takes a graph with variables and figures out all of the ways the variables can be assigned to ("bound") to values in the target data model so that the statements in the query are all found in the target data model.  Each set of variable assignments becomes a result.</p>

<p>Create a <tt>GraphMatch</tt> class (in the <tt>SemWeb.Query</tt> namespace) by passing to its constructor a <tt>StatementSource</tt>.  Remember that <tt>RdfReader</tt>s and <tt>MemoryStore</tt>s are <tt>StatementSource</tt>s, so you can either pass it a reader over a file or a store in which you've programmatically constructed the query.</p>

<pre class="code">
Query query = new GraphMatch(new N3Reader(queryfile));
</pre>

<p>Then load the data that the query will be run against:</p>

<pre class="code">
MemoryStore data = new MemoryStore();
data.Import(new N3Reader(datafile));
</pre>

<p>Next, create a <tt>QueryResultSink</tt>.  This class has an <tt>Add</tt> method that receives an array of variable bindings which is called for each query result.  The variable bindings say how each variable in the query was bound to a resource in the target data model.  There is one implementation of this class in SemWeb, then <tt>SparqlXmlQuerySink</tt> which is the standardized XML output format for SPARQL results.  Note that you can use this output format with any <tt>Query</tt> object, not just the <tt>Sparql</tt> class.  The constructor takes a <tt>TextWriter</tt> or <tt>XmlWriter</tt> to which the results are written.</p>

<pre class="code">
QueryResultSink sink = new SparqlXmlQuerySink(Console.Out);
</pre>

<p>You can, of course, create your own subclass of <tt>QueryResultSink</tt> which you will have to do if you want to do anything interesting with the results of the query.  Here's an example <tt>QueryResultSink</tt> which simply prints the variable bindings to the Console.  (Note that there are several other methods that can be overridden which are executed at the start and end of the query.)</p>

<pre class="code">
public class PrintQuerySink : QueryResultSink {
    public override bool Add(VariableBindings result) {
        foreach (Variable var in result.Variables) {
            if (var.LocalName != null && result[var] != null) {
                Console.WriteLine(var.LocalName + " ==> " + result[var].ToString());
            }
            Console.WriteLine();
        }
        return true;
    }
}
</pre>

<p>Lastly, run the query with <tt>Run</tt>, passing it the target data model and the result sink.</p>

<pre class="code">
query.Run(data, sink);
</pre>

<p>To create a SPARQL query instead, construct a new <tt>SparqlEngine</tt> object (in the <tt>SemWeb.Query</tt> namespace but in the separate <tt>SemWeb.Sparql.dll</tt> assembly!).</p>

<pre class="code">
Query query = new SparqlEngine(new StreamReader(queryfile));
</pre>

<p>Run the query the same as with <tt>GraphMatch</tt>.  There are several types of SPARQL queries, not all of which result a list of variable bindings.  For instance, the DESCRIBE and CONSTRUCT query types return RDF triples.  You can run queries generically and output the results to a <tt>TextWriter</tt> just by passing a <tt>TextWriter</tt> to <tt>Run</tt> instead of a <tt>QueryResultSink</tt>.  Or, see the API documentation on the <tt>Sparql</tt> class for more control over the output of SPARQL queries.</p>

<p>An entire program for querying is below:</p>

<pre class="code" file="../examples/query.cs">// This example runs a query.

using System;
using System.IO;

using SemWeb;
using SemWeb.Query;

public class Example {

    public static void Main(string[] argv) {
        if (argv.Length &lt; 3) {
            Console.WriteLine("Usage: query.exe format queryfile datafile");
            return;
        }
        
        string format = argv[0];
        string queryfile = argv[1];
        string datafile = argv[2];
    
        Query query;
        
        if (format == "rsquary") {
            // Create a simple-entailment "RSquary" query
            // from the N3 file.
            query = new GraphMatch(new N3Reader(queryfile));
        } else {
            // Create a SPARQL query by reading the file's
            // contents.
            query = new SparqlEngine(new StreamReader(queryfile));
        }
    
        // Load the data file from disk
        MemoryStore data = new MemoryStore();
        data.Import(new N3Reader(datafile));
        
        // First, print results in SPARQL XML Results format...
        
        // Create a result sink where results are written to.
        QueryResultSink sink = new SparqlXmlQuerySink(Console.Out);
        
        // Run the query.
        query.Run(data, sink);
        
        // Second, print the results via our own custom QueryResultSink...
        query.Run(data, new PrintQuerySink());
    }

    public class PrintQuerySink : QueryResultSink {
        public override bool Add(VariableBindings result) {
            foreach (Variable var in result.Variables) {
                if (var.LocalName != null &amp;&amp; result[var] != null) {
                    Console.WriteLine(var.LocalName + " ==&gt; " + result[var].ToString());
                }
                Console.WriteLine();
            }
            return true;
        }
    }
}
</pre>

<h2>Querying Remote SPARQL Endpoints</h2>

<p>It is also possible to run SPARQL queries directly against remote
HTTP endpoints. The <tt>rdfquery.exe</tt> command-line program can be
used to run queries directly. Take the following query in the file
"dbp.q" to query the <a href="http://dbpedia.org">DBpedia</a> database (a semantified Wikipedia)
for all statements that use the literal "John McCain":</p>

<pre class="code">SELECT * WHERE { ?s ?p "John McCain" . }</pre>

<p>Run this against the remote SPARQL endpoint at <tt>http://DBpedia.org/sparql</tt>
using:</p>

<pre class="code">mono bin/rdfquery.exe sparql-http:http://DBpedia.org/sparql -type sparql < dbp.q</pre>

<p>The output is given below:</p>

<pre class="code">&lt;sparql>
  &lt;head>
    &lt;variable name="s"/>
    &lt;variable name="p"/>
  &lt;/head>
  &lt;results distinct="false" ordered="true">
    &lt;result>
      &lt;binding name="s">&lt;uri>http://dbpedia.org/resource/John_McCain&lt;/uri>&lt;/binding>
      &lt;binding name="p">&lt;uri>rdfs:label&lt;/uri>&lt;/binding>
    &lt;/result>
    &lt;result>
      &lt;binding name="s">&lt;uri>http://dbpedia.org/resource/John_McCain&lt;/uri>&lt;/binding>
      &lt;binding name="p">&lt;uri>http://dbpedia.org/property/name&lt;/uri>&lt;/binding>
    &lt;/result>
  &lt;/results>
&lt;/sparql></pre>

<p>It is also possible to query remote endpoints programmatically using
the <tt>SemWeb.Remote.SparqlHttpSource</tt> class. For example:</p>

<pre class="code">SparqlHttpSource source = new SparqlHttpSource("http://DBpedia.org/sparql");
source.RunSparqlQuery("SELECT * WHERE { ?s ?p \"John McCain\" . }", Console.Out);
   (or)
source.RunSparqlQuery("SELECT * WHERE { ?s ?p \"John McCain\" . }", new SparqlXmlQuerySink(Console.Out));</pre>

<p>There are other overloads of <tt>RunSparqlQuery</tt> that provide better access
to the results than dumping the output to a TextWriter. See the API documentation
for details.</p>

	</body>
</html>