# Using bio-vcf with RDF
bio-vcf supports many output formats. In this exercise we load VCF data
into a triple store (4store) and run some SPARQL queries against it.
## Install and start 4store
### On GNU Guix
See https://github.com/pjotrp/guix-notes/blob/master/packages/4store.org
### On Debian
Become root and install the dependencies
```sh
su
apt-get install avahi-daemon
apt-get install raptor-utils
exit
```
As a normal user, install the SPARQL client and curl with Guix
```sh
guix package -i sparql-query curl
```
Then, again as root (or another user), initialize and start the server
```
su
export PATH=/home/user/.guix-profile/bin:$PATH
mkdir -p /var/lib/4store
dbname=test
4s-backend-setup $dbname
4s-backend $dbname
4s-httpd -p 8000 $dbname
```
Point a web browser to http://localhost:8000/status/ to check that the server is running.
Open a new terminal as a normal user.
Generate RDF with a bio-vcf template (saved here as rdf_template.erb)
```ruby
=HEADER
@prefix : <http://biobeat.org/rdf/ns#> .
=BODY
<%
id = ['chr'+rec.chr,rec.pos,rec.alt].join('_')
%>
:<%= id %>
:query_id "<%= id %>";
:chr "<%= rec.chr %>" ;
:alt "<%= rec.alt.join("") %>" ;
:pos <%= rec.pos %> .
```
so that each record looks like
```
:chrX_134713855_A
:query_id "chrX_134713855_A";
:chr "X" ;
:alt "A" ;
:pos 134713855 .
```
Test the template with rapper using [gatk_exome.vcf](https://github.com/pjotrp/bioruby-vcf/blob/master/test/data/input/gatk_exome.vcf)
```sh
cat gatk_exome.vcf |bio-vcf -v --template rdf_template.erb
cat gatk_exome.vcf |bio-vcf -v --template rdf_template.erb > my.rdf
rapper -i turtle my.rdf
```
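To make the template logic concrete, here is a minimal Python sketch that mirrors what the ERB template does for a single record. The function name `turtle_record` and its arguments are assumptions for illustration and are not part of bio-vcf; the sketch assumes `alt` is a list of alleles, as the `rec.alt.join("")` call in the template suggests.

```python
HEADER = "@prefix : <http://biobeat.org/rdf/ns#> ."


def turtle_record(chrom, pos, alts):
    """Render one variant as Turtle, following the template's BODY section."""
    # Ruby's Array#join flattens nested arrays, so the template's
    # ['chr'+rec.chr, rec.pos, rec.alt].join('_') puts every allele in the id.
    ident = "_".join(["chr" + chrom, str(pos)] + list(alts))
    alt = "".join(alts)
    return "\n".join([
        f":{ident}",
        f':query_id "{ident}";',
        f':chr "{chrom}" ;',
        f':alt "{alt}" ;',
        f":pos {pos} .",
    ])


# Reproduce the example record shown earlier.
print(HEADER)
print(turtle_record("X", 134713855, ["A"]))
```

The printed text can be checked with `rapper -i turtle` in the same way as the bio-vcf output.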
If rapper reports no errors, load the file into 4store
```bash
rdf=my.rdf
uri=http://localhost:8000/data/http://biobeat.org/data/$rdf
curl -X DELETE "$uri"
curl -T "$rdf" -H 'Content-Type: application/x-turtle' "$uri"
# Expected response:
#   201 imported successfully
#   This is a 4store SPARQL server
```
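The same two HTTP calls can be prepared from Python. The sketch below uses only the standard library and assumes the graph URL layout shown in the curl commands (`<server>/data/<graph URI>`); the helper names `graph_url`, `delete_request`, `upload_request` and `send` are made up for this example.

```python
import urllib.error
import urllib.request


def graph_url(server, graph):
    # Same layout as the $uri variable in the curl example.
    return server.rstrip("/") + "/data/" + graph


def delete_request(server, graph):
    # Equivalent of: curl -X DELETE $uri
    return urllib.request.Request(graph_url(server, graph), method="DELETE")


def upload_request(server, graph, turtle_bytes):
    # Equivalent of: curl -T my.rdf -H 'Content-Type: application/x-turtle' $uri
    return urllib.request.Request(
        graph_url(server, graph),
        data=turtle_bytes,
        method="PUT",
        headers={"Content-Type": "application/x-turtle"},
    )


def send(req):
    """Send a prepared request; return (status, body), also for HTTP error codes."""
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode()
```

With the server running, `send(delete_request(server, graph))` followed by `send(upload_request(server, graph, open("my.rdf", "rb").read()))` would replace the graph, where `server` is `http://localhost:8000` and `graph` is `http://biobeat.org/data/my.rdf`.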
A first SPARQL query, saved as sparql1.rq
```sparql
SELECT ?id
WHERE
{
?id <http://biobeat.org/rdf/ns#chr> "X".
}
```
```
cat sparql1.rq |sparql-query "http://localhost:8000/sparql/" -p
┌──────────────────────────────────────────────┐
│ ?id │
├──────────────────────────────────────────────┤
│ <http://biobeat.org/rdf/ns#chrX_107911706_C> │
│ <http://biobeat.org/rdf/ns#chrX_55172537_A> │
│ <http://biobeat.org/rdf/ns#chrX_134713855_A> │
└──────────────────────────────────────────────┘
```
A simple Python query may look like
```python
import requests

# SPARQL endpoint of the local 4store server
host = "http://localhost:8000/sparql/"
query = """
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
} LIMIT 10
"""
# Send the query as a form-encoded POST and ask for plain-text output
r = requests.post(host, data={"query": query, "output": "text"})
# print(r.url)
print(r.text)
```
With the `SELECT ?id` query from the first example in place of the generic one, this renders
```
?id
<http://biobeat.org/rdf/ns#chrX_107911706_C>
<http://biobeat.org/rdf/ns#chrX_55172537_A>
<http://biobeat.org/rdf/ns#chrX_134713855_A>
```
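Because the identifiers encode chromosome, position and allele, the text output can be turned back into values. The following is a minimal sketch, assuming the output format shown above (a header line followed by one `<URI>` per line) and identifiers of the form `chr<chr>_<pos>_<alt>`; `parse_ids` is a hypothetical helper name.

```python
NS = "http://biobeat.org/rdf/ns#"


def parse_ids(text, ns=NS):
    """Return (chr, pos, alt) tuples parsed from single-column text output."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    for ln in lines[1:]:  # skip the "?id" header line
        uri = ln.strip("<>")
        local = uri[len(ns):] if uri.startswith(ns) else uri
        chrom, pos, alt = local.split("_", 2)
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        rows.append((chrom, int(pos), alt))
    return rows


sample = """?id
<http://biobeat.org/rdf/ns#chrX_107911706_C>
<http://biobeat.org/rdf/ns#chrX_55172537_A>
<http://biobeat.org/rdf/ns#chrX_134713855_A>
"""
for row in parse_ids(sample):
    print(row)
```

In practice `sample` would be replaced by `r.text` from the request.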
A working example for the server at
http://guix.genenetwork.org, using the PREFIX that matches the data on that server:
```python
#! /usr/bin/env python
import requests

host = "http://guix.genenetwork.org/sparql/"
query = """
PREFIX : <http://biobeat.org/rdf/pjotr/ns#>
SELECT ?id ?chr ?pos ?alt
WHERE
{
  # Select variants on chromosome X or chromosome 1
  { ?id :chr "X" . }
  UNION
  { ?id :chr "1" . }
  ?id :chr ?chr .
  ?id :alt ?alt .
  ?id :pos ?pos .
  # Keep only positions beyond 107911705
  FILTER (?pos > 107911705) .
}
"""
r = requests.post(host, data={"query": query, "output": "text"})
print(r.text)
```
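To see what the UNION and FILTER in this query select, the same logic can be sketched over a small in-memory list in plain Python. This is only an illustration of the selection, not how a triple store evaluates the query; the first three records reuse identifiers from the query output earlier in this document and the last one is made up.

```python
# Records as (id, chr, pos, alt).
records = [
    ("chrX_107911706_C", "X", 107911706, "C"),
    ("chrX_55172537_A", "X", 55172537, "A"),
    ("chrX_134713855_A", "X", 134713855, "A"),
    ("chr2_100_T", "2", 100, "T"),  # hypothetical record on another chromosome
]


def select(records, chromosomes=("X", "1"), min_pos=107911705):
    """Keep records on one of the chromosomes (UNION) with pos > min_pos (FILTER)."""
    return [r for r in records if r[1] in chromosomes and r[2] > min_pos]


for row in select(records):
    print(row)
```

The record at position 55172537 is dropped by the position filter, and the chromosome 2 record matches neither branch of the union.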
## EBI
EBI SPARQL has some advanced examples of queries, such as
```
# Endpoint: https://www.ebi.ac.uk/rdf/services/ensembl/sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX identifiers: <http://identifiers.org/>
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
PREFIX ensembltranscript: <http://rdf.ebi.ac.uk/resource/ensembl.transcript/>
PREFIX ensemblexon: <http://rdf.ebi.ac.uk/resource/ensembl.exon/>
PREFIX ensemblprotein: <http://rdf.ebi.ac.uk/resource/ensembl.protein/>
PREFIX ensemblterms: <http://rdf.ebi.ac.uk/terms/ensembl/>
SELECT DISTINCT ?transcript ?id ?typeLabel ?reference ?begin ?end ?location {
?transcript obo:SO_transcribed_from ensembl:ENSG00000139618 ;
a ?type;
dc:identifier ?id .
OPTIONAL {
?transcript faldo:location ?location .
?location faldo:begin [faldo:position ?begin] .
?location faldo:end [faldo:position ?end ] .
?location faldo:reference ?reference .
}
OPTIONAL {?type rdfs:label ?typeLabel}
}
```
See https://www.ebi.ac.uk/rdf/services/ensembl/sparql for more examples.
# Exercise
Today's exercise is to create a graph of RDF triples, using bio-vcf and/or a small
program, and to define a SPARQL query on it.
The more interesting the graph and the query, the better.