File: foss4g13-abstract.txt

Data conversion combined with model and coordinate transformation 
from a source to a target datastore (files, databases) is a recurring task 
in almost every geospatial project. 
This process is often referred to as ETL (Extract, Transform, Load).
Source and/or target geo-data formats are increasingly encoded as GML (Geography Markup Language), either 
as flat records, so-called Simple Features, or, more and more, using 
domain-specific, object-oriented OGC/ISO GML Application Schemas. 

GML Application Schemas are, for example, heavily used within the INSPIRE Data Harmonization effort in Europe. 
Many National Mapping and Cadastral Agencies (NMCAs) use GML-encoded datasets as their bulk 
format for download and exchange, including via Web Feature Services (WFSs).
As geospatial professionals we are often confronted with ETL tasks involving (complex) GML 
or worse: "GML-lookalikes", which are often XML Schemas with GML-namespaced elements embedded in them.

Luckily, in many cases GDAL/OGR, the Swiss Army Knife for geo-data conversion, 
can do the job. If "ogr2ogr" sounds like gibberish to you, check out http://gdal.org !
But when complex (some say rich) GML Application Schemas are involved,
data conversion can be a daunting task for which GDAL/OGR alone is not sufficient. 
Firstly, complex data model transformations often have to be applied. 
In addition, we may be confronted with the bulkiness of 
GML: Megabyte- to Gigabyte-sized files, deeply nested elements in which the nuggets, the actual attribute values,
reside, trees of .zip files and possibly more nasty surprises once we have unboxed a GML delivery. 
High memory and CPU consumption and long processing hours, up to complete machine lockup, can be 
the side effects of naive GML processing. 
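
To make the memory problem concrete, here is a minimal sketch of incremental parsing with Python's standard library, as opposed to loading a complete document tree. It is not Stetl code. The sample document, the element name "addr" and the GML namespace URI used (GML 3.2 files use a different one) are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the GML 2/3.1 namespace; GML 3.2 uses a different URI.
GML_NS = "http://www.opengis.net/gml"
FEATURE_MEMBER = "{%s}featureMember" % GML_NS


def count_feature_members(source):
    """Count gml:featureMember elements while parsing incrementally."""
    count = 0
    # iterparse reports each element once its end tag has been read,
    # so we never need the complete tree before we can start working.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == FEATURE_MEMBER:
            count += 1
            # Drop the children and text of the feature we just handled.
            # The emptied element itself stays attached to the root,
            # which is acceptable for a sketch.
            elem.clear()
    return count


# Hypothetical two-feature sample standing in for a bulky GML delivery.
sample = b"""<?xml version="1.0"?>
<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember><addr id="1"/></gml:featureMember>
  <gml:featureMember><addr id="2"/></gml:featureMember>
</gml:FeatureCollection>"""

print(count_feature_members(io.BytesIO(sample)))
```

The same loop works on an open file object, so a Gigabyte-sized file is read piece by piece instead of being held in memory as one tree.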

Within the FOSS4G world we can resort to higher level, 
GUI-based ETL tools such as GeoKettle, the Humboldt tools and Talend GeoSpatial. These are very powerful
tools in their own right; check them out as well. Some of us, like the author, prefer to stay closer
to GDAL/OGR, to XSLT for model transforms, some command line tools and some Python scripting, but without 
having to write a complete, ad-hoc ETL-program each time. This is the space where Stetl tries to fit in, 
so read on.

So we already have great FOSS tools for XML/GML parsing, data-conversion and 
model-transformation like GDAL/OGR (ogr2ogr!), XSLT (Extensible 
Stylesheet Language Transformations, for transforming XML) and native XML-parsing libraries like libxml2.
Each individual tool/library is extremely powerful and performant by itself. 
But we would like to combine these tools. Take for example flat, national address data in a PostGIS 
database that we need to transform to multiple INSPIRE Application Schema GML files.
Each individual FOSS tool can handle part of the ETL: ogr2ogr for converting
from PostGIS (including coordinate transformation) into simple feature GML, 
XSLT (xsltproc/libxslt) to transform
the resulting flat GML to rich INSPIRE GML. But with millions of addresses we cannot 
simply use a single in-memory GML data structure (DOM) or a single intermediate GML file.  
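
The two-step division of labour described above might look as follows when scripted by hand. This is only a sketch: the connection string, layer name, file names, stylesheet and the target reference system EPSG:4258 are hypothetical placeholders. The commands are built and printed, not executed. Note that this naive version relies on exactly the single intermediate GML file that does not scale to millions of addresses.

```python
import shlex


def build_commands(pg_conn, layer, flat_gml, stylesheet, rich_gml,
                   target_srs="EPSG:4258"):
    """Return the ogr2ogr and xsltproc command lines as argument lists."""
    # Step 1: PostGIS -> flat (simple feature) GML, reprojecting on the way.
    ogr_cmd = ["ogr2ogr", "-f", "GML", "-t_srs", target_srs,
               flat_gml, "PG:" + pg_conn, layer]
    # Step 2: flat GML -> rich GML via an XSLT stylesheet.
    xslt_cmd = ["xsltproc", "-o", rich_gml, stylesheet, flat_gml]
    return ogr_cmd, xslt_cmd


def as_shell(cmd):
    """Render an argument list as a shell-quoted string for display."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# All names below are placeholders, not taken from a real project.
ogr_cmd, xslt_cmd = build_commands(
    pg_conn="dbname=addresses",
    layer="address",
    flat_gml="addresses_flat.gml",
    stylesheet="flat2inspire.xsl",
    rich_gml="addresses_inspire.gml",
)
print(as_shell(ogr_cmd))
print(as_shell(xslt_cmd))
```

Passing each list to subprocess.run would execute the steps one after the other, provided GDAL/OGR and xsltproc are installed.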

Add Python and a configuration convention to this equation and we have 
Stetl: Streaming ETL. Stetl is a lightweight, geospatial ETL (Extract Transform Load) 
framework written in Python. ETL-processing with Stetl is driven from a configuration
file. Within a Stetl configuration file a chain of ETL-processing modules
is declared through which the data flows ("streams"). A module may be an input,
filter or output module. Modules have input and output data types declared such that only
compatible modules can be connected. However, Stetl does not define a grand internal data structure
to which all data is mapped as many ETL-tools do. Data formats are kept close to the
external tools that Stetl uses. Stetl comes with pre-defined modules for GML-parsing, 
XSLT processing, XSD Validation, PostGIS/OGR input and output, GML-splitting and many more.
Stetl calls on the above tools like OGR, libxslt and libxml2 via their native interfaces.
Stetl gains further speed because no intermediate file storage 
is used, and through other means such as native string buffers. 
For example, large XML/GML files can be split into manageable 
documents and streamed into an XSLT filter module. Stetl modules are of course extensible 
and can be user-defined. Reusable ETL configurations, invoked through parameterized command-line scripts,
can be defined without programming. 
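
The idea of a configuration-declared chain of typed, streaming modules can be sketched in plain Python. This is not Stetl's actual API or configuration format: the section names, the "kind" option, the class names and the data type labels are all made up for illustration. The sketch shows a chain read from a configuration text, a compatibility check between neighbouring modules, and records streamed in small batches rather than collected as a whole.

```python
import configparser

# Hypothetical configuration layout, invented for this sketch.
CONFIG = """
[etl]
chain = source|batcher|sink

[source]
kind = list_input
items = a,b,c

[batcher]
kind = batch_filter
size = 2

[sink]
kind = collect_output
"""


class ListInput:
    consumes, produces = None, "record"

    def __init__(self, options):
        self.items = options["items"].split(",")

    def run(self, stream):
        yield from self.items


class BatchFilter:
    consumes, produces = "record", "batch"

    def __init__(self, options):
        self.size = int(options["size"])

    def run(self, stream):
        batch = []
        for record in stream:
            batch.append(record)
            if len(batch) == self.size:
                yield batch  # hand over a manageable chunk
                batch = []
        if batch:
            yield batch  # remaining records


class CollectOutput:
    consumes, produces = "batch", None

    def __init__(self, options):
        self.received = []

    def run(self, stream):
        for batch in stream:
            self.received.append(batch)


KINDS = {"list_input": ListInput, "batch_filter": BatchFilter,
         "collect_output": CollectOutput}


def build_chain(config_text):
    """Instantiate the modules named in the [etl] chain, in order."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = parser["etl"]["chain"].split("|")
    return [KINDS[parser[name]["kind"]](parser[name]) for name in names]


def run_chain(modules):
    """Check type compatibility, then let the output pull data through."""
    for up, down in zip(modules, modules[1:]):
        if up.produces != down.consumes:
            raise TypeError("%s cannot feed %s"
                            % (type(up).__name__, type(down).__name__))
    stream = None
    for module in modules[:-1]:
        stream = module.run(stream)  # generators: nothing flows yet
    modules[-1].run(stream)


chain = build_chain(CONFIG)
run_chain(chain)
print(chain[-1].received)
```

Because the input and filter are generators, each batch is produced only when the output asks for it, so at most one batch is in flight at any time.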

Stetl evolved from and is used within the INSPIRE-FOSS project (http://inspire-foss.org).
Here for example, Dutch national addresses (BAG) were transformed into INSPIRE Addresses GML 
(files and database). Special Stetl integration modules are available to extract and publish 
data from/to a deegree WFS and deegree "Blobstore-database". The combination Stetl/deegree is an ideal 
setup for INSPIRE deployments.

Other Dutch national datasets like Top10NL and BGT (Dutch topo vector datasets)
have been completely and successfully transformed. Work is in progress to use Stetl as
the basis for NLExtract (http://nlextract.nl), a project that provides ETL tools for Dutch
open geo-datasets. Stetl development is now (April 2013) in an initial phase and takes place on 
GitHub. The current version is workable, but we hope to present a v1.0 at FOSS4G with more
documentation and as a standard Python package via PyPI. The main link is: 
http://stetl.org (now links to GitHub). 
To get started, find some basic examples here: https://github.com/geopython/stetl/tree/master/examples/basics.

This presentation will gently introduce the above "GML challenge" and explain the basics of Stetl
starting with some simple examples, building up to more complex cases like INSPIRE 
transformations and integration with the deegree WFS/WMS server (http://deegree.org).