Commit 01657309 authored by Bas Couwenberg

New upstream version 1.2+ds

parent e85bddc3
Changes
=======
v1.2 - july 7, 2018
-------------------
See closed issues in Milestone 1.2: https://github.com/geopython/stetl/milestone/8?closed=1
Most important changes are related to deployment in Docker and Kubernetes environments, dealing
with (env) variables, Stetl arguments and logging, for example:
- issue #71: Allow Environment vars to substitute/override config template arg-variables
- issue #72: Allow multiple -a args for Stetl main prog. Multiple -a arguments make it
simpler to override, for example, default options.
- #68 Stetl should not output passwords and other particular data in its log
v1.1.1 - november 7, 2017
-------------------------
......
Metadata-Version: 1.1
Metadata-Version: 1.2
Name: Stetl
Version: 1.1
Summary: Stetl provides transformation for spatial data
Version: 1.2
Summary: Transformation and conversion framework (ETL) mainly for geospatial data
Home-page: http://github.com/geopython/stetl
Author: Just van den Broecke
Author-email: justb4@gmail.com
Maintainer: Just van den Broecke
Maintainer-email: justb4@gmail.com
License: GNU GPL v3
Description: # Stetl - Streaming ETL
......@@ -12,7 +14,7 @@ Description: # Stetl - Streaming ETL
[![Build Status](https://travis-ci.org/geopython/stetl.png)](https://travis-ci.org/geopython/stetl)
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg)](http://stetl.readthedocs.org/en/latest)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/justb4/stetl)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/geopython/stetl)
Notice: the Stetl GH repo is now at the [GeoPython GH organization](https://github.com/geopython).
......@@ -54,7 +56,7 @@ Description: # Stetl - Streaming ETL
Most of the data conversions within the [Dutch NLExtract Project](https://github.com/nlextract/NLExtract) apply Stetl.
Stetl also proved to be very effective in [IoT-related transformations involving the SensorWeb/SOS](https://github.com/Geonovum/smartemission).
Stetl also proved to be very effective in [IoT-related transformations involving the SensorWeb/SOS](https://github.com/smartemission).
## Examples
......@@ -63,7 +65,7 @@ Description: # Stetl - Streaming ETL
## Installation
Stetl can be installed via PyPi `pip install stetl` and recently as a [Stetl Docker image](https://hub.docker.com/r/justb4/stetl).
Stetl can be installed via PyPi `pip install stetl` and recently as a [Stetl Docker image](https://hub.docker.com/r/geopython/stetl).
More on [installation in the documentation](http://www.stetl.org/en/latest/install.html).
## Contributing
......@@ -80,7 +82,7 @@ Description: # Stetl - Streaming ETL
Stetl originated in the INSPIRE-FOSS project: [2009-2013 now archived](https://github.com/justb4/inspire-foss).
Since then Stetl evolved into a wider use like
transforming [Dutch GML-based Open Datasets](https://github.com/nlextract/NLExtract) such as IMGEO/BGT (Large Scale Topography)
and IMKAD/BRK (Cadastral Data).
and IMKAD/BRK (Cadastral Data) and [Sensor Data Transformation and Calibration](https://github.com/smartemission/docker-se-stetl).
## Finally
......@@ -96,6 +98,19 @@ Description: # Stetl - Streaming ETL
Changes
=======
v1.2 - july 7, 2018
-------------------
See closed issues in Milestone 1.2: https://github.com/geopython/stetl/milestone/8?closed=1
Most important changes are related to deployment in Docker and Kubernetes environments, dealing
with (env) variables, Stetl arguments and logging, for example:
- issue #71: Allow Environment vars to substitute/override config template arg-variables
- issue #72: Allow multiple -a args for Stetl main prog. Multiple -a arguments make it
simpler to override, for example, default options.
- #68 Stetl should not output passwords and other particular data in its log
v1.1.1 - november 7, 2017
-------------------------
......@@ -206,9 +221,9 @@ Description: # Stetl - Streaming ETL
Thanks to Tom Kralidis for helping out to move from personal repo to https://github.com/geopython organization.
Keywords: etl xsl gdal gis vector feature data
Keywords: etl xsl gdal gis vector feature data gml xml
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
......
......@@ -4,7 +4,7 @@ Stetl, streaming ETL, pronounced "staedl", is a lightweight ETL-framework for ge
[![Build Status](https://travis-ci.org/geopython/stetl.png)](https://travis-ci.org/geopython/stetl)
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg)](http://stetl.readthedocs.org/en/latest)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/justb4/stetl)
[![Gitter Chat](http://img.shields.io/badge/chat-online-brightgreen.svg)](https://gitter.im/geopython/stetl)
Notice: the Stetl GH repo is now at the [GeoPython GH organization](https://github.com/geopython).
......@@ -46,7 +46,7 @@ of GML/XML-based National geo-datasets to for example PostGIS.
Most of the data conversions within the [Dutch NLExtract Project](https://github.com/nlextract/NLExtract) apply Stetl.
Stetl also proved to be very effective in [IoT-related transformations involving the SensorWeb/SOS](https://github.com/Geonovum/smartemission).
Stetl also proved to be very effective in [IoT-related transformations involving the SensorWeb/SOS](https://github.com/smartemission).
## Examples
......@@ -55,7 +55,7 @@ Best is to start with the [basic examples](examples/basics)
## Installation
Stetl can be installed via PyPi `pip install stetl` and recently as a [Stetl Docker image](https://hub.docker.com/r/justb4/stetl).
Stetl can be installed via PyPi `pip install stetl` and recently as a [Stetl Docker image](https://hub.docker.com/r/geopython/stetl).
More on [installation in the documentation](http://www.stetl.org/en/latest/install.html).
## Contributing
......@@ -72,7 +72,7 @@ review the [guidelines for contributing](CONTRIBUTING.md).
Stetl originated in the INSPIRE-FOSS project: [2009-2013 now archived](https://github.com/justb4/inspire-foss).
Since then Stetl evolved into a wider use like
transforming [Dutch GML-based Open Datasets](https://github.com/nlextract/NLExtract) such as IMGEO/BGT (Large Scale Topography)
and IMKAD/BRK (Cadastral Data).
and IMKAD/BRK (Cadastral Data) and [Sensor Data Transformation and Calibration](https://github.com/smartemission/docker-se-stetl).
## Finally
......
1.1
\ No newline at end of file
1.2
\ No newline at end of file
......@@ -8,6 +8,7 @@
from stetl.main import parse_args
from stetl.etl import ETL
from stetl.util import Util
import sys
log = Util.get_log('main')
......@@ -18,12 +19,12 @@ def main():
Args:
-c --config <config_file> the Stetl config file.
-s --section <section_name> the section in the Stetl config (ini) file to execute (default is [etl]).
-a --args <arglist> substitutable args for symbolic, {arg}, values in Stetl config file, in format "arg1=foo arg2=bar" etc.
-a --args <arglist> zero or more substitutable args for symbolic, {arg}, values in Stetl config file, in format -a arg1=foo -a arg2=bar etc.
-h --help <subject> Get component documentation like its configuration parameters, e.g. stetl doc stetl.inputs.fileinput.FileInput
"""
args = parse_args()
args = parse_args(sys.argv[1:])
if args.config_file:
# Do the ETL
......
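The switch to `parse_args(sys.argv[1:])` and the repeatable `-a` option can be illustrated with `argparse`. This is a minimal sketch, not the actual `stetl.main.parse_args`: only the attribute name `config_file` appears in the diff, while `config_section` and `config_args` are assumed names chosen for illustration.

```python
import argparse


def parse_args(argv):
    """Parse Stetl-like options from an explicit argument list."""
    parser = argparse.ArgumentParser(prog='stetl')
    parser.add_argument('-c', '--config', dest='config_file')
    # Assumed attribute names below; the diff does not show them.
    parser.add_argument('-s', '--section', dest='config_section', default='etl')
    # action='append' collects every -a occurrence into a list, in order.
    parser.add_argument('-a', '--args', dest='config_args',
                        action='append', default=[])
    return parser.parse_args(argv)


args = parse_args(['-c', 'etl.cfg', '-a', 'default.args', '-a', 'db_user=docker'])
print(args.config_file, args.config_args)
```

Taking the argument list as a parameter, rather than reading `sys.argv` inside the function, makes the parser easy to call from tests.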
......@@ -111,6 +111,10 @@ Components: Filters
:members:
:show-inheritance:
.. automodule:: stetl.filters.sieve
:members:
:show-inheritance:
.. automodule:: stetl.filters.stringfilter
:members:
:show-inheritance:
......@@ -131,6 +135,30 @@ Components: Filters
:members:
:show-inheritance:
.. automodule:: stetl.filters.execfilter
:members:
:show-inheritance:
.. automodule:: stetl.filters.nullfilter
:members:
:show-inheritance:
.. automodule:: stetl.filters.packetbuffer
:members:
:show-inheritance:
.. automodule:: stetl.filters.packetwriter
:members:
:show-inheritance:
.. automodule:: stetl.filters.regexfilter
:members:
:show-inheritance:
.. automodule:: stetl.filters.zipfileextractor
:members:
:show-inheritance:
Components: Outputs
-------------------
......
......@@ -52,9 +52,9 @@ copyright = u'2013+, Just van den Broecke'
# built documents.
#
# The short X.Y version.
version = '1.1'
version = '1.2-dev'
# The full version, including alpha/beta/rc tags.
release = '1.1'
release = '1.2-dev'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -9,5 +9,4 @@ All development is done via GitHub: see https://github.com/geopython/stetl.
Contact the main author Just van den Broecke via email at just@justobjects.nl.
Online chat via Gitter: https://gitter.im/geopython/stetl
......@@ -234,9 +234,15 @@ or even OGC WPS servers (planned).
Reusable Stetl Configs
----------------------
What we saw in the last example is that it is hard to reuse this `etl.cfg` when we have for example a different input file
or want to map to different output files. For this Stetl supports `parameter substitution`. Here command line parameters are substituted
for variables in `etl.cfg`. A variable is declared between curly brackets like `{out_xml}`. See
What we saw in the last example is that it is hard to reuse this `etl.cfg`
when we have for example a different input file
or want to map to different output files.
For this Stetl supports `config parameter substitution`.
Dynamic or secret (e.g. database credentials) parameters in `etl.cfg` are declared
symbolically and substituted at runtime via the commandline or the OS environment.
A variable is declared between curly brackets like `{out_xml}`. See
example `6_cmdargs <https://github.com/geopython/stetl/tree/master/examples/basics/6_cmdargs>`_. ::
[etl]
......@@ -254,12 +260,13 @@ example `6_cmdargs <https://github.com/geopython/stetl/tree/master/examples/basi
class = outputs.fileoutput.FileOutput
file_path = {out_xml}
Note the symbolic input, xsl and output files. We can now perform this ETL using the `stetl -a option` in two ways.
Note the symbolic input, xsl and output files. We can now perform
the ETL using the `stetl -a option` in two basic ways.
One, passing the arguments on the commandline, like ::
stetl -c etl.cfg -a "in_xml=input/cities.xml in_xsl=cities2gml.xsl out_xml=output/gmlcities.gml"
Two, passing the arguments in a properties file, here called `etl.args` (the name of the suffix .args is not significant). ::
Two, passing the arguments in a properties file, here called `etl.args` (the `.args` suffix is not significant; it could be `.env` as well). ::
stetl -c etl.cfg -a etl.args
......@@ -270,7 +277,41 @@ Where the content of the `etl.args` properties file is: ::
in_xsl=cities2gml.xsl
out_xml=output/gmlcities.gml
This makes an ETL chain highly reusable. A very elaborate Stetl config with parameter substitution can be seen in the
It is also possible to specify **multiple -a arguments**. This covers situations
where a `default.args` file contains all default arguments and a `my.args` file or explicit `-a` settings
override the default values in `default.args`. Overriding is determined by the order of
the `-a` arguments. Examples: ::
stetl -c etl.cfg -a default.args -a my.args
stetl -c etl.cfg -a default.args -a "db_user=docker db_password=pass"
stetl -c etl.cfg -a default.args -a db_user=docker -a db_password=pass
It is also possible to pass these key/value pairs via OS Environment variables.
This is especially handy in Docker-based deployments like Docker Compose and Kubernetes.
In this case the variable names need to be prefixed with `STETL_` or `stetl_` so as
not to clash with unrelated OS environment variables. A mixture of commandline args (file)
and environment vars is possible. The rule is that
*OS Environment variables always override/overrule arguments specified with -a option(s)*.
For example, the above args could also be passed as follows: ::
export stetl_in_xml="input/cities.xml"
export stetl_in_xsl="cities2gml.xsl"
export stetl_out_xml="output/gmlcities.gml"
stetl -c etl.cfg
or only override the input file name `in_xml` from `etl.args`: ::
export stetl_in_xml="input/cities2.xml"
stetl -c etl.cfg -a etl.args
or even with multiple `-a args`: ::
export stetl_in_xml="input/cities2.xml"
stetl -c etl.cfg -a etl.args -a my.args
This makes an ETL chain highly reusable.
A very elaborate Stetl config with parameter substitution can be seen in the
`Top10NL ETL <https://github.com/geopython/stetl/blob/master/examples/top10nl/etl-top10nl.cfg>`_.
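The precedence rules described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not Stetl's implementation: the helper names `parse_arg_string` and `resolve_args` are hypothetical, properties files are not handled (each `-a` value is assumed to be an inline `key=value` string), and every prefixed environment variable is accepted, whereas Stetl only considers names that occur in the config.

```python
def parse_arg_string(arg_string):
    """Parse 'k1=v1 k2=v2' into a dict (hypothetical helper)."""
    result = {}
    for item in arg_string.split():
        if '=' in item:
            key, value = item.split('=', 1)
            result[key] = value
    return result


def resolve_args(arg_strings, environ):
    """Merge -a values in order, then let STETL_/stetl_ env vars override."""
    merged = {}
    for arg_string in arg_strings:
        # Values from later -a arguments replace earlier ones.
        merged.update(parse_arg_string(arg_string))
    for name, value in environ.items():
        if name.lower().startswith('stetl_'):
            # Strip the prefix: "stetl_db_user" becomes "db_user".
            merged[name[len('stetl_'):]] = value
    return merged


resolved = resolve_args(
    ['db_user=postgres db_password=secret', 'db_user=docker'],
    {'stetl_db_password': 'pass', 'HOME': '/root'},
)
print(resolved)
```

Here the second `-a` value replaces `db_user`, and the environment variable replaces `db_password`, matching the order of precedence stated above.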
Connection Compatibility
......
# Trivial example Sieve filter.
# The input data is in input/cities.csv.
# We sieve out (passthrough) all records where city attr value
# matches "amsterdam" or "otterlo".
[etl]
chains = input_csv|attr_value_sieve|output_std,
input_csv|attr_value_sieve|output_file
[input_csv]
class = inputs.fileinput.CsvFileInput
file_path = input/cities.csv
output_format = record_array
[attr_value_sieve]
class = filters.sieve.AttrValueRecordSieve
input_format = record_array
output_format = record_array
attr_name = city
attr_values = amsterdam,otterlo
[output_std]
class = outputs.standardoutput.StandardOutput
[output_file]
class = outputs.fileoutput.FileOutput
file_path = output/cities_sieved.txt
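What the `AttrValueRecordSieve` in this config does can be approximated in plain Python. A minimal sketch, assuming the sieve simply passes through records whose attribute value is in the configured list; `sieve_records` is a hypothetical name and not part of Stetl, and the CSV content mirrors `input/cities.csv` from this example.

```python
import csv
import io

CITIES_CSV = """city,lat,lon
amsterdam,52.4,4.9
otterlo,52.101,5.773
rotterdam,51.9,4.5
eindhoven,51.44,5.47
"""


def sieve_records(records, attr_name, attr_values):
    """Keep only records whose attr_name value is one of attr_values."""
    allowed = set(attr_values)
    return [rec for rec in records if rec.get(attr_name) in allowed]


# Read the CSV into a list of dict records, comparable to a record_array.
records = list(csv.DictReader(io.StringIO(CITIES_CSV)))
# attr_values is a comma-separated string in the config, so split it first.
kept = sieve_records(records, 'city', 'amsterdam,otterlo'.split(','))
for rec in kept:
    print(dict(rec))
```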
#!/bin/sh
#
# ETL for copying a file to standard output.
#
# Shortcut to call Stetl main.py with etl config.
#
stetl -c etl.cfg
city,lat,lon
amsterdam,52.4,4.9
otterlo,52.101,5.773
rotterdam,51.9,4.5
eindhoven,51.44,5.47
[{'lat': '52.4', 'city': 'amsterdam', 'lon': '4.9'}, {'lat': '52.101', 'city': 'otterlo', 'lon': '5.773'}]
\ No newline at end of file
......@@ -4,6 +4,10 @@
#
# Author: Just van den Broecke
#
stetl -c etl.cfg
stetl=stetl
PYTHONPATH=${PYTHONPATH}:../../..
# stetl=../../../stetl/main.py
$stetl -c etl.cfg
......@@ -14,3 +14,6 @@ stetl -c etl.cfg -a "in_xml=input/cities.xml in_xsl=cities2gml.xsl out_xml=outp
# Option 2: using a properties file
stetl -c etl.cfg -a etl.args
# Option 3: multiple -a options e.g. overriding one or more default args (file)
stetl -c etl.cfg -a etl.args -a "in_xml=input/amsterdam.xml"
<?xml version='1.0' encoding='utf-8'?>
<cities>
<city>
<name>Amsterdam</name>
<lat>52.4</lat>
<lon>4.9</lon>
</city>
</cities>
......@@ -22,24 +22,4 @@
</ogr:geometry>
</ogr:City>
</gml:featureMember>
<gml:featureMember>
<ogr:City>
<ogr:name>Bonn</ogr:name>
<ogr:geometry>
<gml:Point srsName="urn:ogc:def:crs:EPSG:4326">
<gml:coordinates>50.7,7.1</gml:coordinates>
</gml:Point>
</ogr:geometry>
</ogr:City>
</gml:featureMember>
<gml:featureMember>
<ogr:City>
<ogr:name>Rome</ogr:name>
<ogr:geometry>
<gml:Point srsName="urn:ogc:def:crs:EPSG:4326">
<gml:coordinates>41.9,12.5</gml:coordinates>
</gml:Point>
</ogr:geometry>
</ogr:City>
</gml:featureMember>
</ogr:FeatureCollection>
......@@ -6,5 +6,4 @@ verbosity = 3
[egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0
......@@ -38,9 +38,9 @@ with open('requirements-main.txt') as f:
setup(
name='Stetl',
version=version,
description="Stetl provides transformation for spatial data",
description="Transformation and conversion framework (ETL) mainly for geospatial data",
license='GNU GPL v3',
keywords='etl xsl gdal gis vector feature data',
keywords='etl xsl gdal gis vector feature data gml xml',
author='Just van den Broecke',
author_email='justb4@gmail.com',
maintainer='Just van den Broecke',
......@@ -56,7 +56,7 @@ setup(
tests_require=['nose'],
test_suite='nose.collector',
classifiers=[
'Development Status :: 4 - Beta',
'Development Status :: 5 - Production/Stable',
'Environment :: Console',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
......
......@@ -5,6 +5,8 @@
# Author: Just van den Broecke
#
import os
import sys
from time import time
from util import Util, ConfigSection
from packet import FORMAT
......@@ -122,6 +124,10 @@ class Component(object):
self.cfg_vals = dict()
self.next = None
self.section = section
self._max_time = -1
self._min_time = sys.maxint
self._total_time = 0
self._invoke_count = 0
# First assume single output provided by derived class
self._output_format = produces
......@@ -184,10 +190,14 @@ class Component(object):
# Current processor of packet
packet.component = self
start_time = self.timer_start()
self._invoke_count += 1
# Do something with the data
result = self.before_invoke(packet)
if result is False:
# Component indicates it does not want the chain to proceed
self.timer_stop(start_time)
return packet
# Do component-specific processing, e.g. read or write or filter
......@@ -196,8 +206,11 @@ class Component(object):
result = self.after_invoke(packet)
if result is False:
# Component indicates it does not want the chain to proceed
self.timer_stop(start_time)
return packet
self.timer_stop(start_time)
# If there is a next component, let it process
if self.next:
# Hand-over data (line, doc whatever) to the next component
......@@ -219,6 +232,17 @@ class Component(object):
# Notify all comps that we exit
self.exit()
# Simple performance stats in one line (issue #77)
# Calc average processing time, watch for 0 invoke-case
avg_time = 0.0
if self._invoke_count > 0:
avg_time = self._total_time / self._invoke_count
log.info("%s invokes=%d time(total, min, max, avg) = %.3f %.3f %.3f %.3f" %
(self.__class__.__name__, self._invoke_count,
self._total_time, self._min_time, self._max_time,
avg_time))
# If there is a next component, let it do its exit()
if self.next:
self.next.do_exit()
......@@ -258,3 +282,23 @@ class Component(object):
Allows derived Components to perform a one-time exit/cleanup.
"""
pass
def timer_start(self):
return time()
def timer_stop(self, start_time):
"""
Collect and calculate per-Component performance timing stats.
:param start_time:
:return:
"""
delta_time = time() - start_time
# Calc timing stats for Component invocation
self._total_time += delta_time
if delta_time > self._max_time:
self._max_time = delta_time
if delta_time < self._min_time and '%.3f' % delta_time != '0.000':
self._min_time = delta_time
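The statistics gathered by `timer_stop` can be tried out in isolation. This is a standalone Python 3 sketch (the diff itself targets Python 2, hence `sys.maxint` there): the `TimingStats` class is a hypothetical stand-in for the fields added to `Component`, `float('inf')` replaces `sys.maxint`, and the durations are hard-coded instead of measured.

```python
class TimingStats:
    """Hypothetical stand-in for the per-Component timing fields."""

    def __init__(self):
        self.max_time = -1.0
        self.min_time = float('inf')
        self.total_time = 0.0
        self.invoke_count = 0

    def record(self, delta_time):
        """Fold one invocation's duration (in seconds) into the stats."""
        self.invoke_count += 1
        self.total_time += delta_time
        if delta_time > self.max_time:
            self.max_time = delta_time
        # As in the diff, durations that print as 0.000 never become the minimum.
        if delta_time < self.min_time and '%.3f' % delta_time != '0.000':
            self.min_time = delta_time

    def average(self):
        """Average duration, guarding against the zero-invocation case."""
        if self.invoke_count == 0:
            return 0.0
        return self.total_time / self.invoke_count


stats = TimingStats()
for delta in (0.010, 0.0001, 0.030):
    stats.record(delta)
print("invokes=%d time(total, min, max, avg) = %.3f %.3f %.3f %.3f" % (
    stats.invoke_count, stats.total_time, stats.min_time,
    stats.max_time, stats.average()))
```

The second duration is too small to show up at three decimals, so it counts towards the total and the average but is skipped when tracking the minimum.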
......@@ -5,6 +5,7 @@
# Author: Just van den Broecke
#
import os
import re
import sys
from ConfigParser import ConfigParser
import version
......@@ -50,29 +51,86 @@ class ETL:
sys.path.append(ETL.CONFIG_DIR)
config_str = ''
try:
# Get config file as string
log.info("Reading config_file = %s" % config_file)
f = open(config_file, 'r')
config_str = f.read()
f.close()
except Exception as e:
log.error("Cannot read config file: err=%s" % str(e))
raise e
args_names = list()
try:
# Optional: expand symbolic arguments from args_dict and/or OS Env
# ignore errors here as { .. } may appear at random.
# Parse unique list of argument names from config file string.
# https://www.machinelearningplus.com/python/python-regex-tutorial-examples/
args_names = list(set(re.findall(r'{[A-Za-z]\w+}', config_str)))
args_names = [name.split('{')[1].split('}')[0] for name in args_names]
# Optional: expand from equivalent env vars
args_dict = self.env_expand_args_dict(args_dict, args_names)
# In general all arg names should be present in args dict
for args_name in args_names:
if args_name not in args_dict:
log.warn("Arg not found in args nor environment: name=%s" % args_name)
# raise Exception("name=%s" % args_name)
except Exception as e:
log.warn("Expanding config arguments (non fatal yet): %s" % str(e))
try:
if args_dict:
log.info("Substituting %d args in config file from args_dict: %s" % (len(args_dict), str(args_dict)))
# Get config file as string
f = open(config_file, 'r')
config_str = f.read()
f.close()
log.info("Substituting %d args in config file from args_dict: %s" % (len(args_names), str(args_names)))
# Do replacements see http://docs.python.org/2/library/string.html#formatstrings
# and render substituted config string
config_str = config_str.format(**args_dict)
log.info("Substituting args OK")
# Put Config string into buffer (readfp() needs a readline() method)
config_buf = StringIO.StringIO(config_str)
# Parse config from file buffer
self.configdict.readfp(config_buf, config_file)
else:
# Parse config file directly
self.configdict.read(config_file)
except Exception as e:
log.error("Error substituting config arguments: err=%s" % str(e))
raise e
try:
# Put Config string into buffer (readfp() needs a readline() method)
config_buf = StringIO.StringIO(config_str)
# Parse config from file buffer
self.configdict.readfp(config_buf, config_file)
except Exception as e:
log.error("Fatal Error reading config file: err=%s" % str(e))
log.error("Error populating config dict from config string: err=%s" % str(e))
raise e
def env_expand_args_dict(self, args_dict, args_names):
"""
Expand values in dict with equivalent values from the
OS Env. NB vars in OS Env should be prefixed with `STETL_` or `stetl_`
so as not to get overrides by accident.
:return: expanded args_dict or None
"""
env_dict = os.environ
for name in env_dict:
args_key = '_'.join(name.split('_')[1:])
if name.lower().startswith('stetl_') and args_key in args_names:
# Get real key, e.g. "STETL_HOST" becomes "HOST"
# "stetl_host" becomes "host".
args_value = env_dict[name]
if not args_dict:
args_dict = dict()
# Set: optionally override any existing value
args_dict[args_key] = args_value
log.info("Set/override from env var: %s" % name)
return args_dict
def run(self):
# The main ETL processing
......
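The substitution flow in this file (find `{name}` placeholders, overlay matching `STETL_`/`stetl_` environment variables, then call `str.format`) can be sketched as a standalone Python 3 function. This is a simplified illustration, not the method from the diff: `expand_config` is a hypothetical name, the regex is a tidied variant that also accepts single-letter names, the environment is passed in as a plain dict, and a missing argument raises immediately instead of being logged first.

```python
import re


def expand_config(config_str, args_dict, environ):
    """Return config_str with {name} placeholders substituted."""
    # Unique placeholder names that occur in the config text.
    names = set(re.findall(r'{([A-Za-z]\w*)}', config_str))
    merged = dict(args_dict or {})
    for env_name, value in environ.items():
        # "stetl_host" splits into prefix "stetl" and key "host".
        prefix, _, key = env_name.partition('_')
        if prefix.lower() == 'stetl' and key in names:
            # Environment values override values passed via args.
            merged[key] = value
    missing = sorted(names - set(merged))
    if missing:
        raise KeyError('Args not found in args nor environment: %s'
                       % ', '.join(missing))
    return config_str.format(**merged)


config = "[output_file]\nfile_path = {out_xml}\nhost = {host}\n"
expanded = expand_config(
    config,
    {'out_xml': 'output/gmlcities.gml', 'host': 'localhost'},
    {'stetl_host': 'db', 'PATH': '/bin'},
)
print(expanded)
```

Note that the key keeps its case after the prefix is stripped, so `STETL_HOST` would match a `{HOST}` placeholder but not `{host}`.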
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Executes the given command and returns the captured output.
#
# Author: Frank Steggink
#
import subprocess
import os
from stetl.component import Config
from stetl.filter import Filter
from stetl.util import Util
from stetl.packet import FORMAT
log = Util.get_log('execfilter')
class ExecFilter(Filter):
"""
Executes any command (abstract base class).
"""
@Config(ptype=str, default='', required=False)
def env_args(self):
"""
Provides a list of environment variables which will be used when executing the given command.
Example: env_args = pgpassword=postgres othersetting=value~with~spaces
"""
pass
@Config(ptype=str, default='=', required=False)
def env_separator(self):
"""
Provides the separator to split the environment variable names from their values.
"""
pass
def __init__(self, configdict, section, consumes, produces):
Filter.__init__(self, configdict, section, consumes, produces)
def invoke(self, packet):
return packet
def execute_cmd(self, cmd):
env_vars = Util.string_to_dict(self.env_args, self.env_separator)
old_environ = os.environ.copy()
try:
os.environ.update(env_vars)
log.info("executing cmd=%s" % cmd)
result = subprocess.check_output(cmd, shell=True)
log.info("execute done")
return result
finally:
os.environ = old_environ
class CommandExecFilter(ExecFilter):
"""
Executes an arbitrary command and captures the output
consumes=FORMAT.string, produces=FORMAT.string
"""
def __init__(self, configdict, section):
ExecFilter.__init__(self, configdict, section, consumes=FORMAT.string, produces=FORMAT.string)
def invoke(self, packet):
if packet.data is not None:
packet.data = self.execute_cmd(packet.data)
return packet
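`execute_cmd` above temporarily updates `os.environ` and restores it afterwards. An alternative that avoids touching the process-wide environment is to hand a merged copy to `subprocess` via its `env` parameter. This is a Python 3 sketch under assumptions: `string_to_dict` is a hypothetical re-implementation (the behaviour of `Util.string_to_dict` is not shown in the diff), and the command is given as an argument list rather than a shell string.

```python
import os
import subprocess
import sys


def string_to_dict(env_args, separator='='):
    """Parse 'NAME=value OTHER=value' into a dict (hypothetical helper)."""
    result = {}
    for item in env_args.split():
        if separator in item:
            name, value = item.split(separator, 1)
            result[name] = value
    return result


def execute_cmd(cmd, env_args='', env_separator='='):
    """Run cmd with extra environment variables and return its output."""
    # Work on a copy so the parent process environment stays unchanged.
    env = os.environ.copy()
    env.update(string_to_dict(env_args, env_separator))
    return subprocess.check_output(cmd, env=env)


output = execute_cmd(
    [sys.executable, '-c', 'import os; print(os.environ["PGPASSWORD"])'],
    env_args='PGPASSWORD=postgres',
)
print(output.decode().strip())
```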
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Extracts data from a string using a regular expression and generates a record.
#
# Author: Frank Steggink
from stetl.component import Config
from stetl.filter import Filter
from stetl.packet import FORMAT
from stetl.util import Util
import re
log = Util.get_log("regexfilter")
class RegexFilter(Filter):
"""
Extracts data from a string using a regular expression and returns the named groups as a record.
consumes=FORMAT.string, produces=FORMAT.record
"""
# Start attribute config meta
# Applying Decorator pattern with the Config class to provide
# read-only config values from the configured properties.
@Config(ptype=str, default=None, required=True)
def pattern_string(self):
"""
Regex pattern string. Should contain named groups.
"""
pass
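How named groups turn a string into a record can be shown with the standard `re` module. A minimal sketch, assuming a record is a plain dict mapping group names to matched text; the pattern and the `string_to_record` helper are hypothetical and not taken from `RegexFilter`.

```python
import re


def string_to_record(pattern_string, text):
    """Return the named groups of the first match as a dict, or None."""
    match = re.search(pattern_string, text)
    return match.groupdict() if match else None


# Each (?P<name>...) group becomes one field of the record.
pattern = r'(?P<city>\w+),(?P<lat>[\d.]+),(?P<lon>[\d.]+)'
record = string_to_record(pattern, 'amsterdam,52.4,4.9')
print(record)
```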