## Version 0.3.4 (2019/12/18)
### Issues Closed
* [Issue 18](https://github.com/pytroll/trollsift/issues/18) - Different parsing alignment behaviour between 0.2.* and 0.3.* ([PR 19](https://github.com/pytroll/trollsift/pull/19))
In this release 1 issue was closed.
### Pull Requests Merged
#### Bugs fixed
* [PR 19](https://github.com/pytroll/trollsift/pull/19) - Fix regex parser being too greedy with partial string patterns ([18](https://github.com/pytroll/trollsift/issues/18))
In this release 1 pull request was closed.
## Version 0.3.3 (2019/10/09)
### Pull Requests Merged
......
trollsift (0.3.4-1) unstable; urgency=medium
* New upstream release.
-- Antonio Valentino <antonio.valentino@tiscali.it> Sun, 22 Dec 2019 08:11:07 +0000
trollsift (0.3.3-1) unstable; urgency=medium
[ Bas Couwenberg ]
......
......@@ -25,7 +25,8 @@ sys.path.insert(0, os.path.abspath('../../'))
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest']
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.intersphinx',
'sphinx.ext.napoleon', 'sphinx.ext.viewcode']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
......@@ -241,3 +242,8 @@ texinfo_documents = [
# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'
# How intersphinx should find links to other packages
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
}
......@@ -16,7 +16,7 @@ for writing higher level applications and api's for satellite batch processing.
The source code of the package can be found at github_.
.. _github: https://github.com/pnuu/trollsift
.. _github: https://github.com/pytroll/trollsift
Contents
+++++++++
......
......@@ -9,7 +9,7 @@ Installation
You can download the trollsift source code from github::
$ git clone https://github.com/pnuu/trollsift.git
$ git clone https://github.com/pytroll/trollsift.git
and then run::
......
.. _string-format: https://docs.python.org/2/library/string.html#format-string-syntax
Usage
-----
=====
Trollsift includes a collection of modules that assist with formatting, parsing and filtering satellite granule file names. These modules are useful and necessary for writing higher-level applications and APIs for satellite batch processing. Currently we are implementing the string parsing and composing functionality. Watch this space for further modules for various types of filtering of satellite data granules.
Parser
++++++++++
The trollsift string parser module is useful for composing (formatting) and parsing strings
compatible with the Python string-format_ style. In satellite data file name filtering,
------
The trollsift string parser module is useful for composing (formatting) and parsing strings
compatible with the Python :ref:`python:formatstrings`. In satellite data file name filtering,
the library is useful for extracting typical information from granule filenames, such
as observation time, platform and instrument names. The trollsift Parser can also
verify that the string formatting is invertible, i.e. specific enough to ensure that
......@@ -16,19 +14,32 @@ parsing and composing of strings are bijective mappings ( aka one-to-one corresp
which may be essential for some applications, such as predicting granule
parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Parser object holds a format string, allowing us to parse and compose strings,
^^^^^^^
The Parser object holds a format string, allowing us to parse and compose strings:
>>> from trollsift import Parser
>>>
>>> p = Parser("/somedir/{directory}/hrpt_{platform:4s}{platnum:2s}_{time:%Y%m%d_%H%M}_{orbit:05d}.l1b")
>>> data = p.parse("/somedir/otherdir/hrpt_noaa16_20140210_1004_69022.l1b")
>>> print data
>>> print(data)
{'directory': 'otherdir', 'platform': 'noaa', 'platnum': '16',
'time': datetime.datetime(2014, 2, 10, 10, 4), 'orbit': 69022}
Parsing in trollsift is not "greedy". This means that in the case of ambiguous
patterns it will match the shortest portion of the string possible. For example:
>>> from trollsift import Parser
>>>
>>> p = Parser("{field_one}_{field_two}")
>>> data = p.parse("abc_def_ghi")
>>> print(data)
{'field_one': 'abc', 'field_two': 'def_ghi'}
So even though the first field could have matched to "abc_def", the non-greedy
parsing chose the shorter possible match of "abc".
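The non-greedy behaviour can be reproduced with plain Python ``re``. The following is a minimal sketch of the kind of regex the parser builds internally; the exact patterns trollsift emits may differ:

```python
import re

# Illustrative regexes for the pattern "{field_one}_{field_two}".
# Greedy ".*" grabs as much as possible and backtracks to the LAST "_";
# lazy ".*?" stops at the FIRST "_" that lets the rest of the pattern match.
greedy = re.match(r'(?P<field_one>.*)_(?P<field_two>.*)', 'abc_def_ghi')
lazy = re.match(r'(?P<field_one>.*?)_(?P<field_two>.*?)$', 'abc_def_ghi')

print(greedy.group('field_one'))  # abc_def
print(lazy.group('field_one'))    # abc
print(lazy.group('field_two'))    # def_ghi
```

Note the ``$`` anchor on the lazy variant: without something forcing the match to the end of the string, a lazy trailing field would match as little as possible.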
composing
^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^
The reverse operation is called 'compose', and is equivalent to the Python string
class format method. Here we change the time stamp of the data, and write out
a new file name,
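The compose call itself is elided by the diff. Since compose is equivalent to ``str.format``, a minimal stdlib sketch of the same round trip (pattern and field values mirror the parse example above; the new time is chosen arbitrarily):

```python
from datetime import datetime

# str.format stands in here for Parser.compose on the same pattern.
fmt = "/somedir/{directory}/hrpt_{platform:4s}{platnum:2s}_{time:%Y%m%d_%H%M}_{orbit:05d}.l1b"
data = {"directory": "otherdir", "platform": "noaa", "platnum": "16",
        "time": datetime(2014, 2, 10, 12, 0), "orbit": 69022}
print(fmt.format(**data))
# /somedir/otherdir/hrpt_noaa16_20140210_1200_69022.l1b
```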
......@@ -48,7 +59,7 @@ provides extra conversion options such as making all characters lowercase:
For all of the options see :class:`~trollsift.parser.StringFormatter`.
standalone parse and compose
+++++++++++++++++++++++++++++++++++++++++
----------------------------
The parse and compose methods also exist as standalone functions;
depending on your requirements you can call,
......
......@@ -171,6 +171,18 @@ class RegexFormatter(string.Formatter):
>>> regex_formatter.extract_values('{field_one:5d}_{field_two}', '12345_sometext')
{'field_one': '12345', 'field_two': 'sometext'}
Note that the regular expressions generated by this class are specially
generated to reduce "greediness" of the matches found. For ambiguous
patterns where a single field could match shorter or longer portions of
the provided string, this class will prefer the shorter version of the
string in order to make the rest of the pattern match. For example:
>>> regex_formatter.extract_values('{field_one}_{field_two}', 'abc_def_ghi')
{'field_one': 'abc', 'field_two': 'def_ghi'}
Note how `field_one` could have matched "abc_def", but the lower
greediness of this parser caused it to only match against "abc".
"""
# special string to mark a parameter not being specified
......@@ -255,7 +267,7 @@ class RegexFormatter(string.Formatter):
raise ValueError("Invalid format specification: '{}'".format(format_spec))
final_regex = char_type
if ftype in allow_multiple and (not width or width == '0'):
final_regex += r'*'
final_regex += r'*?'
elif width and width != '0':
if not fill:
# we know we have exactly this many characters
......@@ -266,7 +278,7 @@ class RegexFormatter(string.Formatter):
# later during type conversion.
final_regex = r'.{{{}}}'.format(int(width))
elif ftype in allow_multiple:
final_regex += r'*'
final_regex += r'*?'
return r'(?P<{}>{})'.format(field_name, final_regex)
......@@ -284,7 +296,7 @@ class RegexFormatter(string.Formatter):
# Replace format spec with glob patterns (*, ?, etc)
if not format_spec:
return r'(?P<{}>.*)'.format(field_name)
return r'(?P<{}>.*?)'.format(field_name)
if '%' in format_spec:
return r'(?P<{}>{})'.format(field_name, self._regex_datetime(format_spec))
return self.format_spec_to_regex(field_name, format_spec)
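The ``*`` to ``*?`` change above matters because the character classes used for string fields (e.g. ``\w``) also match underscores, so a greedy quantifier can run past a literal separator in the pattern. A stand-in sketch with plain ``re`` (not the exact regexes ``format_spec_to_regex`` emits), using a shortened form of the filename from GH #18:

```python
import re

fname = 'Amplitude_VH_db_rest'
# "\w" matches underscores too, so a greedy "\w*" field swallows past
# the literal separators and backtracks only to the LAST "_":
greedy = re.match(r'(?P<band_type>\w*)_(?P<rest>.*)', fname)
# The non-greedy "\w*?" stops at the FIRST "_" that lets the rest match:
lazy = re.match(r'(?P<band_type>\w*?)_(?P<rest>.*)', fname)

print(greedy.group('band_type'))  # Amplitude_VH_db
print(lazy.group('band_type'))    # Amplitude
```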
......
......@@ -300,6 +300,27 @@ class TestParser(unittest.TestCase):
self.assertRaises(ValueError, compose, "{a!X}", key_vals)
self.assertEqual(new_str, 'this Is A-Test b_test c test')
def test_greediness(self):
"""Test that the minimum match is parsed out.
See GH #18.
"""
from trollsift import parse
template = '{band_type}_{polarization_extracted}_{unit}_{s1_fname}'
fname = 'Amplitude_VH_db_S1A_IW_GRDH_1SDV_20160528T171628_20160528T171653_011462_011752_0EED.tif'
res_dict = parse(template, fname)
exp = {
'band_type': 'Amplitude',
'polarization_extracted': 'VH',
'unit': 'db',
's1_fname': 'S1A_IW_GRDH_1SDV_20160528T171628_20160528T171653_011462_011752_0EED.tif',
}
self.assertEqual(exp, res_dict)
template = '{band_type:s}_{polarization_extracted}_{unit}_{s1_fname}'
res_dict = parse(template, fname)
self.assertEqual(exp, res_dict)
def suite():
"""The suite for test_parser
......
......@@ -23,9 +23,9 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
git_refnames = " (HEAD -> master, tag: v0.3.3)"
git_full = "e0a82d62b317df5f62eb2532480ef110f2fe3b16"
git_date = "2019-10-09 08:55:57 -0500"
git_refnames = " (HEAD -> master, tag: v0.3.4)"
git_full = "c37ad63f1ab71d3b0ae75a3c04d4a24bc6b2e899"
git_date = "2019-12-18 07:46:35 -0600"
keywords = {"refnames": git_refnames, "full": git_full, "date": git_date}
return keywords
......