Commit 82324716 authored by Norbert Preining

New upstream version 0.7

language: python
python:
- "2.7"
- "3.4"
script: python ./setup.py install && python ./test.py
No-notice MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
[![Build Status](https://travis-ci.org/avakar/pycson.svg?branch=master)](https://travis-ci.org/avakar/pycson)
# pycson
A Python parser for CoffeeScript Object Notation (CSON).
## Installation
The parser is tested on Python 2.7 and 3.4.

```
pip install cson
```
The interface is the same as for the standard `json` package.

```python
>>> import cson
>>> cson.loads('a: 1')
{'a': 1}
>>> with open('file.cson', 'rb') as fin:
...     obj = cson.load(fin)
>>> obj
{'a': 1}
```
## The language
There is no formal definition of CSON, only an informal note in [one project][1]'s readme.
Informally, CSON is JSON with CoffeeScript syntax. Sadly, [CoffeeScript][2] has no
formal grammar either; it instead has a canonical implementation.
This means that bugs in the implementation translate into bugs in the language itself.
Worse, this particular implementation inserts a "rewriter" between the typical
lexer/parser pair, purportedly to make the grammar simpler. Unfortunately, it adds
weird corner cases to the language.
This parser does away with the corner cases,
in exchange for changing the semantics of documents in a few unlikely circumstances.
In other words, some documents may be parsed differently by the CoffeeScript parser and by pycson.
Here are some important highlights (see the [formal grammar][3] for details).
* String interpolations (`"#{test}"`) are allowed, but are treated literally.
* Whitespace is ignored in arrays and in objects enclosed in braces
(CoffeeScript requires consistent indentation).
* Unbraced objects greedily consume as many key/value pairs as they can.
* All lines in an unbraced object must have the same indentation. This is the only place
where whitespace is significant. There are no special provisions for partial dedents.
For two lines to have the same indent, their maximal sequences of leading spaces and tabs
must be the same (CoffeeScript only tracks the number of whitespace characters).
* Unbraced objects that don't start on their own line will never span multiple lines.
* Commas at the end of the line can always be removed without changing the output.
I believe the above rules make the parse unambiguous.
This example demonstrates the effect of indentation.

```
# An array containing a single element: an object with three keys.
[
  a: 1
  b: 2
  c: 3
]

# An array containing three elements: objects with one key.
[
   a: 1
  b: 2
 c: 3
]

# An array containing two objects, the first of which has one key.
[ a: 1
  b: 2
  c: 3 ]
```
Note that pycson can parse all JSON documents correctly (CoffeeScript can't because
of whitespace and string interpolations).
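
Plain JSON documents therefore parse as expected, and interpolations come back as literal text (a quick interpreter sketch, assuming the package is installed as `cson`):

```python
>>> import cson
>>> cson.loads('{"a": [1, 2, 3]}')
{'a': [1, 2, 3]}
>>> cson.loads('"#{test}"')
'#{test}'
```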
[1]: https://github.com/bevry/cson
[2]: http://coffeescript.org/
[3]: grammar.md
from .parser import load, loads
from .writer import dump, dumps
from speg import ParseError
from speg import peg

import re, sys

if sys.version_info[0] == 2:
    _chr = unichr
else:
    _chr = chr

def load(fin):
    return loads(fin.read())

def loads(s):
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    if s.startswith(u'\ufeff'):
        s = s[1:]
    return peg(s.replace('\r\n', '\n'), _p_root)

def _p_ws(p):
    p('[ \t]*')

def _p_nl(p):
    p(r'([ \t]*(?:#[^\n]*)?\r?\n)+')

def _p_ews(p):
    with p:
        p(_p_nl)
    p(_p_ws)

def _p_id(p):
    return p(r'[$a-zA-Z_][$0-9a-zA-Z_]*')

_escape_table = {
    'r': '\r',
    'n': '\n',
    't': '\t',
    'f': '\f',
    'b': '\b',
    }

def _p_unescape(p):
    esc = p('\\\\(?:u[0-9a-fA-F]{4}|[^\n])')
    if esc[1] == 'u':
        return _chr(int(esc[2:], 16))
    return _escape_table.get(esc[1:], esc[1:])

_re_indent = re.compile(r'[ \t]*')
# Triple-quoted (block) string; a common leading-whitespace prefix is stripped
# from all lines (see grammar.md).
def _p_block_str(p, c):
    p(r'{c}{c}{c}'.format(c=c))
    lines = [['']]
    with p:
        while True:
            s = p(r'(?:{c}(?!{c}{c})|[^{c}\\])*'.format(c=c))
            l = s.split('\n')
            lines[-1].append(l[0])
            lines.extend([x] for x in l[1:])
            if p(r'(?:\\\n[ \t]*)*'):
                continue
            p.commit()
            lines[-1].append(p(_p_unescape))
    p(r'{c}{c}{c}'.format(c=c))

    lines = [''.join(l) for l in lines]
    strip_ws = len(lines) > 1
    if strip_ws and all(c in ' \t' for c in lines[-1]):
        lines.pop()

    indent = None
    for line in lines[1:]:
        if not line:
            continue
        if indent is None:
            indent = _re_indent.match(line).group(0)
            continue
        for i, (c1, c2) in enumerate(zip(indent, line)):
            if c1 != c2:
                indent = indent[:i]
                break

    ind_len = len(indent or '')
    if strip_ws and all(c in ' \t' for c in lines[0]):
        lines = [line[ind_len:] for line in lines[1:]]
    else:
        lines[1:] = [line[ind_len:] for line in lines[1:]]
    return '\n'.join(lines)

_re_mstr_nl = re.compile(r'(?:[ \t]*\n)+[ \t]*')
_re_mstr_trailing_nl = re.compile(_re_mstr_nl.pattern + r'\Z')

# Single- or double-quoted string: newlines and the whitespace around them
# collapse into a single space.
def _p_multiline_str(p, c):
    p('{c}(?!{c}{c})(?:[ \t]*\n[ \t]*)?'.format(c=c))
    string_parts = []
    with p:
        while True:
            string_parts.append(p(r'[^{c}\\]*'.format(c=c)))
            if p(r'(?:\\\n[ \t]*)*'):
                string_parts.append('')
                continue
            p.commit()
            string_parts.append(p(_p_unescape))
    p(c)

    string_parts[-1] = _re_mstr_trailing_nl.sub('', string_parts[-1])
    string_parts[::2] = [_re_mstr_nl.sub(' ', part) for part in string_parts[::2]]
    return ''.join(string_parts)

def _p_string(p):
    with p:
        return p(_p_block_str, '"')
    with p:
        return p(_p_block_str, "'")
    with p:
        return p(_p_multiline_str, '"')
    return p(_p_multiline_str, "'")
def _p_array_value(p):
    with p:
        p(_p_nl)
        return p(_p_object)
    with p:
        p(_p_ws)
        return p(_p_line_object)
    p(_p_ews)
    return p(_p_simple_value)

def _p_key(p):
    with p:
        return p(_p_id)
    return p(_p_string)

def _p_flow_kv(p):
    k = p(_p_key)
    p(_p_ews)
    p(':')
    with p:
        p(_p_nl)
        return k, p(_p_object)
    with p:
        p(_p_ws)
        return k, p(_p_line_object)
    p(_p_ews)
    return k, p(_p_simple_value)

def _p_flow_obj_sep(p):
    with p:
        p(_p_ews)
        p(',')
        p(_p_ews)
        return
    p(_p_nl)
    p(_p_ws)

def _p_simple_value(p):
    with p:
        p('null')
        return None
    with p:
        p('false')
        return False
    with p:
        p('true')
        return True
    with p:
        return int(p('0b[01]+')[2:], 2)
    with p:
        return int(p('0o[0-7]+')[2:], 8)
    with p:
        return int(p('0x[0-9a-fA-F]+')[2:], 16)
    with p:
        return float(p(r'-?(?:[1-9][0-9]*|0)?\.[0-9]+(?:e[\+-]?[0-9]+)?|(?:[1-9][0-9]*|0)(?:\.[0-9]+)e[\+-]?[0-9]+'))
    with p:
        return int(p('-?[1-9][0-9]*|0'), 10)
    with p:
        return p(_p_string)
    with p:
        p(r'\[')
        r = []
        with p:
            p.set('I', '')
            r.append(p(_p_array_value))
            with p:
                while True:
                    with p:
                        p(_p_ews)
                        p(',')
                        rr = p(_p_array_value)
                    if not p:
                        p(_p_nl)
                        with p:
                            rr = p(_p_object)
                        if not p:
                            p(_p_ews)
                            rr = p(_p_simple_value)
                    r.append(rr)
                    p.commit()
            with p:
                p(_p_ews)
                p(',')
        p(_p_ews)
        p(r'\]')
        return r
    p(r'\{')
    r = {}
    p(_p_ews)
    with p:
        p.set('I', '')
        k, v = p(_p_flow_kv)
        r[k] = v
        with p:
            while True:
                p(_p_flow_obj_sep)
                k, v = p(_p_flow_kv)
                r[k] = v
                p.commit()
        p(_p_ews)
        with p:
            p(',')
            p(_p_ews)
    p(r'\}')
    return r
def _p_line_kv(p):
    k = p(_p_key)
    p(_p_ws)
    p(':')
    p(_p_ws)
    with p:
        p(_p_nl)
        p(p.get('I'))
        return k, p(_p_indented_object)
    with p:
        return k, p(_p_line_object)
    with p:
        return k, p(_p_simple_value)
    p(_p_nl)
    p(p.get('I'))
    p('[ \t]')
    p(_p_ws)
    return k, p(_p_simple_value)

# Unbraced key/value pairs on a single line, separated by commas.
def _p_line_object(p):
    k, v = p(_p_line_kv)
    r = { k: v }
    with p:
        while True:
            p(_p_ws)
            p(',')
            p(_p_ws)
            k, v = p(_p_line_kv)
            r[k] = v # uniqueness
            p.commit()
    return r

# Unbraced multi-line object: every line must start with the exact indent
# stored in the parser state 'I'.
def _p_object(p):
    p.set('I', p.get('I') + p('[ \t]*'))
    r = p(_p_line_object)
    with p:
        while True:
            p(_p_ws)
            with p:
                p(',')
            p(_p_nl)
            p(p.get('I'))
            rr = p(_p_line_object)
            r.update(rr) # uniqueness
            p.commit()
    return r

def _p_indented_object(p):
    p.set('I', p.get('I') + p('[ \t]'))
    return p(_p_object)

def _p_root(p):
    with p:
        p(_p_nl)
    with p:
        p.set('I', '')
        r = p(_p_object)
        p(_p_ws)
        with p:
            p(',')
    if not p:
        p(_p_ws)
        r = p(_p_simple_value)
    p(_p_ews)
    p(p.eof)
    return r
import re, json, sys

if sys.version_info[0] == 2:
    def _is_num(o):
        return isinstance(o, int) or isinstance(o, long) or isinstance(o, float)

    def _stringify(o):
        if isinstance(o, str):
            return unicode(o)
        if isinstance(o, unicode):
            return o
        return None
else:
    def _is_num(o):
        return isinstance(o, int) or isinstance(o, float)

    def _stringify(o):
        if isinstance(o, bytes):
            return o.decode()
        if isinstance(o, str):
            return o
        return None

_id_re = re.compile(r'[$a-zA-Z_][$0-9a-zA-Z_]*\Z')
class CSONEncoder:
    def __init__(self, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False,
            indent=None, default=None):
        self._skipkeys = skipkeys
        self._ensure_ascii = ensure_ascii
        self._allow_nan = allow_nan
        self._sort_keys = sort_keys
        self._indent = ' ' * (indent or 4)
        self._default = default
        if check_circular:
            self._obj_stack = set()
        else:
            self._obj_stack = None

    def _format_simple_val(self, o):
        if o is None:
            return 'null'
        if isinstance(o, bool):
            return 'true' if o else 'false'
        if _is_num(o):
            return str(o)
        s = _stringify(o)
        if s is not None:
            return self._escape_string(s)
        return None

    def _escape_string(self, s):
        r = json.dumps(s, ensure_ascii=self._ensure_ascii)
        return u"'{}'".format(r[1:-1].replace("'", r"\'"))

    def _escape_key(self, s):
        if s is None or isinstance(s, bool) or _is_num(s):
            s = str(s)
        s = _stringify(s)
        if s is None:
            if self._skipkeys:
                return None
            raise TypeError('keys must be a string')
        if not _id_re.match(s):
            return self._escape_string(s)
        return s

    def _push_obj(self, o):
        if self._obj_stack is not None:
            if id(o) in self._obj_stack:
                raise ValueError('Circular reference detected')
            self._obj_stack.add(id(o))

    def _pop_obj(self, o):
        if self._obj_stack is not None:
            self._obj_stack.remove(id(o))
    # Recursively yields chunks of the encoded document.
    def _encode(self, o, obj_val=False, indent='', force_flow=False):
        if isinstance(o, list):
            if not o:
                if obj_val:
                    yield ' []\n'
                else:
                    yield indent
                    yield '[]\n'
            else:
                if obj_val:
                    yield ' [\n'
                else:
                    yield indent
                    yield '[\n'
                indent = indent + self._indent
                self._push_obj(o)
                for v in o:
                    for chunk in self._encode(v, obj_val=False, indent=indent, force_flow=True):
                        yield chunk
                self._pop_obj(o)
                yield indent[:-len(self._indent)]
                yield ']\n'
        elif isinstance(o, dict):
            items = [(self._escape_key(k), v) for k, v in o.items()]
            if self._skipkeys:
                items = [(k, v) for k, v in items if k is not None]
            if self._sort_keys:
                items.sort()
            if force_flow or not items:
                if not items:
                    if obj_val:
                        yield ' {}\n'
                    else:
                        yield indent
                        yield '{}\n'
                else:
                    if obj_val:
                        yield ' {\n'
                    else:
                        yield indent
                        yield '{\n'
                    indent = indent + self._indent
                    self._push_obj(o)
                    for k, v in items:
                        yield indent
                        yield k
                        yield ':'
                        for chunk in self._encode(v, obj_val=True, indent=indent + self._indent, force_flow=False):
                            yield chunk
                    self._pop_obj(o)
                    yield indent[:-len(self._indent)]
                    yield '}\n'
            else:
                if obj_val:
                    yield '\n'
                self._push_obj(o)
                for k, v in items:
                    yield indent
                    yield k
                    yield ':'
                    for chunk in self._encode(v, obj_val=True, indent=indent + self._indent, force_flow=False):
                        yield chunk
                self._pop_obj(o)
        else:
            v = self._format_simple_val(o)
            if v is None:
                self._push_obj(o)
                v = self.default(o)
                for chunk in self._encode(v, obj_val=obj_val, indent=indent, force_flow=force_flow):
                    yield chunk
                self._pop_obj(o)
            else:
                if obj_val:
                    yield ' '
                else:
                    yield indent
                yield v
                yield '\n'

    def iterencode(self, o):
        return self._encode(o)

    def encode(self, o):
        return ''.join(self.iterencode(o))

    def default(self, o):
        if self._default is None:
            raise TypeError('Cannot serialize an object of type {}'.format(type(o).__name__))
        return self._default(o)
def dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None,
        indent=None, default=None, sort_keys=False, **kw):
    if indent is None and cls is None:
        return json.dump(obj, fp, skipkeys=skipkeys, ensure_ascii=ensure_ascii, check_circular=check_circular,
            allow_nan=allow_nan, default=default, sort_keys=sort_keys, separators=(',', ':'))
    if cls is None:
        cls = CSONEncoder
    encoder = cls(skipkeys=skipkeys, ensure_ascii=ensure_ascii, check_circular=check_circular,
        allow_nan=allow_nan, sort_keys=sort_keys, indent=indent, default=default, **kw)
    for chunk in encoder.iterencode(obj):
        fp.write(chunk)

def dumps(obj, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None,
        default=None, sort_keys=False, **kw):
    if indent is None and cls is None:
        return json.dumps(obj, skipkeys=skipkeys, ensure_ascii=ensure_ascii, check_circular=check_circular,
            allow_nan=allow_nan, default=default, sort_keys=sort_keys, separators=(',', ':'))
    if cls is None:
        cls = CSONEncoder
    encoder = cls(skipkeys=skipkeys, ensure_ascii=ensure_ascii, check_circular=check_circular,
        allow_nan=allow_nan, sort_keys=sort_keys, indent=indent, default=default, **kw)
    return encoder.encode(obj)
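
# Usage sketch for the encoder above (a rough example, assuming the package is
# importable as ``cson`` and relying on the default four-space indent):
#
#     >>> import cson
#     >>> cson.dumps({'a': {'b': 1}, 'c': 'x'})
#     '{"a":{"b":1},"c":"x"}'
#     >>> print(cson.dumps({'a': {'b': 1}, 'c': 'x'}, indent=4))
#     a:
#         b: 1
#     c: 'x'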
This is a formal grammar for the language parsed by pycson.
It uses the standard [PEG syntax][1] with an extension
to support indent sensitivity: for a PEG expression `E`,
the expression `E{I=e}` will change the meaning of the identifier `I`
to the expression `e` while matching `E`.
In CSON, whitespace may contain spaces and tabs. This is stricter than CoffeeScript,
where any `[^\n\S]` character will match.
The symbol `nl` matches one or more newlines that contain only whitespace
or comments in between. A match for `nl` also covers any whitespace preceding the first
newline. `ews` is the "extended whitespace", one that includes newlines.
Note, however, that `ews` ending in a comment must be terminated by a newline character.
```
ws <- [ \t]*
nl <- (ws ('#' [^\n]*)? '\r'? '\n')+
ews <- nl? ws
```
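
For example, comments may appear anywhere a `nl` is accepted (a quick interpreter sketch, assuming the package is installed as `cson`):

```python
>>> import cson
>>> cson.loads('a: 1  # first\nb: 2')
{'a': 1, 'b': 2}
```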
Atomic values of type `null` and `bool`.
```
null <- 'null'
bool <- 'false' / 'true'
```
Numbers can be decimal, binary, hexadecimal, or octal. Decimal numbers
must not have any leading zeros. The octal prefix is `0o`, and therefore
numbers like `0775` are not allowed (use `0o755` instead).
Hex digits are case-insensitive, but the `0x` prefix (and `0o` and `0b`)
must be lowercase. There is no way to make a non-decimal number negative.
```
number <- '0b'[01]+ / '0o'[0-7]+ / '0x'[0-9a-fA-F]+
    / '-'?([1-9][0-9]* / '0')?'.'[0-9]+('e'[+-]?[0-9]+)?
    / '-'?([1-9][0-9]* / '0')('.'[0-9]+)?('e'[+-]?[0-9]+)?
```
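
As a rough illustration, the rule above accepts values such as these (an interpreter sketch, assuming the `cson` package is installed):

```python
>>> import cson
>>> cson.loads('0b101'), cson.loads('0o755'), cson.loads('0xFF')
(5, 493, 255)
>>> cson.loads('-1.5e3'), cson.loads('.5')
(-1500.0, 0.5)
```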
Strings are delimited by one of `'`, `"`, `'''` or `"""`. There is no
difference between apostrophes and double quotes, since string interpolation
is treated literally. This means that `"#{var}"` is the same as `"\#{var}"`.
Even more importantly, `"#{"test"}"` is not a valid CSON string.
All escapes are treated literally, except for `r`, `n`, `t`, `f`, and `b`, which have
their usual meanings, and for an escaped newline character.
Escaping a newline character is equivalent to removing the newline and any following
whitespace.
Single-quoted strings treat newlines and any following whitespace as a single space.
Lines containing only whitespace are ignored. Leading and trailing whitespace is ignored.
For block strings (triple-quoted strings) that contain a newline, the first line is stripped
if it only contains whitespace. Similarly for the last line. Escaped newline is treated the
same way as for single-quoted strings (removes the newline and any following whitespace).
Once assembled, a maximal prefix of whitespace characters that occurs at the beginning
of each line is found and stripped from all lines.
```
string <-
    "'" !"''" string_tail{X="'"} /
    "'''" string_tail{X="'''"} /
    '"' !'""' string_tail{X='"'} /
    '"""' string_tail{X='"""'}

string_tail <- (!X ('\\'. / .))* X
```
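
The whitespace handling described above should play out roughly like this (an interpreter sketch; the `\n` escapes denote real newlines in the input):

```python
>>> import cson
>>> cson.loads("'hello\n      world'")
'hello world'
>>> cson.loads('"""\n  first\n  second\n"""')
'first\nsecond'
```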
Identifiers may be used instead of strings as keys in objects.
```
id <- [$a-zA-Z_][$0-9a-zA-Z_]*
```
Arrays are delimited by brackets. Whitespace is insignificant and the current indent level is reset.
```
array <-
    '[' (array_value (ews ',' array_value / nl (object / ews simple_value))* (ews ',')?)?{I=} ews ']'

array_value <- nl object / ws line_object / ews simple_value
```
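
In particular, the `nl` alternative means that commas between array elements on separate lines may be omitted (a brief sketch):

```python
>>> import cson
>>> cson.loads('[\n  1\n  2\n]')
[1, 2]
```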
The next rules match a brace-delimited object. The handling of whitespace is the same
as for arrays: the indent is reset.
```
flow_kv <- (id / string) ews ':'
    (nl object / ws line_object / ews simple_value)

flow_obj_sep <- ews ',' ews / nl ws

flow_object <- '{' ews (flow_kv (flow_obj_sep flow_kv)* ews (',' ews)?)?{I=} '}'
```
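
A braced object can therefore separate its pairs with commas, newlines, or both (a brief sketch):

```python
>>> import cson
>>> cson.loads('{a: 1, b: 2}')
{'a': 1, 'b': 2}
>>> cson.loads('{\n  a: 1\n  b: 2\n}')
{'a': 1, 'b': 2}
```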
A simple value is one which is not sensitive to the position within the document
or to the current indent level.
```
simple_value <- null / bool / number / string / array / flow_object
```
A line object is an unbraced object which doesn't start on its own line.
For example, in `[a:1, b:2]`, the array contains one `line_object`.
Note that a line object will never span multiple lines.
Line objects have no indent, but they propagate the current indent level to their
child objects.
```
line_kv <- (id / string) ws ':' ws
    (nl I indented_object / line_object / simple_value / nl I [ \t] ws simple_value)

line_object <- line_kv (ws ',' ws line_kv)*
```
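
The `[a:1, b:2]` example mentioned above should therefore parse as a single-element array (a quick sketch):

```python
>>> import cson
>>> cson.loads('[a: 1, b: 2]')
[{'a': 1, 'b': 2}]
```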
This is the unbraced object that starts on its own line. It detects its indent level
and requires that all lines have this indent. The previous indent level must be
a string prefix of the newly detected one.
```
object <- ' ' object{I=I ' '} / '\t' object{I=I '\t'} / line_object (ws ','? nl I line_object)*

indented_object <- ' ' object{I=I ' '} / '\t' object{I=I '\t'}
```
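
For instance, a key whose value starts on the next, more indented line produces a nested object (a brief sketch):

```python
>>> import cson
>>> cson.loads('a:\n  b: 1\n  c: 2')
{'a': {'b': 1, 'c': 2}}
```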
A CSON document consists of a single value (either an unbraced `object` or a `simple_value`).
The value can be preceded and followed by whitespace. Note that a comment
on the last line must be terminated by a newline.
```
root <- nl? (object{I=} ws ','? / ws simple_value) ews !.
```
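
Putting it together, a small document with a leading comment and a top-level unbraced object should parse like this (a rough sketch):

```python
>>> import cson
>>> cson.loads('# config\nname: "demo"\nvalues: [1, 2]\n')
{'name': 'demo', 'values': [1, 2]}
```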
[1]: http://www.brynosaurus.com/pub/lang/peg.pdf
#!/usr/bin/env python
# coding: utf-8
from setuptools import setup

setup(
    name='cson',
    version='0.7',
    description='A parser for Coffeescript Object Notation (CSON)',
    author='Martin Vejnár',
    author_email='vejnar.martin@gmail.com',
    url='https://github.com/avakar/pycson',
    license='MIT',
    packages=['cson'],
    install_requires=['speg'],
    )
import sys, os, os.path, json
sys.path.insert(0, os.path.join(os.path.split(__file__)[0], 'cson'))