Improve JSON schema draft 4 support by krismolendyke · Pull Request #3 · horejsek/python-fastjsonschema

krismolendyke · 2018-03-17T00:12:31Z

Thanks for creating a Python JSON schema validation library that isn't slow!

Aside from the good performance, and the differences noted in the documentation, I wanted to know exactly how well this library validated against the official JSON schema (draft 4) test suite.

To do that I have made the following changes:

Add https://github.com/json-schema-org/JSON-Schema-Test-Suite submodule
Add test suite runner script
Add test suite runner configuration specifying a subset of the draft 4 test suite
Update code generator to pass tests specified in configuration

To support those changes:

Add extras_require to setup.py with test requirements
Update Makefile to use pip to install test requirements
Fix performance.py script to properly dedent code for timeit run
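The performance.py change itself is not shown in this thread. As a minimal sketch of the general technique, assuming the snippet to be timed is written as an indented triple-quoted string, `textwrap.dedent` can strip the common leading whitespace before the source is handed to `timeit`. The helper name `time_snippet` and the sample snippet are hypothetical.

```python
import textwrap
import timeit


def time_snippet(code, number=1000):
    """Dedent a source snippet and time it with timeit.

    `code` may be indented to match the surrounding source file;
    dedent() removes the common leading whitespace so the statement
    starts at column zero before timeit compiles it.
    """
    return timeit.timeit(textwrap.dedent(code), number=number)


# Hypothetical snippet, indented the way it might sit inside a function.
SNIPPET = """
    total = 0
    for i in range(10):
        total += i
"""

elapsed = time_snippet(SNIPPET, number=100)
```

`elapsed` is the total time in seconds for `number` runs, so it should only be compared against runs that use the same `number`.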

To nitpick ;p :

Fix a typographical error, spcified → specified

All current make test target tests still pass via pytest.

This notably skips these draft 4 tests via configuration:

definitions.json
dependencies.json
optional/bignum.json
optional/ecmascript-regex.json
optional/format.json
optional/zeroTerminatedFloats.json
ref.json
refRemote.json
uniqueItems.json

uniqueItems.json partially passes but fails on nested objects. Passing that test will come at a cost to performance.

Example test suite execution:

$ ./json_schema_test_suite.py json-schema-test-suite-draft-4.conf
✔ additionalItems.json
✔ additionalProperties.json
✔ allOf.json
✔ anyOf.json
⛔ bignum.json
✔ default.json
⛔ definitions.json
⛔ dependencies.json
⛔ ecmascript-regex.json
✔ enum.json
⛔ format.json
✔ items.json
✔ maxItems.json
✔ maxLength.json
✔ maxProperties.json
✔ maximum.json
✔ minItems.json
✔ minLength.json
✔ minProperties.json
✔ minimum.json
✔ multipleOf.json
✔ not.json
✔ oneOf.json
✔ pattern.json
✔ patternProperties.json
✔ properties.json
⛔ ref.json
⛔ refRemote.json
✔ required.json
✔ type.json
⛔ uniqueItems.json
⛔ zeroTerminatedFloats.json

Schema exceptions:

⛔ ref.json: expected an indented block (<string>, line 29): 'return data'
⛔ refRemote.json: expected an indented block (<string>, line 11): 'return data'

Summary of 358 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    0   0.0%
⚠ UNDEFINED         0   0.0%
                    0   0.0%

Passes:

✔ TRUE_POSITIVE   125  52.7%
✔ TRUE_NEGATIVE   112  47.3%
                  237 100.0%

⛔ Ignored:        121
Coverage:     237/358  66.2%

I may have survived...

#    ___
#    \./     DANGER: This module implements some code generation
# .--.O.--.          techniques involving string concatenation.
#  \/   \/           If you look at it, you might die.
#

horejsek

Great job!

horejsek · 2018-03-17T08:23:44Z

fastjsonschema/generator.py

+            self.l('return {variable}')
        self._compile_regexps['{}_re'.format(self._variable)] = re.compile(self._definition['pattern'])
-        with self.l('if not {variable}_re.match({variable}):'):
+        with self.l('if not {variable}_re.search({variable}):'):


Why this change? :-)

When defining the regular expressions, it’s important to note that the string is considered valid if the expression matches anywhere within the string.

– https://spacetelescope.github.io/understanding-json-schema/reference/string.html#index-2

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string

– http://devdocs.io/python~3.6/library/re#search-vs-match

I see! I misinterpreted the validity of regexp in JSON schema. Thanks for the fix!
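The difference quoted above can be reproduced with the standard `re` module alone. This is a small illustration, not code from the patch; the pattern and the sample value are made up.

```python
import re

# A pattern as it might appear in a schema: {"type": "string", "pattern": "b+c"}
pattern = re.compile(r"b+c")

value = "aabbbcdd"

# re.match only tries to match at the start of the string.
anchored = pattern.match(value)

# re.search scans the whole string, which is the behaviour the
# JSON schema "pattern" keyword describes.
anywhere = pattern.search(value)

print(anchored)          # None
print(anywhere.group())  # bbbc
```

A schema author who wants the anchored behaviour can still get it by writing `^` and `$` in the pattern itself.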

horejsek · 2018-03-17T08:30:16Z

fastjsonschema/generator.py

+        with self.l('try:'):
+            self.l('{variable}.keys()')
+        with self.l('except AttributeError:'):
+            self.l('return {variable}')


Wouldn't a simple condition be faster than a try-except block?

That's likely. I will refactor any more of these blocks to isinstance checks to align with the rest of the code. My mistake.

No worry, great, thanks!
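For comparison, the two guard styles discussed here can be written side by side. This is a sketch with hypothetical helper names, not the generated code itself, and the `compare` helper only measures the non-dict path on whatever machine runs it, so no timing result is claimed.

```python
import timeit


def keys_via_try(value):
    """Guard with try/except, as in the first version of the patch."""
    try:
        return set(value.keys())
    except AttributeError:
        return None


def keys_via_isinstance(value):
    """Guard with an explicit type check, as suggested in the review."""
    if not isinstance(value, dict):
        return None
    return set(value.keys())


def compare(number=100000):
    """Return (try_seconds, isinstance_seconds) for a non-dict input."""
    value = [1, 2, 3]
    t_try = timeit.timeit(lambda: keys_via_try(value), number=number)
    t_isinstance = timeit.timeit(lambda: keys_via_isinstance(value), number=number)
    return t_try, t_isinstance
```

The two guards are not strictly equivalent: the try/except version accepts any object with a `keys()` method, while the `isinstance` version accepts only `dict` and its subclasses.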

horejsek · 2018-03-17T08:46:45Z

fastjsonschema/generator.py

+                for pattern in self._definition['patternProperties'].keys():
+                    with self.l('if globals()["{}_re"].search(key):', pattern):
+                        self.l('pattern_keys.add(key)')
+            self.l('{variable}_keys -= pattern_keys')


I don't understand yet why this block is here and why the later methods are not enough.

I don't, either 😉. This is due to the logical demands of this test in the draft 4 spec that enforces the interaction between properties, patternProperties, and additionalProperties. There may be a more efficient way to pass this test.

:-) Yep, all those properties are funny ones. I will look at it in more detail on Monday.
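The interaction being discussed can be stated without any code generation. The sketch below is plain Python, not the library's generated validator, and `classify_keys` is a hypothetical helper. It follows the draft 4 rule that a key counts as "additional" only if it is neither listed in properties nor matched by any patternProperties regex, and that a key can be covered by both properties and patternProperties at once.

```python
import re


def classify_keys(schema, data):
    """Split the keys of `data` into named, pattern-matched and additional."""
    properties = schema.get("properties", {})
    patterns = [re.compile(p) for p in schema.get("patternProperties", {})]

    named, matched, additional = set(), set(), set()
    for key in data:
        is_named = key in properties
        is_matched = any(p.search(key) for p in patterns)
        if is_named:
            named.add(key)
        if is_matched:
            matched.add(key)
        # Only keys covered by neither keyword fall under additionalProperties.
        if not is_named and not is_matched:
            additional.add(key)
    return named, matched, additional


schema = {
    "properties": {"foo": {"type": "array"}, "bar": {"type": "array"}},
    "patternProperties": {"f.o": {"minItems": 2}},
    "additionalProperties": {"type": "integer"},
}
data = {"foo": [1, 2], "fxo": [1, 2], "bar": [], "baz": 3}

named, matched, additional = classify_keys(schema, data)
```

Here `foo` lands in both the named and the pattern-matched group, so it must satisfy both subschemas, and only `baz` is checked against additionalProperties.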

horejsek · 2018-03-19T12:00:32Z

fastjsonschema/generator.py

+        with self.l('if not isinstance({variable}, dict):'):
+            self.l('return {variable}')
+        for pattern, definition in self._definition['patternProperties'].items():
+            self._compile_regexps['{}_re'.format(pattern)] = re.compile(pattern)


This needs a better key, because generate_pattern uses {variable}_re and the two can clash. Or maybe keep this one and change the first one to {variable}_re_var, as that can never conflict ({variable}_var_re or var_{variable}_re could, since a regexp could be x_var or var_x).

I was wondering why you used globals(), and then I figured out that a pattern is not a valid Python name. I think a variable name doesn't have to be a valid Python name either. Could you change it and document why it is that way? This is very important information for anyone else who looks here and wants to make a change.

Or, even better, don't use globals() at all and create unique names for the regular expressions. They could be named r1, r2, r3, ... So during generation, record that f.o is r1 if it's the first one and use that name in the code. I'm not sure how expensive it is to look into globals() in every iteration. The regular expressions have to be compiled during generation, as they are now, and saved as globals, but the start of the function could fetch them from globals to make it even faster. I'm thinking about this because we now generate code like this:

...
for key, val in data.items():
    if globals()["f.o_re"].search(key):
        ...

But maybe it's not a very big performance improvement.

I tried the non-globals() index-based approach first and I worried about ordering during key iteration so I refactored to this may-based approach. Ordering shouldn't be an issue if the dict is never mutated but I didn't know how to be sure about that beyond converting the schema definition to an OrderedDict and that felt like too much change.

I don't love polluting globals but this seemed like the lesser of evils. A local solution would be faster but I haven't profiled the difference.
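One way the "unique names" idea could look is sketched below. It is an illustration under assumptions, not the generator's code: `build_matcher` and the `r1`, `r2`, ... naming are hypothetical. Each pattern is assigned its generated name once, in a mapping, so the generated source never depends on the iteration order of the schema dict and never needs a `globals()` lookup with a non-identifier key.

```python
import re


def build_matcher(patterns):
    """Generate a function returning the keys matched by any pattern."""
    # Assign each pattern a name that is a valid Python identifier.
    names = {p: "r{}".format(i) for i, p in enumerate(patterns, start=1)}
    # The compiled regexes live in the namespace the code is exec'd in.
    namespace = {name: re.compile(p) for p, name in names.items()}

    lines = [
        "def matching_keys(data):",
        "    found = set()",
        "    for key in data:",
    ]
    if not patterns:
        lines.append("        pass")
    for p in patterns:
        lines.append("        if {}.search(key):".format(names[p]))
        lines.append("            found.add(key)")
    lines.append("    return found")

    exec("\n".join(lines), namespace)
    return namespace["matching_keys"]


matching_keys = build_matcher(["f.o", "^x-"])
result = matching_keys({"foo": 1, "x-a": 2, "bar": 3})
```

Whether this is measurably faster than the `globals()` lookup has not been profiled here, as noted in the comment above.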

horejsek · 2018-03-19T12:58:25Z

fastjsonschema/generator.py

+                for pattern in self._definition['patternProperties'].keys():
+                    with self.l('if globals()["{}_re"].search(key):', pattern):
+                        self.l('pattern_keys.add(key)')
+            self.l('{variable}_keys -= pattern_keys')


This code shouldn't be needed. Look how properties work. There is this code for every found property:

self.l('{variable}_keys.remove("{}")', key)

So if you already handle this in generate_pattern_properties, there is no need to iterate over them twice.

You're correct. I may have implemented these "out of order" based on how the JSON schema test suite executes. We should be able to simply return here. Good catch.

horejsek · 2018-03-19T12:59:14Z

fastjsonschema/generator.py

+    def generate_additional_properties(self):
+        with self.l('if not isinstance({variable}, dict):'):
+            self.l('return {variable}')
+        self.l('{variable}_keys = set({variable}.keys())')


This is already created by the object type at the beginning. No need to do it again.

I understand your comments, but I think this is required due to this test:

$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json
✔ additionalProperties.json
⚠ additionalProperties can exist by itself
⚠ UNDEFINED NameError an additional valid property is valid: name 'data_keys' is not defined
⚠ UNDEFINED NameError an additional invalid property is invalid: name 'data_keys' is not defined

Summary of 12 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    0   0.0%
⚠ UNDEFINED         2  16.7%
                    2  16.7%

Passes:

✔ TRUE_POSITIVE     8  66.7%
✔ TRUE_NEGATIVE     2  16.7%
                   10  83.3%

⛔ Ignored:          0
Coverage:       12/12 100.0%

Which is a pretty odd case, but is apparently valid according to the spec.

This is very probably connected to the previous comment, where keys should be managed once. Maybe the test doesn't pass because the generated code is not as simple as it needs to be.

horejsek · 2018-03-19T13:00:44Z

fastjsonschema/generator.py

+
+    def generate_additional_properties(self):
+        with self.l('if not isinstance({variable}, dict):'):
+            self.l('return {variable}')


This is already done by the object type at the beginning. No need to do it again.

horejsek · 2018-03-19T13:01:07Z

fastjsonschema/generator.py

+
+    def generate_pattern_properties(self):
+        with self.l('if not isinstance({variable}, dict):'):
+            self.l('return {variable}')


This is already created by the object type at the beginning. No need to do it again.

horejsek · 2018-03-19T13:06:41Z

fastjsonschema/generator.py

+        add_prop_definition = self._definition["additionalProperties"]
+        if add_prop_definition:
+            with self.l('for {variable}_key in {variable}_keys:'):
+                with self.l('if {variable}_key not in "{}":', self._definition.get('properties', [])):


This generates wrong code:

if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":

I think this condition shouldn't even be here, as {variable}_keys already contains only the additional properties.

You are right, I meant to just check if {variable}_key was not in the properties keys. I'll fix that.

I do think we need this check though for this test case:

$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json
✘ additionalProperties.json
✘ additionalProperties allows a schema which should validate
✘ FALSE_NEGATIVE JsonSchemaException no additional properties is valid: data.foo must be boolean
✘ FALSE_NEGATIVE JsonSchemaException an additional valid property is valid: data.bar must be boolean

Summary of 2 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    2 100.0%
⚠ UNDEFINED         0   0.0%
                    2 100.0%

Passes:

✔ TRUE_POSITIVE     0   0.0%
✔ TRUE_NEGATIVE     0   0.0%
                    0   0.0%

⛔ Ignored:          0
Coverage:         2/2 100.0%
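The wrong condition flagged in this thread comes from formatting the properties dict into a quoted string, which turns `in` into a substring test on the dict's repr. The following standalone illustration uses the properties from the reviewed example; the probe keys are made up.

```python
properties = {"foo": {"type": "array", "maxItems": 3}, "bar": {"type": "array"}}

# What the generated condition tested against: the dict rendered as text.
as_text = "{}".format(properties)

# What was intended: the set of declared property names.
as_keys = set(properties)

# Compare substring membership with key membership for a few keys.
results = {
    key: (key in as_text, key in as_keys)
    for key in ("foo", "type", "a", "baz")
}

for key, (in_text, in_keys) in results.items():
    print(key, in_text, in_keys)
# foo True True
# type True False
# a True False
# baz False False
```

Keys such as `type` or `a` are wrongly treated as declared properties by the substring test, so they would skip the additionalProperties check.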

horejsek · 2018-03-19T13:15:57Z

I checked the code in detail. There are many issues with patternProps and additionalProps. For example, I looked at the code generated for this JSON schema:

{'properties': {'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}, 'patternProperties': {'f.o': {'minItems': 2}}, 'additionalProperties': {'type': 'integer'}}

Solution you provided:

def func(data):
    NoneType = type(None)
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    if "foo" in data_keys:
        data_keys.remove("foo")
        data_foo = data["foo"]
        if not isinstance(data_foo, (list)):
            raise JsonSchemaException("data.foo must be array")
        if not isinstance(data_foo, list):
            return data_foo
        data_foo_len = len(data_foo)
        if data_foo_len > 3:
            raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
    if "bar" in data_keys:
        data_keys.remove("bar")
        data_bar = data["bar"]
        if not isinstance(data_bar, (list)):
            raise JsonSchemaException("data.bar must be array")
    if not isinstance(data, dict):
        return data

    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if not isinstance(val, list):
                return val
            val_len = len(val)
            if val_len < 2:
                raise JsonSchemaException(""+"data.patternProperties.{key}".format(**locals())+" must contain at least 2 items")
    pattern_keys = set()
    for key in data_keys:
        if globals()["f.o_re"].search(key):
            pattern_keys.add(key)
    data_keys -= pattern_keys
    for data_key in data_keys:
        data_value = data.get(data_key)
        if not isinstance(data_value, (int)) or isinstance(data_value, bool):
            raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    if not isinstance(data, dict):
        return data
    
    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if not isinstance(val, list):
                return val
                
            if val_len < 2:
                raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    pattern_keys = set()
    for key in data_keys:
        if globals()["f.o_re"].search(key):
            pattern_keys.add(key)
    data_keys -= pattern_keys
    for data_key in data_keys:
        if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":
            data_value = data.get(data_key)
            if not isinstance(data_value, (int)) or isinstance(data_value, bool):
                raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    return data

This is the code that should be correct, if I didn't make any mistake; it looks good to me:

def func(data):
    NoneType = type(None)
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    if "foo" in data_keys:
        data_keys.remove("foo")
        data_foo = data["foo"]
        if not isinstance(data_foo, (list)):
            raise JsonSchemaException("data.foo must be array")
        if not isinstance(data_foo, list):
            return data_foo
        data_foo_len = len(data_foo)
        if data_foo_len > 3:
            raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
    if "bar" in data_keys:
        data_keys.remove("bar")
        data_bar = data["bar"]
        if not isinstance(data_bar, (list)):
            raise JsonSchemaException("data.bar must be array")
    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if key in data_keys:
                data_keys.remove(key)
            if not isinstance(val, list):
                return val
            val_len = len(val)
            if val_len < 2:
                raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
    for data_key in data_keys:
        data_value = data.get(data_key)
        if not isinstance(data_value, (int)) or isinstance(data_value, bool):
            raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    return data

I commented on specific problems in the diff. Do you want to fix it or should I?

krismolendyke · 2018-03-19T14:07:01Z

Thanks for the review! I'm not working today, I'll have a closer look tomorrow. If you want to make the changes and the test suites still pass that would be great!


horejsek · 2018-03-19T14:22:15Z

OK, no problem. I will leave it to you, as I know it's a good feeling to finish the work. :-) Anyway, I can help if needed by the end of the week.

krismolendyke · 2018-03-20T14:24:36Z

Thanks again for all the help and review. I pushed a couple fixes and responded to several review comments. I'm happy to make additional changes if necessary.

At this point, given how the draft 4 spec tests are written, and how confusing the interaction between properties, patternProperties and additionalProperties is, I think we should add failing test cases before changing too much more. The JSON schema test suite is really all I have to go on to determine "correctness." What are your thoughts?

horejsek · 2018-03-27T12:50:40Z

There are still some issues. I merged your changes and will continue working on it.

krismolendyke · 2018-03-27T14:26:35Z

Excellent. I'm happy to help out. Thanks again!

P.S. maybe Travis or similar CI would make future pull requests easier to review?

horejsek · 2018-03-28T10:26:35Z

I just pushed new code to master. I fixed several bugs. One of them was that any early return would mess up the data in deep conditions. Another was an undeclared variable. I simplified pattern and additional properties, added support for formats and basic support for refs, and moved the tests to pytest so they are easier to work with. It still needs more testing: I know of some unit tests that have to be written, and I also want to check whether there is still a performance issue.
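The early-return problem can be illustrated with a hand-written validator shaped like the generated code reviewed earlier in this thread. This is one reading of the bug, with a hypothetical schema (`foo` has only maxItems 3, `bar` must be an array) and a stand-in exception class; the actual fix in master may be structured differently.

```python
class ValidationError(ValueError):
    """Stand-in for the library's exception type in this sketch."""


def validate_with_early_return(data):
    # A type guard for one property returns from the whole function.
    if "foo" in data:
        foo = data["foo"]
        if not isinstance(foo, list):
            return foo  # returns the nested value and skips "bar"
        if len(foo) > 3:
            raise ValidationError("data.foo must contain at most 3 items")
    if "bar" in data:
        if not isinstance(data["bar"], list):
            raise ValidationError("data.bar must be array")
    return data


def validate_with_nested_guard(data):
    # Same checks, but the guard only skips the rule it belongs to.
    if "foo" in data:
        foo = data["foo"]
        if isinstance(foo, list) and len(foo) > 3:
            raise ValidationError("data.foo must contain at most 3 items")
    if "bar" in data:
        if not isinstance(data["bar"], list):
            raise ValidationError("data.bar must be array")
    return data


sample = {"foo": "text", "bar": 1}
early = validate_with_early_return(sample)  # "text": wrong value, no error
try:
    validate_with_nested_guard(sample)
    nested_error = None
except ValidationError as exc:
    nested_error = str(exc)
```

With the early return, invalid input passes and the caller gets back a nested value instead of the validated document; with the nested guard, the invalid `bar` is reported.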

horejsek · 2018-03-28T10:28:39Z

I will not have time for it in the next two weeks. If you want, you can set up Travis, write unit tests for the early return, and come up with more tests for edge cases. You can also check whether the performance of the changes is still the same as before.

horejsek · 2018-04-24T12:49:32Z

I released version 1.2.

krismolendyke added 20 commits March 15, 2018 17:59

Add test suite submodule

af6d7d7

Add test runner script

04b26b5

Update test requirements, install target

0ad1589

Fix performance script

28e96f4

Fix typo

cb650e2

Validate minimum and maximum for non-numbers

a8ba7c4

Validate additional properties

93f23a3

Validate additional items

494b6ce

Validate not

fe82f6e

Validate items

d3648df

Validate max items

f1e353c

Validate max length

3419f30

Validate max properties

306914b

Validate min items

11f72ce

Validate min length

3493b38

Validate min properties

a4d68a8

Validate multiple of

7067946

Validate required

81eb575

Validate pattern

668f824

Validate pattern properties

ae98690

horejsek reviewed Mar 17, 2018

Test instance type not attribute

0e71323

horejsek reviewed Mar 19, 2018

krismolendyke added 2 commits March 20, 2018 09:38

Simplify additional properties logic

83af56a

Fix properties keys check

8015d6a

horejsek merged commit 510bdbc into horejsek:master Mar 27, 2018
