Skip to content

Improve JSON schema draft 4 support#3

Merged
horejsek merged 23 commits intohorejsek:masterfrom
krismolendyke:json-schema-draft4
Mar 27, 2018
Merged

Improve JSON schema draft 4 support#3
horejsek merged 23 commits intohorejsek:masterfrom
krismolendyke:json-schema-draft4

Conversation

@krismolendyke
Copy link
Contributor

Thanks for creating a Python JSON schema validation library that isn't slow!

Aside from the good performance, and the differences noted in the documentation, I wanted to know exactly how well this library validated against the official JSON schema (draft 4) test suite.

To do that I have made the following changes:

To support those changes:

  • Add extras_require to setup.py with test requirements
  • Update Makefile to use pip to install test requirements
  • Fix performance.py script to properly dedent code for timeit run

To nitpick ;p :

  • Fix a typographical error, spcifiedspecified

All current make test target tests still pass via pytest.

This notably skips these draft 4 tests via configuration:

definitions.json
dependencies.json
optional/bignum.json
optional/ecmascript-regex.json
optional/format.json
optional/zeroTerminatedFloats.json
ref.json
refRemote.json
uniqueItems.json

uniqueItems.json partially passes but fails on nested objects. Passing that test will come at cost to performance.

Example test suite execution:

$ ./json_schema_test_suite.py json-schema-test-suite-draft-4.conf
✔ additionalItems.json
✔ additionalProperties.json
✔ allOf.json
✔ anyOf.json
⛔ bignum.json
✔ default.json
⛔ definitions.json
⛔ dependencies.json
⛔ ecmascript-regex.json
✔ enum.json
⛔ format.json
✔ items.json
✔ maxItems.json
✔ maxLength.json
✔ maxProperties.json
✔ maximum.json
✔ minItems.json
✔ minLength.json
✔ minProperties.json
✔ minimum.json
✔ multipleOf.json
✔ not.json
✔ oneOf.json
✔ pattern.json
✔ patternProperties.json
✔ properties.json
⛔ ref.json
⛔ refRemote.json
✔ required.json
✔ type.json
⛔ uniqueItems.json
⛔ zeroTerminatedFloats.json

Schema exceptions:

⛔ ref.json: expected an indented block (<string>, line 29): 'return data'
⛔ refRemote.json: expected an indented block (<string>, line 11): 'return data'

Summary of 358 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    0   0.0%
⚠ UNDEFINED         0   0.0%
                    0   0.0%

Passes:

✔ TRUE_POSITIVE   125  52.7%
✔ TRUE_NEGATIVE   112  47.3%
                  237 100.0%

⛔ Ignored:        121
Coverage:     237/358  66.2%

I may have survived...

#    ___
#    \./     DANGER: This module implements some code generation
# .--.O.--.          techniques involving string concatenation.
#  \/   \/           If you look at it, you might die.
#

Copy link
Owner

@horejsek horejsek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

self.l('return {variable}')
self._compile_regexps['{}_re'.format(self._variable)] = re.compile(self._definition['pattern'])
with self.l('if not {variable}_re.match({variable}):'):
with self.l('if not {variable}_re.search({variable}):'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When defining the regular expressions, it’s important to note that the string is considered valid if the expression matches anywhere within the string.

https://spacetelescope.github.io/understanding-json-schema/reference/string.html#index-2

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string

http://devdocs.io/python~3.6/library/re#search-vs-match

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! I misinterpreted the validity of regexp in JSON schema. Thanks for the fix!

with self.l('try:'):
self.l('{variable}.keys()')
with self.l('except AttributeError:'):
self.l('return {variable}')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be simple condition faster than try-except block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's likely. I will refactor any more of these blocks to isinstance checks to align with the rest of the code. My mistake.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worry, great, thanks!

for pattern in self._definition['patternProperties'].keys():
with self.l('if globals()["{}_re"].search(key):', pattern):
self.l('pattern_keys.add(key)')
self.l('{variable}_keys -= pattern_keys')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand yet why this block is here, why methods later are not enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't, either 😉. This is due to the logical demands of this test in the draft 4 spec that enforces the interaction between properties, patternProperties, and additionalProperties. There may be a more efficient way to pass this test.

Copy link
Owner

@horejsek horejsek Mar 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-) Yep, those all properties are funny ones. Will look at it in more detail on Monday.

with self.l('if not isinstance({variable}, dict):'):
self.l('return {variable}')
for pattern, definition in self._definition['patternProperties'].items():
self._compile_regexps['{}_re'.format(pattern)] = re.compile(pattern)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs better key because in generate_pattern is {variable}_re and in can clash. Or maybe keep this one and change the first one to {variable}_re_var as it never run to conflict ({varable}_var_re or var_{variable}_re could as rgexp could be x_var or var_x).

I was thinking why you used globals() and then I figured out that pattern is not valid Python name and I'm thinking that variable name doesn't have to be valid Python name as well. Could you change it and document it why it's that way? It's very important to know this information when someone else will look here and would want to do some change.

Or, it could be even better, to not use globals() at all but create unique names for regulars. First can have name, like r1, r2, r3, ... So during generating have that f.o is r1 if it's first one and use this name in the code. Not sure how expensive is to look to globals() in every iteration. Regulars has to be compiled in generation as it is and saved on global but on the start of the function it could take it from globals to make it even faster. I'm thinking about that because now we generate code like this one:

    ...
    for key, val in data.items():
        if globals()["f.o_re"].search(key):
    ...

But maybe it's not very big performance improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the non-globals() index-based approach first and I worried about ordering during key iteration so I refactored to this may-based approach. Ordering shouldn't be an issue if the dict is never mutated but I didn't know how to be sure about that beyond converting the schema definition to an OrderedDict and that felt like too much change.

I don't love polluting globals but this seemed like the lesser of evils. A local solution would be faster but I haven't profiled the difference.

for pattern in self._definition['patternProperties'].keys():
with self.l('if globals()["{}_re"].search(key):', pattern):
self.l('pattern_keys.add(key)')
self.l('{variable}_keys -= pattern_keys')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code shouldn't be needed. Look how properties work. There is this code for every found property:

self.l('{variable}_keys.remove("{}")', key)

So it you handle this already in generate_pattern_properties, then it's not needed to iterate them twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct. I may have implemented these "out of order" based on how the JSON schema test suite executes. We should be able to simply return here. Good catch.

def generate_additional_properties(self):
with self.l('if not isinstance({variable}, dict):'):
self.l('return {variable}')
self.l('{variable}_keys = set({variable}.keys())')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is created already by object type on the beginning. No need to do it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your comments, but I think this is required due to this test:

$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json 
✔ additionalProperties.json
  ⚠ additionalProperties can exist by itself
    ⚠ UNDEFINED NameError an additional valid property is valid: name 'data_keys' is not defined
    ⚠ UNDEFINED NameError an additional invalid property is invalid: name 'data_keys' is not defined

Summary of 12 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    0   0.0%
⚠ UNDEFINED         2  16.7%
                    2  16.7%

Passes:

✔ TRUE_POSITIVE     8  66.7%
✔ TRUE_NEGATIVE     2  16.7%
                   10  83.3%

⛔ Ignored:          0
Coverage:      12/12 100.0%

Which is a pretty odd case, but is apparently valid according to the spec.

Copy link
Owner

@horejsek horejsek Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very probably connected to previous comment where keys should be managed once. Maybe test doesn't pass because generated code is not as simple as needed.


def generate_additional_properties(self):
with self.l('if not isinstance({variable}, dict):'):
self.l('return {variable}')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done already by object type on the beginning. No need to do it again.


def generate_pattern_properties(self):
with self.l('if not isinstance({variable}, dict):'):
self.l('return {variable}')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is created already by object type on the beginning. No need to do it again.

add_prop_definition = self._definition["additionalProperties"]
if add_prop_definition:
with self.l('for {variable}_key in {variable}_keys:'):
with self.l('if {variable}_key not in "{}":', self._definition.get('properties', [])):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generates wrong code:

    if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":

I think this condition shouldn't even be here as in {variable}_keys is only the additional properties already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I meant to just check if {variable}_key was not in the properties keys. I'll fix that.

I do think we need this check though for this test case:

$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json 
✘ additionalProperties.json
  ✘ additionalProperties allows a schema which should validate
    ✘ FALSE_NEGATIVE JsonSchemaException no additional properties is valid: data.foo must be boolean
    ✘ FALSE_NEGATIVE JsonSchemaException an additional valid property is valid: data.bar must be boolean

Summary of 2 tests:

Failures:

✘ FALSE_POSITIVE    0   0.0%
✘ FALSE_NEGATIVE    2 100.0%
⚠ UNDEFINED         0   0.0%
                    2 100.0%

Passes:

✔ TRUE_POSITIVE     0   0.0%
✔ TRUE_NEGATIVE     0   0.0%
                    0   0.0%

⛔ Ignored:          0
Coverage:       2/2 100.0%

@horejsek
Copy link
Owner

I checked the code in details. There is many issues with patternProps and additionalProps. For example I looked how is code generated for this JSON schema:

{'properties': {'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}, 'patternProperties': {'f.o': {'minItems': 2}}, 'additionalProperties': {'type': 'integer'}}

Solution you provided:

def func(data):
    NoneType = type(None)
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    if "foo" in data_keys:
        data_keys.remove("foo")
        data_foo = data["foo"]
        if not isinstance(data_foo, (list)):
            raise JsonSchemaException("data.foo must be array")
        if not isinstance(data_foo, list):
            return data_foo
        data_foo_len = len(data_foo)
        if data_foo_len > 3:
            raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
    if "bar" in data_keys:
        data_keys.remove("bar")
        data_bar = data["bar"]
        if not isinstance(data_bar, (list)):
            raise JsonSchemaException("data.bar must be array")
    if not isinstance(data, dict):
        return data

    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if not isinstance(val, list):
                return val
            val_len = len(val)
            if val_len < 2:
                raise JsonSchemaException(""+"data.patternProperties.{key}".format(**locals())+" must contain at least 2 items")
    pattern_keys = set()
    for key in data_keys:
        if globals()["f.o_re"].search(key):
            pattern_keys.add(key)
    data_keys -= pattern_keys
    for data_key in data_keys:
        data_value = data.get(data_key)
        if not isinstance(data_value, (int)) or isinstance(data_value, bool):
            raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    if not isinstance(data, dict):
        return data
    
    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if not isinstance(val, list):
                return val
                
            if val_len < 2:
                raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    pattern_keys = set()
    for key in data_keys:
        if globals()["f.o_re"].search(key):
            pattern_keys.add(key)
    data_keys -= pattern_keys
    for data_key in data_keys:
        if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":
            data_value = data.get(data_key)
            if not isinstance(data_value, (int)) or isinstance(data_value, bool):
                raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    return data

Code which should be correct if I didn't make any mistake but looks good:

def func(data):
    NoneType = type(None)
    if not isinstance(data, dict):
        return data
    data_keys = set(data.keys())
    if "foo" in data_keys:
        data_keys.remove("foo")
        data_foo = data["foo"]
        if not isinstance(data_foo, (list)):
            raise JsonSchemaException("data.foo must be array")
        if not isinstance(data_foo, list):
            return data_foo
        data_foo_len = len(data_foo)
        if data_foo_len > 3:
            raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
    if "bar" in data_keys:
        data_keys.remove("bar")
        data_bar = data["bar"]
        if not isinstance(data_bar, (list)):
            raise JsonSchemaException("data.bar must be array")
    for key, val in data.items():
        if globals()["f.o_re"].search(key):
            if key in data_keys:
                data_keys.remove(key)
            if not isinstance(val, list):
                return val
            val_len = len(val)
            if val_len < 2:
                raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
    for data_key in data_keys:
        data_value = data.get(data_key)
        if not isinstance(data_value, (int)) or isinstance(data_value, bool):
            raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
    return data

I commented specific problems in the diff. Do you want to fix it or should I?

@krismolendyke
Copy link
Contributor Author

krismolendyke commented Mar 19, 2018 via email

@horejsek
Copy link
Owner

Ok, no problem. I will leave it to you as I know it's good feeling to finish work. :-) Anyway I can help if needed by end of the week.

@krismolendyke
Copy link
Contributor Author

krismolendyke commented Mar 20, 2018

Thanks again for all the help and review. I pushed a couple fixes and responded to several review comments. I'm happy to make additional changes if necessary.

At this point, given how the draft 4 spec tests are written, and how confusing the interaction between properties, patternProperties and additionalProperties is, I think we should add failing test cases before changing too much more. The JSON schema test suite is really all I have to go to determine "correctness." What are your thoughts?

@horejsek horejsek merged commit 510bdbc into horejsek:master Mar 27, 2018
@horejsek
Copy link
Owner

There are still some issue. I merged your changes and continue work on it.

@krismolendyke
Copy link
Contributor Author

Excellent. I'm happy to help out. Thanks again!

P.S. maybe Travis or similar CI would make future pull requests easier to review?

@horejsek
Copy link
Owner

I just pushed new code to master. I fixed several bugs. One of them was that any early return would mess up the data in deep conditions. Undeclared variable. I simplified pattern and additional properties. Added support of formats and base support of refs. Also moved tests to pytest so it's easier to work with. It still need to be more tested, I know about some unit tests which has to be made and also I want to test if there is still some performance issue.

@horejsek
Copy link
Owner

I will not have time for it next two weeks. If you want, you can set up travis, do unit tests for early return and came up with more other tests for edge cases. Also you can check the performance of changes if it still the same as before.

@horejsek
Copy link
Owner

I released version 1.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants