Improve JSON schema draft 4 support#3
Conversation
| self.l('return {variable}') | ||
| self._compile_regexps['{}_re'.format(self._variable)] = re.compile(self._definition['pattern']) | ||
| with self.l('if not {variable}_re.match({variable}):'): | ||
| with self.l('if not {variable}_re.search({variable}):'): |
There was a problem hiding this comment.
When defining the regular expressions, it’s important to note that the string is considered valid if the expression matches anywhere within the string.
– https://spacetelescope.github.io/understanding-json-schema/reference/string.html#index-2
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string
There was a problem hiding this comment.
I see! I misinterpreted the validity of regexp in JSON schema. Thanks for the fix!
| with self.l('try:'): | ||
| self.l('{variable}.keys()') | ||
| with self.l('except AttributeError:'): | ||
| self.l('return {variable}') |
There was a problem hiding this comment.
Wouldn't be simple condition faster than try-except block?
There was a problem hiding this comment.
That's likely. I will refactor any more of these blocks to isinstance checks to align with the rest of the code. My mistake.
| for pattern in self._definition['patternProperties'].keys(): | ||
| with self.l('if globals()["{}_re"].search(key):', pattern): | ||
| self.l('pattern_keys.add(key)') | ||
| self.l('{variable}_keys -= pattern_keys') |
There was a problem hiding this comment.
I don't understand yet why this block is here, why methods later are not enough.
There was a problem hiding this comment.
I don't, either 😉. This is due to the logical demands of this test in the draft 4 spec that enforces the interaction between properties, patternProperties, and additionalProperties. There may be a more efficient way to pass this test.
There was a problem hiding this comment.
:-) Yep, those all properties are funny ones. Will look at it in more detail on Monday.
| with self.l('if not isinstance({variable}, dict):'): | ||
| self.l('return {variable}') | ||
| for pattern, definition in self._definition['patternProperties'].items(): | ||
| self._compile_regexps['{}_re'.format(pattern)] = re.compile(pattern) |
There was a problem hiding this comment.
This needs better key because in generate_pattern is {variable}_re and in can clash. Or maybe keep this one and change the first one to {variable}_re_var as it never run to conflict ({varable}_var_re or var_{variable}_re could as rgexp could be x_var or var_x).
I was thinking why you used globals() and then I figured out that pattern is not valid Python name and I'm thinking that variable name doesn't have to be valid Python name as well. Could you change it and document it why it's that way? It's very important to know this information when someone else will look here and would want to do some change.
Or, it could be even better, to not use globals() at all but create unique names for regulars. First can have name, like r1, r2, r3, ... So during generating have that f.o is r1 if it's first one and use this name in the code. Not sure how expensive is to look to globals() in every iteration. Regulars has to be compiled in generation as it is and saved on global but on the start of the function it could take it from globals to make it even faster. I'm thinking about that because now we generate code like this one:
...
for key, val in data.items():
if globals()["f.o_re"].search(key):
...
But maybe it's not very big performance improvement.
There was a problem hiding this comment.
I tried the non-globals() index-based approach first and I worried about ordering during key iteration so I refactored to this may-based approach. Ordering shouldn't be an issue if the dict is never mutated but I didn't know how to be sure about that beyond converting the schema definition to an OrderedDict and that felt like too much change.
I don't love polluting globals but this seemed like the lesser of evils. A local solution would be faster but I haven't profiled the difference.
fastjsonschema/generator.py
Outdated
| for pattern in self._definition['patternProperties'].keys(): | ||
| with self.l('if globals()["{}_re"].search(key):', pattern): | ||
| self.l('pattern_keys.add(key)') | ||
| self.l('{variable}_keys -= pattern_keys') |
There was a problem hiding this comment.
This code shouldn't be needed. Look how properties work. There is this code for every found property:
self.l('{variable}_keys.remove("{}")', key)
So it you handle this already in generate_pattern_properties, then it's not needed to iterate them twice.
There was a problem hiding this comment.
You're correct. I may have implemented these "out of order" based on how the JSON schema test suite executes. We should be able to simply return here. Good catch.
| def generate_additional_properties(self): | ||
| with self.l('if not isinstance({variable}, dict):'): | ||
| self.l('return {variable}') | ||
| self.l('{variable}_keys = set({variable}.keys())') |
There was a problem hiding this comment.
This is created already by object type on the beginning. No need to do it again.
There was a problem hiding this comment.
I understand your comments, but I think this is required due to this test:
$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json
✔ additionalProperties.json
⚠ additionalProperties can exist by itself
⚠ UNDEFINED NameError an additional valid property is valid: name 'data_keys' is not defined
⚠ UNDEFINED NameError an additional invalid property is invalid: name 'data_keys' is not defined
Summary of 12 tests:
Failures:
✘ FALSE_POSITIVE 0 0.0%
✘ FALSE_NEGATIVE 0 0.0%
⚠ UNDEFINED 2 16.7%
2 16.7%
Passes:
✔ TRUE_POSITIVE 8 66.7%
✔ TRUE_NEGATIVE 2 16.7%
10 83.3%
⛔ Ignored: 0
Coverage: 12/12 100.0%Which is a pretty odd case, but is apparently valid according to the spec.
There was a problem hiding this comment.
This is very probably connected to previous comment where keys should be managed once. Maybe test doesn't pass because generated code is not as simple as needed.
|
|
||
| def generate_additional_properties(self): | ||
| with self.l('if not isinstance({variable}, dict):'): | ||
| self.l('return {variable}') |
There was a problem hiding this comment.
This is done already by object type on the beginning. No need to do it again.
|
|
||
| def generate_pattern_properties(self): | ||
| with self.l('if not isinstance({variable}, dict):'): | ||
| self.l('return {variable}') |
There was a problem hiding this comment.
This is created already by object type on the beginning. No need to do it again.
fastjsonschema/generator.py
Outdated
| add_prop_definition = self._definition["additionalProperties"] | ||
| if add_prop_definition: | ||
| with self.l('for {variable}_key in {variable}_keys:'): | ||
| with self.l('if {variable}_key not in "{}":', self._definition.get('properties', [])): |
There was a problem hiding this comment.
This generates wrong code:
if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":
I think this condition shouldn't even be here as in {variable}_keys is only the additional properties already.
There was a problem hiding this comment.
You are right, I meant to just check if {variable}_key was not in the properties keys. I'll fix that.
I do think we need this check though for this test case:
$ ./json_schema_test_suite.py JSON-Schema-Test-Suite/tests/draft4/additionalProperties.json
✘ additionalProperties.json
✘ additionalProperties allows a schema which should validate
✘ FALSE_NEGATIVE JsonSchemaException no additional properties is valid: data.foo must be boolean
✘ FALSE_NEGATIVE JsonSchemaException an additional valid property is valid: data.bar must be boolean
Summary of 2 tests:
Failures:
✘ FALSE_POSITIVE 0 0.0%
✘ FALSE_NEGATIVE 2 100.0%
⚠ UNDEFINED 0 0.0%
2 100.0%
Passes:
✔ TRUE_POSITIVE 0 0.0%
✔ TRUE_NEGATIVE 0 0.0%
0 0.0%
⛔ Ignored: 0
Coverage: 2/2 100.0%
|
I checked the code in details. There is many issues with Solution you provided: Code which should be correct if I didn't make any mistake but looks good: I commented specific problems in the diff. Do you want to fix it or should I? |
|
Thanks for the review! I'm not working today, I'll have a closer look
tomorrow. If you want to make the changes and the test suites still pass
that would be great!
…On Mon, Mar 19, 2018, 9:15 AM Michal Hořejšek ***@***.***> wrote:
I checked the code in details. There is many issues with patternProps and
additionalProps. For example I looked how is code generated for this JSON
schema:
{'properties': {'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}, 'patternProperties': {'f.o': {'minItems': 2}}, 'additionalProperties': {'type': 'integer'}}
Solution you provided:
def func(data):
NoneType = type(None)
if not isinstance(data, dict):
return data
data_keys = set(data.keys())
if "foo" in data_keys:
data_keys.remove("foo")
data_foo = data["foo"]
if not isinstance(data_foo, (list)):
raise JsonSchemaException("data.foo must be array")
if not isinstance(data_foo, list):
return data_foo
data_foo_len = len(data_foo)
if data_foo_len > 3:
raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
if "bar" in data_keys:
data_keys.remove("bar")
data_bar = data["bar"]
if not isinstance(data_bar, (list)):
raise JsonSchemaException("data.bar must be array")
if not isinstance(data, dict):
return data
for key, val in data.items():
if globals()["f.o_re"].search(key):
if not isinstance(val, list):
return val
val_len = len(val)
if val_len < 2:
raise JsonSchemaException(""+"data.patternProperties.{key}".format(**locals())+" must contain at least 2 items")
pattern_keys = set()
for key in data_keys:
if globals()["f.o_re"].search(key):
pattern_keys.add(key)
data_keys -= pattern_keys
for data_key in data_keys:
data_value = data.get(data_key)
if not isinstance(data_value, (int)) or isinstance(data_value, bool):
raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
if not isinstance(data, dict):
return data
for key, val in data.items():
if globals()["f.o_re"].search(key):
if not isinstance(val, list):
return val
if val_len < 2:
raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
if not isinstance(data, dict):
return data
data_keys = set(data.keys())
pattern_keys = set()
for key in data_keys:
if globals()["f.o_re"].search(key):
pattern_keys.add(key)
data_keys -= pattern_keys
for data_key in data_keys:
if data_key not in "{'foo': {'type': 'array', 'maxItems': 3}, 'bar': {'type': 'array'}}":
data_value = data.get(data_key)
if not isinstance(data_value, (int)) or isinstance(data_value, bool):
raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
return data
Code which should be correct if I didn't make any mistake but looks good:
def func(data):
NoneType = type(None)
if not isinstance(data, dict):
return data
data_keys = set(data.keys())
if "foo" in data_keys:
data_keys.remove("foo")
data_foo = data["foo"]
if not isinstance(data_foo, (list)):
raise JsonSchemaException("data.foo must be array")
if not isinstance(data_foo, list):
return data_foo
data_foo_len = len(data_foo)
if data_foo_len > 3:
raise JsonSchemaException("data.foo must contain less than or equal to 3 items")
if "bar" in data_keys:
data_keys.remove("bar")
data_bar = data["bar"]
if not isinstance(data_bar, (list)):
raise JsonSchemaException("data.bar must be array")
for key, val in data.items():
if globals()["f.o_re"].search(key):
if key in data_keys:
data_keys.remove(key)
if not isinstance(val, list):
return val
val_len = len(val)
if val_len < 2:
raise JsonSchemaException(""+"data.{key}".format(**locals())+" must contain at least 2 items")
for data_key in data_keys:
data_value = data.get(data_key)
if not isinstance(data_value, (int)) or isinstance(data_value, bool):
raise JsonSchemaException(""+"data.{data_key}".format(**locals())+" must be integer")
return data
I commented specific problems in the diff. Do you want to fix it or should
I?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAHreprf5QqWI0Apc9CWjfL2oHWz8RmAks5tf6-NgaJpZM4Suljt>
.
|
|
Ok, no problem. I will leave it to you as I know it's good feeling to finish work. :-) Anyway I can help if needed by end of the week. |
|
Thanks again for all the help and review. I pushed a couple fixes and responded to several review comments. I'm happy to make additional changes if necessary. At this point, given how the draft 4 spec tests are written, and how confusing the interaction between |
|
There are still some issue. I merged your changes and continue work on it. |
|
Excellent. I'm happy to help out. Thanks again! P.S. maybe Travis or similar CI would make future pull requests easier to review? |
|
I just pushed new code to master. I fixed several bugs. One of them was that any early return would mess up the data in deep conditions. Undeclared variable. I simplified pattern and additional properties. Added support of formats and base support of refs. Also moved tests to pytest so it's easier to work with. It still need to be more tested, I know about some unit tests which has to be made and also I want to test if there is still some performance issue. |
|
I will not have time for it next two weeks. If you want, you can set up travis, do unit tests for early return and came up with more other tests for edge cases. Also you can check the performance of changes if it still the same as before. |
|
I released version 1.2. |
Thanks for creating a Python JSON schema validation library that isn't slow!
Aside from the good performance, and the differences noted in the documentation, I wanted to know exactly how well this library validated against the official JSON schema (draft 4) test suite.
To do that I have made the following changes:
To support those changes:
extras_requiretosetup.pywithtestrequirementsMakefileto usepipto installtestrequirementsperformance.pyscript to properlydedentcodefortimeitrunTo nitpick ;p :
spcified→specifiedAll current
make testtarget tests still pass viapytest.This notably skips these draft 4 tests via configuration:
uniqueItems.jsonpartially passes but fails on nested objects. Passing that test will come at cost to performance.Example test suite execution:
I may have survived...