Skip to content

API based add to allow list script (marshmallow)#224

Merged
jon-betts merged 6 commits into
mainfrom
api-based-add-to-allow-mashmallow
Mar 18, 2021
Merged

API based add to allow list script (marshmallow)#224
jon-betts merged 6 commits into
mainfrom
api-based-add-to-allow-mashmallow

Conversation

@jon-betts
Copy link
Copy Markdown
Contributor

@jon-betts jon-betts commented Mar 15, 2021

For: #195

Alternative to: #223 based on marshmallow. The review and tests notes on that ticket pretty much still apply here.

@jon-betts jon-betts self-assigned this Mar 15, 2021
@jon-betts
Copy link
Copy Markdown
Contributor Author

jon-betts commented Mar 15, 2021

It's been requested to write this again using marshmallow. So I thought that might be a good time to write down some observations about the differences between JSON schema (the format, as implemented here with jsonschema the library) and marshmallow . I'm more familiar with JSON schema, so when I say "JSON schema can do this..." it's probably pretty sound; when I say "marshmallow doesn't support this..." it's more likely I just haven't spotted how to do it yet.

tl;dr

  • marshmallow is Python specific, and does validation and object creation / serialisation
  • JSON schema is language agnostic, has more powerful validation capabilities, but is only for validation

About structure (advantage jsonschema)

  • marshmallow Schema objects generally only express one level of nesting each, therefore the creation of a nested structure requires as many classes as there are nested things
  • JSON schema flip flops between data and metadata at each level of nesting, but allows arbitrary nesting in place
  • marshmallow the base library appears to lack some of the features which are part of the spec of JSON Schema such as enumerations, allOf / anyOf / oneOf sub-schema capabilities etc.
  • JSON schema allows more flexible lists in particular, which can either be mixed content or have different schema at different specific positions
  • Some gaps in marshmallow could probably be made up for by writing your own validators (JSON schema provides for regexes over dict keys for example, but you could easily write your own)
  • Some gaps can be made up by installing additional packages like marshmallow_union, marshmallow_polyfield, ... although these are separate packages, not written and maintained by the main marshmallow developers from what I can see
  • Some gaps probably can't be fixed as it's incompatible with the serialisation / deserialisation focus of the library
  • The lack of polymorphism seems likely to cap the complexity of the schema it is possible to express to simpler or single objects.

Built in type / format checking (similar)

  • Both marshmallow and the JSON schema spec support a number of built-in types and validators
  • They seem roughly similar for types (strings, lists, dicts etc) and formats (dates, URLs,numbers, floats etc.)
  • Both seem to provide extra methods for constraining lengths etc.
  • Each provides a couple the other doesn't
  • As marshmallow translates values rather than just checking them, some of the field types it returns have handy features attached to them, like methods on IPv4 objects for is_multicast() and things like that

Custom format checking (similar)

  • Both support custom validators
  • Both let you apply this to any part of the hierarchy of data, and do whatever you like
  • For example you can define a JSON schema format validator has-foo and apply it to a number of things, or apply multiple formats to a single field
    marshmallow also merges deserialisation and validation, so you can modify the data as you check it
  • JSON Schema is purely a validation spec, although Python being Python, if you get a-hold of a mutable object you can mess with it, although you certainly shouldn't

Speed (similar, but...)

  • On the example in this PR both libraries seem to be similar
  • You'd probably need a more exciting example to be able to tell the difference
  • There are different implementations for JSON schema though, so there is a fastjsonschema library which proports to be much faster than jsonschema
  • I gave it a quick shot, and it's 10x faster than either, but it also includes this comment in the code:
#    ___
#    \./     DANGER: This project implements some code generation
# .--.O.--.          techniques involving string concatenation.
#  \/   \/           If you look at it, you might die.
  • So we might or might not want to use it, but I'd say advantage JSON schema the format rather than jsonschema the specific implementation

Features unique to one or the other

marshmallow:

  • Focused on serialisation/deserialisation with validation
  • This means you can mutate with the data as you validate it and return something different from what you received
  • Allows you to read and write from data different to that which you are validating (used to create objects etc.)
  • Will convert some types into handy objects with useful methods
  • The separate library webargs provides methods for getting items from params etc in various web frameworks (like Pyramid) and uses marshmallow

JSON Schema:

  • Focused on validation only
  • Appears to be mostly a superset of the validation capabilities of marshmallow with many things expressible in JSON Schema which are not expressible in marshmallow
  • In particular it allows:
    • allOf, oneOf, anyOf over sub-schema
    • Conditional elements if / then / else
    • Positional elements in arrays
  • Not Python specific and can be used in any language
  • Schema files can be published on and consumed directly from the web and refer to each other locally or remotely
  • Is therefore useful for communication between services and languages
  • Is used in other formats and frameworks like Open API
  • Can be stored as configuration (JSON files) not code

Extra reading

@jon-betts
Copy link
Copy Markdown
Contributor Author

jon-betts commented Mar 15, 2021

Some more speculative/personal taste things. All of these comments assume someone who is fluent with both libraries

Ergonomics

  • marshmallow code looks simpler (for small or flat examples) as it's all singly nested things
  • I'd suggest that it's actually more complicated to understand, as the nesting is implicit rather than explicit (you have to join a lot of dots in your head to see the structure)
  • You have to create a lot of classes you don't necessarily care about
  • JSON Schema is more dense and requires getting your eyes on a bit, but the nesting reflects the nesting of the object
  • JSON Schema $ref allow you a similar ability to break things up to make them understandable if you want to
  • That's good or bad depending on what you do with it. As JSON Schema allows you to create more complicated things, those things might be harder to understand if you do it badly, but it's hard to knock it here, as marshmallow just doesn't let you do it at all.
  • jsonschema has slightly more boilerplate to get a validator instance and add format checkers (<10 lines)
  • ... but once you'd done it you might not have to write any more python for validation ever
  • You're also less likely to have to write your own code to cover limitations of marshmallow I'd guess. Although if you used marshmallow all day, you'd likely be blind to this gap so the cost wouldn't be apparent as you'd be more likely to structure your project around marshmallow limitations without realising it.
  • JSON Schema allows for adding examples directly into the schema, which is nice

Getting help

  • The docs for both are good, but in some cases a bit sparse:
  • JSON schema is a pretty active format (it's expanding quite frequently), and some of the newer draft features aren't documented as well
  • JSON Schema as a community seems a bit more active (which is no surprise as it's cross language)
  • There are more online tools etc. to help with writing JSON schemas

Not making a mess

  • marshmallow being Python and JSON schema being JSON prevents you from writing invalid Python or JSON respectively, but doesn't really ensure you have a sensible schema in either case.
  • JSON schema is more fault tolerant though, and so is more likely to silently ignore nonsense you put in your schema by mistake. So running a JSON schema aware linter / validation is a good idea. For example the spec for JSON schema is specified in JSON schema and can be applied to any schema (built into most libraries I'd guess).
  • We've written tests before that ensure a JSON schema meets it's own examples (it's pretty easy)
  • In both cases, assuming the format isn't totally trivial, you probably need some tests
  • If you wanted to get closer to the validation capabilities of JSON schema you could get somewhere by adding more companion packages to marshmallow
  • You'd probably not be able to get there, and you'd be maintaining more packages which are individually considerably more sketchy

Other

  • I get how you could make a base schema which two sub types inherit from in marshmallow, but I can't see how you'd make a single schema that validated both (this is pretty trivial in JSON schema)

@jon-betts jon-betts force-pushed the api-based-add-to-allow-mashmallow branch from 4767403 to cd229bd Compare March 15, 2021 16:55
@jon-betts jon-betts changed the title Api based add to allow mashmallow API based add to allow mashmallow Mar 15, 2021
@jon-betts jon-betts requested a review from seanh March 15, 2021 16:57
@jon-betts jon-betts changed the title API based add to allow mashmallow API based add to allow marshmallow Mar 15, 2021
@jon-betts jon-betts changed the title API based add to allow marshmallow API based add to allow list script (marshmallow) Mar 15, 2021
@seanh seanh self-assigned this Mar 16, 2021
@jon-betts jon-betts force-pushed the api-based-add-to-allow-mashmallow branch 2 times, most recently from d277153 to 5a5faa6 Compare March 17, 2021 20:56
@jon-betts
Copy link
Copy Markdown
Contributor Author

I've got this most of the way, but something very weird has happened to the requirements. I'm not sure what the state of play with that is at the moment, so I'll pick this up tomorrow.

@jon-betts jon-betts force-pushed the api-based-add-to-allow-mashmallow branch 2 times, most recently from d97953f to 2ab03a7 Compare March 18, 2021 13:25
Copy link
Copy Markdown
Contributor

@seanh seanh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I had to make a couple of fixes to get it to work locally, and it looks like there's a couple of unrelated changes mixed in

Comment thread bin/add_to_allow_list.py Outdated
Comment thread checkmate/views/ui/api/add_to_allow_list.py Outdated
# Check it notices if we change one char
last_char = "f" if nonce[:-1] != "f" else "e"
assert not service.check_nonce(nonce[:-1] + last_char)
assert not service.check_nonce(nonce[:-2] + last_char)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Copy Markdown
Contributor Author

@jon-betts jon-betts Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to fix a flakey test, but I've just realised I've fixed it wrong. Looking at this the actual error is:

last_char = "f" if nonce[-1] != "f" else "e"

We need to modify part of the hash to see if it detects a change, so we try and swap the last char to an f, but 1 in 16 times it'll already be an f so we have to pick something else. This was in a separate commit with a message, but it's clearly still wrong.

It works now, but for the wrong reasons.

Comment thread tests/conftest.py
_TEST_DATABASE_URL = os.environ.get(
"TEST_DATABASE_URL",
"postgresql://postgres@localhost:5434/checkmate_test",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh?

Copy link
Copy Markdown
Contributor Author

@jon-betts jon-betts Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about it's use below. In one case we read from the environ only, in the other we have a default. I'm not sure when this happened, but it makes it more awkward to run individual tests in the IDE.

Added as a separate commit for the explanation.

@jon-betts jon-betts force-pushed the api-based-add-to-allow-mashmallow branch from 5a58825 to 74fb215 Compare March 18, 2021 15:51
@jon-betts jon-betts requested a review from seanh March 18, 2021 15:55
@jon-betts jon-betts merged commit 9711bfe into main Mar 18, 2021
@jon-betts jon-betts deleted the api-based-add-to-allow-mashmallow branch March 18, 2021 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants