-
Notifications
You must be signed in to change notification settings - Fork 14
Fix crash when marshalling lists, and add support for 'object' and 'any' types #189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
✅ 40/40 passed, 2 skipped, 1m58s total Running from acceptance #299 |
|
@Reviewer please note that the no-cheat check needs to be acceped:
|
|
@asnare @sundarshankar89 any chance you can review this PR ? |
sundarshankar89
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
asnare
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With respect to the purpose of this PR, I'm not sure we should be should be supporting object or Any: these go against the spirit and purpose of type annotations, and to an extent I'm missing the reason we want to support these. In a few places the existing code is a bit lazy and uses Any, but in general it shouldn't be necessary for users of this class to use it. In particular: a) we don't allow any value; b) it should be possible to express (via a union and/or type alias) arbitrary combinations (including nesting) of the types that we do support.
I'm not sure where we need this, but think a stronger justification is needed to do this. The purpose of this package is partly to enforce standards across our projects, and one of those standards is proper type hinting. This PR significantly undermines that goal.
That notwithstanding, there are a few nits that are fixed by this PR and that's great, as is the expanded test coverage for how we handle (properly typed) collections.
| if type_ref in (object, Any): | ||
| return self._marshal(type(inst), path, inst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about this, it looks like the caller isn't providing the correct type_ref?
The intent of that parameter is that the caller should know the type of what it's trying to marshal and declare it up-front. With this change I'm not sure what the purpose of the parameter is at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are situations where the type_ref is not specific, because it's not specified by the caller, notably when serializing/deserializing nested structures. Generally speaking, the code will produce better results if the result types are accurately specified, but if not, it will return a list or dict, which can easily be converted by the caller, instead of crashing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument here is a bit circular because it boils down to the type not being specific because the caller didn't provide it. But I think they can?
I understand why you want to allow for object and Any, but I still keep coming back to:
- I haven't yet seen an example where this is truly necessary. (We had one for JSON values, but it turned out to not be necessary.)
- Specifying
Anyis always going to be inaccurate. There are obviously things that we can't serialise, and allowing this type annotation leads us down the road of crashing because we encounter one. It's a promise we can't keep.
I know this is coming across as being difficult, but I am reluctant to just let this slide because of forward compatibility: once we allow it, we can't easily remove support for it.
| if type_ref == list: | ||
| return self._marshal_list(type_ref, path, inst) | ||
| return self._marshal_raw_list(path, inst) | ||
| if type_ref == dict: | ||
| return self._marshal_raw_dict(path, inst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change means that the generic type parameter for lists and dictionaries (eg. list[str] or dict[str, str]) is now ignored: previously this was used for marshalling, and the caller was penalised with an error for providing values that didn't match the declared type. However now we just treat everything as a "raw" collection and there's no penalty for a misbehaving caller, is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it doesn't mean that. It actually fixes a bug where an incorrect type_ref was passed down. If the declared typeref is say list[string], then it is not equal to list, it is a GenericAlias, which is handled by _marshal_generic_types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying with respect to list vs list[T].
I noticed we don't have a test for list or dict so I tried it out:
class SampleClass:
field: listIt turns out that saving works roughly as expected, but loading does not. (To be clear, this is also the case before this PR.)
Given that list and dict aren't supported (before or after this PR), this makes me wonder about the purpose of them here: is that something you can help me understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fixed now
For our own code, if one serializes for example a (In this PR, support for this feature comes on top of proper testing + bug fixes also in this PR). |
asnare
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericvergnaud: Thanks for reminding me this morning to have another look at this. I've had time to collect my thoughts somewhat. To be clear:
- I think it's fantastic that we now have some tests that cover this code. This code being covered is difficult to reason about from inspection, and is poorly documented, so the tests are a welcome addition.
- It's also great that bugs have been fixed. Please call them out: it helps with future searching and understanding what's been fixed (and when).
My main reservation is and remains about encouraging users of this library to annotate with object and Any. As mentioned in a comment:
- Once we allow it, it's difficult to remove due to the need to retain compatibility.
- There are obviously values that fall under
objectandAnythat we cannot serialise, so this is a promise we can't keep. - I don't (yet) see the need, given that I believe that callers can express what they need using type aliases.
Following that line of thought… we can only serialise and deserialise values that fit into the JSON/YAML data model: that's the type of file we're loading and saving, after all, so they represent the maximum bounds of what Any can represent anyway. (It looks like CSV is also supported, but let's ignore that for now.)
With this in mind why don't we define a proper type alias (and include it for clients to use) for what we can support, for example the JSONValue type alias? We could call it something else (eg. AnyConfig) and although it won't support data classes (because it can't distinguish between them and alternatives in the union type), I think this is what we're actually trying to express via Any and/or object? This would be a promise we can keep and mitigate the concerns I've described above.
| if type_ref == list: | ||
| return self._marshal_list(type_ref, path, inst) | ||
| return self._marshal_raw_list(path, inst) | ||
| if type_ref == dict: | ||
| return self._marshal_raw_dict(path, inst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying with respect to list vs list[T].
I noticed we don't have a test for list or dict so I tried it out:
class SampleClass:
field: listIt turns out that saving works roughly as expected, but loading does not. (To be clear, this is also the case before this PR.)
Given that list and dict aren't supported (before or after this PR), this makes me wonder about the purpose of them here: is that something you can help me understand?
| if type_ref in (object, Any): | ||
| return self._marshal(type(inst), path, inst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument here is a bit circular because it boils down to the type not being specific because the caller didn't provide it. But I think they can?
I understand why you want to allow for object and Any, but I still keep coming back to:
- I haven't yet seen an example where this is truly necessary. (We had one for JSON values, but it turned out to not be necessary.)
- Specifying
Anyis always going to be inaccurate. There are obviously things that we can't serialise, and allowing this type annotation leads us down the road of crashing because we encounter one. It's a promise we can't keep.
I know this is coming across as being difficult, but I am reluctant to just let this slide because of forward compatibility: once we allow it, we can't easily remove support for it.
|
This debate is philosophical, and I don't think a PR is the right place for having it. |
These provide some baseline coverage and behavior for serialising and deserialising configuration files.
This PR continues from #189 by updating it with the following changes: - It introduces a specific `JsonValue` type alias that represents the maximum bounds of the values that can be saved for an installation. - It removes support for `Any` and `object` as type annotations on data classes that we marshal. The existing tests for these have been updated to use `JsonValue` instead. - Support for marshalling raw `list` and `dict` fields is deprecated. Saving these used to work, but loading would fail. I've chosen to deal with this by: - Issuing a DeprecationWarning if they're hit when saving, with instructions to use `list[T]` or `dict[T]` instead. - During loading, a specific error is thrown indicating how it can be fixed. - Documents the restrictions and limitations on `.save()` for what it can actually handle. - Adds a marker for an improvement whereby during marshalling the type hints should actually be checked against the values that are being saved. - Fixed some tests that were accidentally passing. - Tightens some of the type-hints we had in place where we declared `Any` but really only accepted a subset. This is sufficient to allow the use-case that we need in databrickslabs/lakebridge#1488, and is preferred to the approach in #189 because it forces developers to fix their code: - Declaring `JSONValue` as a field will let the type-checker tell them if a value is assigned that does not conform. (The fix is for the client code to validate the values.) - Similarly during loading the type-checker forces the caller to check and handle the possible types that might be present. One of the bugs fixed in #189 relates to automatic conversion between JSON values: for example, from `123` to `"123"` or vice-versa. Some tests were (incorrectly) relying on this, as were some tests in a downstream project (remorph). _There might be other places in production code where the previous behaviour is relied upon._ As such, the next release that includes this code (or that from #189) should bump the minor version and not just be a patch release. --------- Co-authored-by: Eric Vergnaud <eric.vergnaud@databricks.com>
Our current implementation crashes when marshalling/unmarshalling generic lists, such as
list[some_class].This PR fixes that crash.
Our current implementation also does not support marshalling/unmarshalling generics with 'object' or 'Any' as type parameters, such as:
This PR fixes that by allowing developers to marshall/unmarshall any type, thus making the storage layer more robust.
Required for databrickslabs/lakebridge#1488