Skip to content

Conversation

@ericvergnaud
Copy link
Contributor

@ericvergnaud ericvergnaud commented Mar 13, 2025

Our current implementation crashes when marshalling/unmarshalling generic lists, such as list[some_class].

This PR fixes that crash.

Our current implementation also does not support marshalling/unmarshalling generics with 'object' or 'Any' as type parameters, such as:

@dataclass
class SomeClass:
  field_a: list[object]
  field_b: dict[str, Any]

This PR fixes that by allowing developers to marshall/unmarshall any type, thus making the storage layer more robust.

Required for databrickslabs/lakebridge#1488

@github-actions
Copy link

github-actions bot commented Mar 13, 2025

✅ 40/40 passed, 2 skipped, 1m58s total

Running from acceptance #299

@ericvergnaud
Copy link
Contributor Author

ericvergnaud commented Mar 13, 2025

@Reviewer please note that the no-cheat check needs to be acceped:

  • cheats were added to existing code in paths.py, so that formatting doesn't yell
  • the complexity of _marshal is inherent to the problem being solved

@ericvergnaud
Copy link
Contributor Author

@asnare @sundarshankar89 any chance you can review this PR ?

Copy link
Collaborator

@sundarshankar89 sundarshankar89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@asnare asnare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect to the purpose of this PR, I'm not sure we should be should be supporting object or Any: these go against the spirit and purpose of type annotations, and to an extent I'm missing the reason we want to support these. In a few places the existing code is a bit lazy and uses Any, but in general it shouldn't be necessary for users of this class to use it. In particular: a) we don't allow any value; b) it should be possible to express (via a union and/or type alias) arbitrary combinations (including nesting) of the types that we do support.

I'm not sure where we need this, but think a stronger justification is needed to do this. The purpose of this package is partly to enforce standards across our projects, and one of those standards is proper type hinting. This PR significantly undermines that goal.

That notwithstanding, there are a few nits that are fixed by this PR and that's great, as is the expanded test coverage for how we handle (properly typed) collections.

Comment on lines +488 to +489
if type_ref in (object, Any):
return self._marshal(type(inst), path, inst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure about this, it looks like the caller isn't providing the correct type_ref?

The intent of that parameter is that the caller should know the type of what it's trying to marshal and declare it up-front. With this change I'm not sure what the purpose of the parameter is at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are situations where the type_ref is not specific, because it's not specified by the caller, notably when serializing/deserializing nested structures. Generally speaking, the code will produce better results if the result types are accurately specified, but if not, it will return a list or dict, which can easily be converted by the caller, instead of crashing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument here is a bit circular because it boils down to the type not being specific because the caller didn't provide it. But I think they can?

I understand why you want to allow for object and Any, but I still keep coming back to:

  • I haven't yet seen an example where this is truly necessary. (We had one for JSON values, but it turned out to not be necessary.)
  • Specifying Any is always going to be inaccurate. There are obviously things that we can't serialise, and allowing this type annotation leads us down the road of crashing because we encounter one. It's a promise we can't keep.

I know this is coming across as being difficult, but I am reluctant to just let this slide because of forward compatibility: once we allow it, we can't easily remove support for it.

Comment on lines 484 to +487
if type_ref == list:
return self._marshal_list(type_ref, path, inst)
return self._marshal_raw_list(path, inst)
if type_ref == dict:
return self._marshal_raw_dict(path, inst)
Copy link
Contributor

@asnare asnare Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change means that the generic type parameter for lists and dictionaries (eg. list[str] or dict[str, str]) is now ignored: previously this was used for marshalling, and the caller was penalised with an error for providing values that didn't match the declared type. However now we just treat everything as a "raw" collection and there's no penalty for a misbehaving caller, is that correct?

Copy link
Contributor Author

@ericvergnaud ericvergnaud Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it doesn't mean that. It actually fixes a bug where an incorrect type_ref was passed down. If the declared typeref is say list[string], then it is not equal to list, it is a GenericAlias, which is handled by _marshal_generic_types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying with respect to list vs list[T].

I noticed we don't have a test for list or dict so I tried it out:

class SampleClass:
    field: list

It turns out that saving works roughly as expected, but loading does not. (To be clear, this is also the case before this PR.)

Given that list and dict aren't supported (before or after this PR), this makes me wonder about the purpose of them here: is that something you can help me understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fixed now

@ericvergnaud
Copy link
Contributor Author

ericvergnaud commented Apr 3, 2025

With respect to the purpose of this PR, I'm not sure we should be should be supporting object or Any: these go against the spirit and purpose of type annotations, and to an extent I'm missing the reason we want to support these. In a few places the existing code is a bit lazy and uses Any, but in general it shouldn't be necessary for users of this class to use it. In particular: a) we don't allow any value; b) it should be possible to express (via a union and/or type alias) arbitrary combinations (including nesting) of the types that we do support.

I'm not sure where we need this, but think a stronger justification is needed to do this. The purpose of this package is partly to enforce standards across our projects, and one of those standards is proper type hinting. This PR significantly undermines that goal.

That notwithstanding, there are a few nits that are fixed by this PR and that's great, as is the expanded test coverage for how we handle (properly typed) collections.

For our own code, if one serializes for example a list[some_data_class] using a list or a list[Any] type hint, they won't get a list[some_data_class] when unmarshalling, a bug that they will need to fix asap (by providing the correct type hint). So not sure this PR undermines the stated goal re code that we produce.
However, we sometimes need to store structured data in raw format for which we have no type hint, notably nested structures provided by 3rd parties without type hints. In that case, we just act as a pass trough.

(In this PR, support for this feature comes on top of proper testing + bug fixes also in this PR).

Copy link
Contributor

@asnare asnare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericvergnaud: Thanks for reminding me this morning to have another look at this. I've had time to collect my thoughts somewhat. To be clear:

  • I think it's fantastic that we now have some tests that cover this code. This code being covered is difficult to reason about from inspection, and is poorly documented, so the tests are a welcome addition.
  • It's also great that bugs have been fixed. Please call them out: it helps with future searching and understanding what's been fixed (and when).

My main reservation is and remains about encouraging users of this library to annotate with object and Any. As mentioned in a comment:

  • Once we allow it, it's difficult to remove due to the need to retain compatibility.
  • There are obviously values that fall under object and Any that we cannot serialise, so this is a promise we can't keep.
  • I don't (yet) see the need, given that I believe that callers can express what they need using type aliases.

Following that line of thought… we can only serialise and deserialise values that fit into the JSON/YAML data model: that's the type of file we're loading and saving, after all, so they represent the maximum bounds of what Any can represent anyway. (It looks like CSV is also supported, but let's ignore that for now.)

With this in mind why don't we define a proper type alias (and include it for clients to use) for what we can support, for example the JSONValue type alias? We could call it something else (eg. AnyConfig) and although it won't support data classes (because it can't distinguish between them and alternatives in the union type), I think this is what we're actually trying to express via Any and/or object? This would be a promise we can keep and mitigate the concerns I've described above.

Comment on lines 484 to +487
if type_ref == list:
return self._marshal_list(type_ref, path, inst)
return self._marshal_raw_list(path, inst)
if type_ref == dict:
return self._marshal_raw_dict(path, inst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying with respect to list vs list[T].

I noticed we don't have a test for list or dict so I tried it out:

class SampleClass:
    field: list

It turns out that saving works roughly as expected, but loading does not. (To be clear, this is also the case before this PR.)

Given that list and dict aren't supported (before or after this PR), this makes me wonder about the purpose of them here: is that something you can help me understand?

Comment on lines +488 to +489
if type_ref in (object, Any):
return self._marshal(type(inst), path, inst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument here is a bit circular because it boils down to the type not being specific because the caller didn't provide it. But I think they can?

I understand why you want to allow for object and Any, but I still keep coming back to:

  • I haven't yet seen an example where this is truly necessary. (We had one for JSON values, but it turned out to not be necessary.)
  • Specifying Any is always going to be inaccurate. There are obviously things that we can't serialise, and allowing this type annotation leads us down the road of crashing because we encounter one. It's a promise we can't keep.

I know this is coming across as being difficult, but I am reluctant to just let this slide because of forward compatibility: once we allow it, we can't easily remove support for it.

@asnare asnare assigned ericvergnaud and unassigned asnare Apr 15, 2025
@ericvergnaud
Copy link
Contributor Author

This debate is philosophical, and I don't think a PR is the right place for having it.
Even in Morpheus, written in a strongly typed language, with all the code under our control, there are 26 occurrences of asInstanceOf i.e. places where we're not able to accurately specify a type.
So you think developers can always be explicit, but reality says otherwise, even in the above ideal situation.
The current code crashes. This PR fixes that.

@ericvergnaud ericvergnaud assigned asnare and unassigned ericvergnaud Apr 15, 2025
@ericvergnaud ericvergnaud changed the title Support marshalling of 'object' and 'any' Fix crash when marshalling lists, and add support for 'object' and 'any' types Apr 15, 2025
asnare added a commit that referenced this pull request Apr 16, 2025
These provide some baseline coverage and behavior for serialising and deserialising configuration files.
…tabrickslabs/blueprint into support-marshalling-of-object-and-any

* 'support-marshalling-of-object-and-any' of github.com:databrickslabs/blueprint:
  Initial test coverage for configuration marshalling/unmarshalling (#226)
  Fixed Blueprint Install (#225)
* main:
  Fix argument interpolation in colorised logs (#233)
  Ensure the non-colorized logs also include timestamps with second-lev… (#235)
  Test coverage for the customised logger setup (#232)
  Type hints for existing installation tests (#231)

# Conflicts:
#	tests/unit/test_installation.py
@asnare asnare removed the request for review from nfx May 16, 2025 09:39
@asnare asnare self-assigned this May 16, 2025
@asnare asnare added bug Something isn't working enhancement New feature or request labels May 16, 2025
* main:
  Release v0.10.2 (#239)
  Ensure that App logger emits `DEBUG` events if the CLI is invoked with `--debug` (#238)
  Consistent exception formatting in logs (#237)
  Ensure the names of logger levels are consistent (#234)
  Fix logger name abbreviation fails if the logger name contains `..` (#236)
sundarshankar89 pushed a commit that referenced this pull request May 27, 2025
This PR continues from #189 by updating it with the following changes:

- It introduces a specific `JsonValue` type alias that represents the
maximum bounds of the values that can be saved for an installation.
- It removes support for `Any` and `object` as type annotations on data
classes that we marshal. The existing tests for these have been updated
to use `JsonValue` instead.
- Support for marshalling raw `list` and `dict` fields is deprecated.
Saving these used to work, but loading would fail. I've chosen to deal
with this by:
- Issuing a DeprecationWarning if they're hit when saving, with
instructions to use `list[T]` or `dict[T]` instead.
- During loading, a specific error is thrown indicating how it can be
fixed.
- Documents the restrictions and limitations on `.save()` for what it
can actually handle.
- Adds a marker for an improvement whereby during marshalling the type
hints should actually be checked against the values that are being
saved.
 - Fixed some tests that were accidentally passing.
- Tightens some of the type-hints we had in place where we declared
`Any` but really only accepted a subset.

This is sufficient to allow the use-case that we need in
databrickslabs/lakebridge#1488, and is preferred to the approach in #189
because it forces developers to fix their code:

- Declaring `JSONValue` as a field will let the type-checker tell them
if a value is assigned that does not conform. (The fix is for the client
code to validate the values.)
- Similarly during loading the type-checker forces the caller to check
and handle the possible types that might be present.

One of the bugs fixed in #189 relates to automatic conversion between
JSON values: for example, from `123` to `"123"` or vice-versa. Some
tests were (incorrectly) relying on this, as were some tests in a
downstream project (remorph). _There might be other places in production
code where the previous behaviour is relied upon._ As such, the next
release that includes this code (or that from #189) should bump the
minor version and not just be a patch release.

---------

Co-authored-by: Eric Vergnaud <eric.vergnaud@databricks.com>
@asnare asnare removed their assignment Jun 16, 2025
@asnare
Copy link
Contributor

asnare commented Jun 16, 2025

Superseded by #226 & #241.

@asnare asnare closed this Jun 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants