Skip to content

Conversation

@topherinternational
Copy link
Contributor

@topherinternational topherinternational commented Sep 6, 2024

Re-post of @uranusjr 's change in #42054, with a bug fixed and tests added.

Original PR description:

Using the hash property on Dataset (and DatasetAlias) is problematic since subclassing is a possibility, and user code may not correctly implement hashing for Airflow’s purposes.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@topherinternational
Copy link
Contributor Author

@potiuk here's something?

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2024

This wasn't covered by a test? We should add one.

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2024

This should also include the previous changes since they were reverted.

@topherinternational
Copy link
Contributor Author

This wasn't covered by a test? We should add one.

Well it's revealed by some provider tests that look like they weren't run in #42054's checks.

I think the challenge is that this is arg defaults all the way down, there's a long chain of func(kwargs, extra=None) calls that would all need tests to detect if a call that passed no extras produced a dataset object that matched Dataset(args) (w/o an extra arg).

@topherinternational
Copy link
Contributor Author

This should also include the previous changes since they were reverted.

I opened this PR before @potiuk reverted 😓

@topherinternational
Copy link
Contributor Author

@uranusjr idk where the dataset factory signature is documented, but it should be documented that the extra kwarg should be changed from None to {}.

e.g. here is the common/io/file one:
def create_dataset(*, path: str, extra=None) -> Dataset:

When #42054 is re-implemented it should be swapped to

def create_dataset(*, path: str, extra={}) -> Dataset:

and so forth for others.

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2024

def create_dataset(*, path: str, extra={}) -> Dataset:

Can’t do this because Python… https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments

Keeping the default as None and changing it to {} should be the way to do. I think I’ll add post-processing code in Dataset so None is converted automatically to {} so the interface does not change. (Once the Dataset class is created, there shouldn’t be compatible issues.)

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2024

I’ve re-applied the previous changes here, and also added a fix so callers (including HookLineageCollector) can continue to pass in a None. It will be converted automatically to a dict.

Can you work on a test to cover the dataset_factory code path?

@topherinternational
Copy link
Contributor Author

I’ve re-applied the previous changes here, and also added a fix so callers (including HookLineageCollector) can continue to pass in a None. It will be converted automatically to a dict.

Can you work on a test to cover the dataset_factory code path?

I removed my original change, I think it's best if we keep the call stack as it is and use your fix that coerces the None to {} only in the Dataset constructor, less confusing and easier to test that way imo.

I added more tests to create_dataset that cover the dataset_factory API (uri, kwargs, missing factory, and factory exception are covered).

@topherinternational topherinternational changed the title Fix Dataset extra arg default in HookLineageCollector Try to make dataset objects totally unhashable, redux Sep 6, 2024
@potiuk potiuk added the full tests needed We need to run full set of tests for this PR to merge label Sep 6, 2024
@potiuk potiuk closed this Sep 6, 2024
@potiuk potiuk reopened this Sep 6, 2024
@potiuk
Copy link
Member

potiuk commented Sep 6, 2024

I added "full-tests-needed" label and closed/reopened to re-run all tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:lineage full tests needed We need to run full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants