@mobuchowski (Contributor) commented Jul 16, 2024

This PR is based on #40819

This adds support for getting lineage directly from Object Store's ObjectStoragePath.

Not every operation is tracked: only operations that read or modify file contents are, while metadata-only operations are not.

Copy, rename, and move operations are tracked internally. Reads and writes are tracked by TrackingFileWrapper, a proxy that collects them.

This also allows tracking reads and writes made by other libraries that accept file-like objects. For example, this snippet from the Object Store tutorial in Airflow:

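import pandas as pd
from airflow.io.path import ObjectStoragePath

# Base path on S3, accessed through the aws_default connection
base = ObjectStoragePath("s3://aws_default@airflow-tutorial-data/")
(...)

path = base / f"air_quality_{formatted_date}.parquet"

df = pd.DataFrame(response.json()).astype(aq_fields)
# pandas writes through the file object returned by path.open()
with path.open("wb") as file:
    df.to_parquet(file)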

can generate lineage.
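
To illustrate the proxy idea behind TrackingFileWrapper, here is a minimal sketch. It is not the code from this PR: the class name `TrackingFileProxy`, the `inputs`/`outputs` sets, and the `EXAMPLE_PATH` value are hypothetical and chosen only for the illustration. It assumes the proxy wraps a file-like object and records the path whenever a read or write method is actually called, so any library that accepts a file object (here `json`) is tracked without changes.

```python
import io
import json

READ_METHODS = {"read", "readline", "readlines"}
WRITE_METHODS = {"write", "writelines"}

# Hypothetical path, used only as the value recorded by the proxy.
EXAMPLE_PATH = "s3://example-bucket/data.json"


class TrackingFileProxy:
    """Illustrative stand-in for a tracking file wrapper (not the PR's class)."""

    def __init__(self, wrapped, path, inputs, outputs):
        self._wrapped = wrapped
        self._path = path
        self._inputs = inputs    # paths that were read from
        self._outputs = outputs  # paths that were written to

    def __getattr__(self, name):
        # Called only for attributes the proxy does not define itself,
        # so everything else is delegated to the wrapped file object.
        attr = getattr(self._wrapped, name)
        if name in READ_METHODS:
            return self._tracked(attr, self._inputs)
        if name in WRITE_METHODS:
            return self._tracked(attr, self._outputs)
        return attr

    def _tracked(self, method, sink):
        def call(*args, **kwargs):
            # Record the path at call time, then forward the call.
            sink.add(self._path)
            return method(*args, **kwargs)

        return call

    def __enter__(self):
        self._wrapped.__enter__()
        return self

    def __exit__(self, *exc_info):
        return self._wrapped.__exit__(*exc_info)


inputs, outputs = set(), set()

# json.dump only sees a file-like object, yet the write is recorded.
buffer = io.StringIO()
with TrackingFileProxy(buffer, EXAMPLE_PATH, inputs, outputs) as f:
    json.dump({"pm25": 12}, f)
    written = buffer.getvalue()  # read before the buffer is closed on exit

# json.load calls read() on the proxy, so the read is recorded too.
with TrackingFileProxy(io.StringIO(written), EXAMPLE_PATH, inputs, outputs) as f:
    loaded = json.load(f)
```

After running this, both `inputs` and `outputs` contain the example path, which is the kind of information a lineage collector would need to report the file as a dataset that was read and written.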

FileTransferOperator already has OpenLineage support (the AIP-53 implementation).

@mobuchowski force-pushed the aip-62/object-storage branch from f277448 to 26d257b on July 17, 2024 13:16
@mobuchowski added the AIP-62 label (Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation) on Jul 17, 2024
@mobuchowski force-pushed the aip-62/object-storage branch 4 times, most recently from 2313f98 to b4dd98d on July 18, 2024 14:20
@potiuk force-pushed the aip-62/object-storage branch from b4dd98d to dea8071 on July 18, 2024 18:30
@mobuchowski force-pushed the aip-62/object-storage branch 3 times, most recently from 15f6920 to ccdee06 on July 22, 2024 09:01
@mobuchowski added the full tests needed label (We need to run full set of tests for this PR to merge) on Jul 22, 2024
@uranusjr (Member) left a comment

Looks good to me aside from the store property (which I don’t have enough knowledge on)

Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski force-pushed the aip-62/object-storage branch from ccdee06 to 2c3c34b on July 23, 2024 12:22
@mobuchowski merged commit 6adae0b into main on Jul 23, 2024
@ephraimbuddy added the changelog:skip label (Changes that should be skipped from the changelog (CI, tests, etc..)) on Jul 24, 2024
@mobuchowski deleted the aip-62/object-storage branch on August 2, 2024 13:40

Labels

AIP-62 (Tasks tracking implementation of AIP-62 Getting Lineage from Hook Instrumentation)
area:dev-tools
area:lineage
area:providers
changelog:skip (Changes that should be skipped from the changelog (CI, tests, etc..))
full tests needed (We need to run full set of tests for this PR to merge)
provider:amazon (AWS/Amazon - related issues)
provider:common-io

4 participants