Closed as not planned
Description
What would you like to happen?
Apache Beam 2.63 and later officially supports tracking data lineage from various IOs.
Each IO emits its lineage information individually as metrics.
Once the lineage information is collected from the IOs in a pipeline, the entries need to be linked to each other according to which IOs are connected in the pipeline graph. The Dataflow backend already does this by:
- Traversing the job graph to identify all connected paths (note: a Beam job graph can have more than one connected component).
- Identifying the sources in these paths (a middle node can be a source too).
- Forming pairs of sources and sinks based on the connected paths above.
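The linking steps above could be sketched roughly as follows. This is only an illustration, not how the Dataflow backend is implemented: the graph representation (a list of edges between step names), the function name `link_lineage`, and the choice to pair a source with every sink reachable downstream of it are all assumptions made for the example.

```python
from collections import deque


def link_lineage(edges, sources, sinks):
    """Pair each source with every sink reachable downstream of it.

    edges:   iterable of (upstream_step, downstream_step) tuples (hypothetical format)
    sources: set of step names that reported source lineage
    sinks:   set of step names that reported sink lineage
    """
    # Build an adjacency list of the job graph.
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)

    pairs = set()
    for source in sources:
        # Breadth-first traversal from this source; a mid-graph step can be a source too.
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in downstream.get(node, []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if nxt in sinks:
                    pairs.add((source, nxt))
                queue.append(nxt)
    return pairs


# Hypothetical job graph with two disconnected components.
# "Enrich" is a middle step that also reads from an external store.
example_edges = [
    ("ReadA", "Enrich"),
    ("Enrich", "WriteX"),
    ("ReadB", "WriteY"),
]
example_pairs = link_lineage(
    example_edges,
    sources={"ReadA", "Enrich", "ReadB"},
    sinks={"WriteX", "WriteY"},
)
```

Because the traversal follows edge direction, sources and sinks in different connected components are never paired with each other.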
Once the lineage information is linked, it can be served in one of two ways:
- A simple API that, given a source or sink, shows the data lineage connected to it.
- Sending this information to a lineage graph visualization tool.
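As a rough idea of what the simple API option might look like, here is a minimal sketch that answers lineage queries from a set of linked (source, sink) pairs. The function name `lineage_for`, the return shape, and the entity names are hypothetical and chosen only for illustration.

```python
def lineage_for(pairs, entity):
    """Return the entities directly linked upstream and downstream of `entity`.

    pairs: iterable of (source, sink) tuples produced by the linking step
    """
    pairs = list(pairs)
    upstream = sorted(src for src, snk in pairs if snk == entity)
    downstream = sorted(snk for src, snk in pairs if src == entity)
    return {"upstream": upstream, "downstream": downstream}


# Hypothetical linked lineage for one pipeline.
linked = {
    ("bigquery:project.dataset.orders", "gcs:bucket/reports"),
    ("kafka:cluster.events", "gcs:bucket/reports"),
    ("kafka:cluster.events", "bigquery:project.dataset.events"),
}

report_lineage = lineage_for(linked, "gcs:bucket/reports")
events_lineage = lineage_for(linked, "kafka:cluster.events")
```

The same pair set could instead be exported to a visualization tool; the lookup here only shows the query side.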
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner