[Feature Request]: Add support for Lineage Graphs in Beam's direct runner through job graph traversal #33980

@rohitsinha54

What would you like to happen?

Apache Beam 2.63 and later officially has the capability to track data lineage from various IOs.
Each IO emits its lineage information individually as metrics.

Once the lineage information has been collected from the IOs in a pipeline, it needs to be linked according to which IOs are connected in the job graph. The Dataflow backend already does this by:

  1. Traversing the job graph to identify all connected paths (note: a Beam job graph can have more than one connected component).
  2. Identifying the sources in these paths (a middle node can also be a source).
  3. Forming pairs of sources and sinks based on the connected nodes above.
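
To make the three steps concrete, here is a minimal sketch in plain Python. It is not the Dataflow implementation. It assumes the job graph is available as an adjacency map from each step to its downstream steps, and that the lineage reported by each step has already been grouped into per-step sets of source and sink names. The function name `link_lineage`, the step names, and the resource identifiers are all hypothetical.

```python
from collections import deque


def link_lineage(edges, sources, sinks):
    """Pair each source with every sink found at or downstream of its step.

    edges:   dict mapping a step name to the steps that consume its output
    sources: dict mapping a step name to the set of source names it reported
    sinks:   dict mapping a step name to the set of sink names it reported
    (All three shapes are assumptions made for this sketch.)
    """
    pairs = set()
    # Any step that reported a source is a starting point, including
    # steps in the middle of the graph.
    for start, source_names in sources.items():
        seen = {start}
        queue = deque([start])
        # Breadth-first traversal over everything reachable from `start`.
        while queue:
            step = queue.popleft()
            for sink_name in sinks.get(step, ()):
                for source_name in source_names:
                    pairs.add((source_name, sink_name))
            for nxt in edges.get(step, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return pairs


# Hypothetical job graph with two disconnected components.
edges = {
    "ReadKafka": ["Parse"],
    "Parse": ["Enrich"],
    "Enrich": ["WriteBQ"],
    "ReadGCS": ["WriteGCS"],
}
# "Enrich" is a middle node that also reads from a lookup table.
sources = {
    "ReadKafka": {"kafka:cluster.topic"},
    "Enrich": {"bigquery:proj.ds.lookup"},
    "ReadGCS": {"gcs:in-bucket"},
}
sinks = {
    "WriteBQ": {"bigquery:proj.ds.out"},
    "WriteGCS": {"gcs:out-bucket"},
}

pairs = link_lineage(edges, sources, sinks)
for source_name, sink_name in sorted(pairs):
    print(source_name, "->", sink_name)
```

The sketch follows directed edges, so a source is paired only with sinks that its data can reach. Sources and sinks in separate components are never paired.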

Once this lineage information is linked, it can be served by:

  1. A simple API that, given a source or sink, shows the data lineage it is connected to.
    or
  2. Sending this information to a lineage graph visualization tool.
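
As a rough illustration of both options, the sketch below indexes linked (source, sink) pairs for lookup and renders them as Graphviz DOT text. The class `LineageIndex`, its method names, and the sample identifiers are hypothetical and not part of Beam. DOT is used only as an example of a format a visualization tool could consume.

```python
from collections import defaultdict


class LineageIndex:
    """Lookup structure over linked (source, sink) pairs (hypothetical API)."""

    def __init__(self, pairs):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)
        for source, sink in pairs:
            self._downstream[source].add(sink)
            self._upstream[sink].add(source)

    def sinks_of(self, source):
        """Return the sinks that receive data from `source`, sorted."""
        return sorted(self._downstream.get(source, ()))

    def sources_of(self, sink):
        """Return the sources that feed `sink`, sorted."""
        return sorted(self._upstream.get(sink, ()))

    def to_dot(self):
        """Render the pairs as a Graphviz DOT digraph."""
        lines = ["digraph lineage {"]
        for source in sorted(self._downstream):
            for sink in sorted(self._downstream[source]):
                lines.append(f'  "{source}" -> "{sink}";')
        lines.append("}")
        return "\n".join(lines)


# Made-up linked pairs, as they might come out of the linking step.
pairs = {
    ("kafka:cluster.topic", "bigquery:proj.ds.out"),
    ("bigquery:proj.ds.lookup", "bigquery:proj.ds.out"),
    ("gcs:in-bucket", "gcs:out-bucket"),
}

index = LineageIndex(pairs)
print(index.sources_of("bigquery:proj.ds.out"))
print(index.sinks_of("gcs:in-bucket"))
print(index.to_dot())
```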

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
