Skip to content

Conversation

@Prathamesh9284
Copy link

@Prathamesh9284 Prathamesh9284 commented Feb 10, 2026

Which issue does this PR close?

Closes #508

Rationale for this change

Other Substrait producers (Isthmus, DuckDB) support generating JSON-formatted Substrait plans, which makes it much easier to inspect and compare plans across engines. The substrait crate already provides serde support via the pbjson feature, so DataFusion Python can leverage this to expose JSON plan generation and parsing with minimal effort.

What changes are included in this PR?

  • python/datafusion/substrait.py

    • Added to_json() instance method and parse_json() static method on the Plan class for converting plans to JSON strings and parse JSON strings to plans.
  • src/substrait.rs

    • Implemented the Rust backend for the above Python methods, including error mapping.
    • Added the following functions:
      • fn to_json(&self) -> PyDataFusionResult<String>
      • #[staticmethod] fn parse_json(json: &str) -> PyDataFusionResult<PyPlan>
  • Cargo.toml

    • Added serde_json as a dependency.
  • pyproject.toml

    • Added protobuf and substrait as dev dependencies for test support.
  • python/tests/test_substrait.py

    • Added test for JSON round-trip stability and protobuf compatibility validation.

Are there any user-facing changes?

Yes — four new public API methods:

  • Plan.to_json() -> str — convert a Substrait plan to a JSON string
  • Plan.parse_json(json: str) -> Plan — create a Substrait plan from a JSON string

Copy link
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions/recommendations. My biggest concern is I don't see evidence that this is generating json strings that are actually compatible with the other libraries and described in the original issue. It's entirely possible it's working exactly as needed, but I think that's something we can show in the tests.

def to_json(self) -> str:
"""Serialize the plan to a JSON string.

Returns a JSON string following the Protobuf JSON Mapping.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the statement "Returns a JSON string following the Protobuf JSON Mapping" to be hard to understand. What exactly does this mean? Is this LLM generated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the wording isn’t very clear. I’ll rephrase it to make it clearer that this just returns the Plan in its standard JSON form.


@staticmethod
def serialize_json(sql: str, ctx: SessionContext, path: str | pathlib.Path) -> None:
"""Serialize a SQL query to a JSON-formatted Substrait plan file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the output a json file? I'm trying to understand the "serialize" in this context.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the current implementation it writes a JSON-formatted Substrait plan to a file. I used “serialize” in the same sense as the existing binary serialize, but I agree the naming may not be very clear here.

# Convert plan to JSON
json_str = plan.to_json()
assert isinstance(json_str, str)
assert "relations" in json_str # basic sanity check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these conversions to json are stable as the issue suggests, then I would think we could compare the entire output, right? What do the other libraries produce?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @timsaucer,

I was able to generate a JSON-formatted Substrait plan using my DataFusion implementation. Comparing it with Isthmus:

  • Both include relations, project expressions, virtualTable values, and correct column mappings.
  • Minor differences exist: DataFusion uses i64 instead of i32, column names differ, and optional metadata (extensions, typeAliases) are not included — these do not affect compatibility.
  • DuckDB currently does not provide a working JSON inspection method due to extension availability issues, so direct JSON comparison via DuckDB isn’t possible.

Overall, the DataFusion output is compatible and can be used for reproducibility and validation.

Comment on lines 158 to 159
@staticmethod
def serialize_json(sql: str, ctx: SessionContext, path: str | pathlib.Path) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the preferred method here to write to a file or to just produce a string output? Maybe the author of the issue has an opinion on best use case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally trying to keep it consistent with Serde.serialize, which writes the binary plan to a file. But I’m also fine with just returning a JSON string from Plan.to_json() and letting users handle writing to a file themselves. Do you have a preference for which approach fits better with the project?

Comment on lines 170 to 171
def deserialize_json(path: str | pathlib.Path) -> Plan:
"""Deserialize a Substrait plan from a JSON file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, not sure if this needs to be a path to a file that we read or expect to pass in a string. Since it's python it's fairly easy to handle either/both

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on my earlier comment if we keep JSON handling on Plan, would it make sense for from_json() to accept either a JSON string or a file path? That might simplify things instead of having separate helpers on Serde. Curious what you think.

src/errors.rs Outdated
PythonError(PyErr),
EncodeError(EncodeError),
DecodeError(DecodeError),
SerdeJsonError(SerdeJsonError),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend removing SerdeJsonError and mapping to EncodeError or DecodeError. Alternatively putting into DataFusion execution error.

@Prathamesh9284 Prathamesh9284 force-pushed the feat/substrait-json-plan branch 2 times, most recently from 479bbd9 to e9df13e Compare February 12, 2026 14:37
@Prathamesh9284 Prathamesh9284 force-pushed the feat/substrait-json-plan branch from e9df13e to babd63a Compare February 12, 2026 14:48
@Prathamesh9284
Copy link
Author

Prathamesh9284 commented Feb 12, 2026

Hi @timsaucer

I have added pytest to validate cross-library compatibility. On test_substrait.py ines 94–99 parse the JSON using the official Substrait protobuf library (plan_pb2) via json_format.Parse, which would fail if the JSON didn’t conform to the Substrait protobuf schema. It then re-serializes to binary and asserts byte-for-byte equality with the original DataFusion output. This ensures any tool that supports protobuf-JSON can consume it correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for generating JSON formatted substrait plan

2 participants