Describe the bug
An issue has been identified where the data type of an aggregation column in the final stage does not match its data type in the partial stage.
Here is a summary of a discussion between @rluvaton and me.
This problem arises because:
- Partial Stage: The column data type is derived from the schema of the first batch during ScanExec. For instance, the type might be a dictionary of UTF-8.
- Final Stage: The column data type is determined by converting the Spark type to an Arrow type, resulting in a plain UTF-8 type.
This mismatch causes failures in scenarios where the accumulator expects the same data type in both stages. Specifically:
- The partial stage may use a special implementation (e.g., for dictionaries of UTF-8) that is incompatible with the final stage.
- The final stage receives the state fields but applies the wrong accumulator due to mismatched data types.
Steps to Reproduce
- Use an aggregation function like max on a string column where the data type is dictionary-encoded in the partial stage.
- Observe that the final stage fails because it expects a plain UTF-8 type.
Possible Solutions Discussed
- Always unpack string dictionaries during ScanExec. While benchmarks show no significant performance regression, this limits future optimizations.
- Inject a CopyExec to unpack dictionaries before the final aggregation. However, this approach is infeasible because inputs may include nested types (e.g., lists of strings, structs).
- Save additional metadata in shuffle files to track dictionary encoding.
- Add schema inference earlier in the planning phase to account for dictionary encoding.
Challenges
- Dictionary encoding varies across batches and files (e.g., in Parquet), and schema inference currently relies on the first batch.
- The final aggregation stage uses unbound columns based on the original child, not the state fields, leading to type mismatches.
Action Items
- Create a minimal reproduction case to isolate the issue.
- Investigate integrating schema metadata into Spark plans earlier during planning.
- Evaluate the feasibility of using DataFusion's catalog or schema inference mechanisms to improve type consistency.
Links and References
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response