Describe the bug
An issue has been identified where the data type of an aggregation column in the final stage does not match its data type in the partial stage.
Here is a summary of a discussion between @rluvaton and me.
This problem arises because:
- Partial Stage: The column data type is derived from the schema of the first batch during ScanExec. For instance, the type might be a dictionary of UTF-8.
- Final Stage: The column data type is determined by converting the Spark type to an Arrow type, resulting in a plain UTF-8 type.
This mismatch causes failures in scenarios where the accumulator expects the same data type in both stages. Specifically:
- The partial stage may use a special implementation (e.g., for dictionaries of UTF-8) that is incompatible with the final stage.
- The final stage receives the state fields but applies the wrong accumulator due to mismatched data types.
Steps to Reproduce
- Use an aggregation function like max on a string column where the data type is dictionary-encoded in the partial stage.
- Observe that the final stage fails because it expects a plain UTF-8 type.
Possible Solutions Discussed
- Always unpack string dictionaries during ScanExec. While benchmarks show no significant performance regression, this limits future optimizations.
- Inject a CopyExec to unpack dictionaries before the final aggregation. However, this approach is infeasible because inputs may include nested types (e.g., lists of strings, structs).
- Save additional metadata in shuffle files to track dictionary encoding.
- Add schema inference earlier in the planning phase to account for dictionary encoding.
Challenges
- Dictionary encoding varies across batches and files (e.g., in Parquet), and schema inference currently relies on the first batch.
- The final aggregation stage uses unbound columns based on the original child, not the state fields, leading to type mismatches.
Action Items
- Create a minimal reproduction case to isolate the issue.
- Investigate integrating schema metadata into Spark plans earlier during planning.
- Evaluate the feasibility of using DataFusion's catalog or schema inference mechanisms to improve type consistency.
Links and References
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response