[C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect. #34786

@westonpace

Description

Describe the bug, including details regarding any error messages, version, and platform.

The code calculating the output schema is here:

      FieldVector output_fields;
      output_fields.reserve(key_field_ids.size() + measure_size);
      // extract aggregate fields to output schema
      for (const auto& agg_src_fieldset : agg_src_fieldsets) {
        for (int field : agg_src_fieldset) {
          output_fields.emplace_back(input_schema->field(field));
        }
      }
      // extract key fields to output schema
      for (int key_field_id : key_field_ids) {
        output_fields.emplace_back(input_schema->field(key_field_id));
      }

      std::shared_ptr<Schema> aggregate_schema = schema(std::move(output_fields));

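This appears to have two issues:

  • It inserts the key fields after the measure fields, whereas the Substrait spec lists the grouping keys first, followed by the measures
  • It builds the measure fields from the function inputs rather than the function outputs, so the output field's name and type can be wrong (and a function with several inputs adds several fields)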

I suspect we are getting away with it in many cases because we are not applying a projection / emit after the aggregate, so nothing downstream depends on the computed schema. At the very least, we should add some test cases that do this so we can verify the output schema is correct. If there is indeed an issue, we should also fix it.
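
To make the expected ordering concrete, here is a minimal sketch of how the output fields could be assembled so that both points above are addressed. It does not use the real Arrow types: `Field`, `Measure`, and `AggregateOutputFields` are hypothetical stand-ins invented for illustration, and the sketch assumes the output name and type of each aggregate function are already known.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::Field; only what the sketch needs.
struct Field {
  std::string name;
  std::string type;
};

// Hypothetical description of one measure: the name and type of the
// aggregate function's *output*, not of its input columns.
struct Measure {
  std::string output_name;
  std::string output_type;
};

// Builds the aggregate output fields: grouping keys first, then one field
// per measure, described by the function output.
std::vector<Field> AggregateOutputFields(const std::vector<Field>& input_fields,
                                         const std::vector<int>& key_field_ids,
                                         const std::vector<Measure>& measures) {
  std::vector<Field> output_fields;
  output_fields.reserve(key_field_ids.size() + measures.size());
  // Key fields come first and keep the input field's name and type.
  for (int key_field_id : key_field_ids) {
    assert(key_field_id >= 0 &&
           static_cast<std::size_t>(key_field_id) < input_fields.size());
    output_fields.push_back(input_fields[key_field_id]);
  }
  // Exactly one output field per measure, regardless of how many inputs
  // the function consumes.
  for (const Measure& measure : measures) {
    output_fields.push_back(Field{measure.output_name, measure.output_type});
  }
  return output_fields;
}
```

For an input of `a: int32, b: utf8, c: float64`, grouped by `b` with a single count over `a`, this sketch yields `b: utf8` followed by `count_a: int64`. The code quoted above would instead yield `a: int32` followed by `b: utf8`. A test case that applies a projection or emit after the aggregate should be able to tell the two apart.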

Component(s)

C++
