Skip to content

Conversation

@shangxinli
Copy link
Contributor

This commit resolves the TODO in DeleteFilter.fileProjection() that previously prevented equality deletes from working with nested columns.

The original code used tableSchema.asStruct().field(fieldId) which only finds top-level fields, causing 'Cannot find required field for ID X' errors when equality delete files referenced nested fields.

Changes:

  • Use Schema.findColumnName() to get full column path for nested fields
  • Use Schema.select() to preserve nested structure in projection
  • Separate data column IDs from metadata column IDs to avoid passing metadata column names to Schema.select()
  • Add metadata columns explicitly after projection (they don't exist in tableSchema)
  • Add MetadataColumns.metadataColumns() to expose metadata columns dynamically
  • Add test coverage for nested columns and metadata columns

@shangxinli shangxinli force-pushed the nested_field_fix2 branch 2 times, most recently from 7713ba8 to 96a9db1 Compare December 21, 2025 16:42
This commit resolves the TODO in DeleteFilter.fileProjection() that
previously prevented equality deletes from working with nested columns.

The original code used tableSchema.asStruct().field(fieldId) which only
finds top-level fields, causing 'Cannot find required field for ID X'
errors when equality delete files referenced nested fields.

Changes:
- Use Schema.findColumnName() to get full column path for nested fields
- Use Schema.select() to preserve nested structure in projection
- Separate data column IDs from metadata column IDs to avoid passing
  metadata column names to Schema.select()
- Add metadata columns explicitly after projection (they don't exist in tableSchema)
- Add MetadataColumns.metadataColumns() to expose metadata columns dynamically
- Add test coverage for nested columns and metadata columns

Fix DeleteFilter projection to preserve requested schema column order

The previous implementation used Schema.select() which returns columns
in the original table schema order, not the requested order. This caused
TestSparkReaderDeletes.testEqualityDeleteWithDifferentScanAndDeleteColumns
to fail because it expected columns in the requested schema order.

The fix directly builds the schema by:
1. Adding requested schema columns first (preserves order)
2. Appending missing fields needed for equality deletes
3. Appending metadata columns

This ensures the final projection maintains the requested schema's
column ordering while still including all fields required for
delete filtering.
Types.NestedField field = tableSchema.asStruct().field(fieldId);
// Add missing data columns from table schema
for (int fieldId : missingDataIds) {
Types.NestedField field = tableSchema.findField(fieldId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does using tableSchema.findField(fieldId) here lose the nested structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! We need to call findColumnName() to preserve the nested column names.

This change addresses the issue where using tableSchema.findField(fieldId)
in fileProjection() would lose the nested structure when adding fields from
equality deletes. The previous implementation would extract only the leaf
field without preserving its parent struct/list/map containers.

Changes:
- Use Schema.findColumnName() to get full dotted path for nested fields
- Use Schema.select() with dotted paths to preserve nested structure
- Manually rebuild columns in requested schema order to maintain consistency
- Add separate handling for metadata columns
- Extract rebuildSchemaInRequestedOrder helper to reduce cyclomatic complexity
- Add comprehensive test coverage for nested structures

Tests added include:
- Basic nested struct fields
- Deep nesting (4 levels)
- Array element nested fields
- Map value nested fields
- Multiple nested fields from different parent structures
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants