Exclude reading _pos column if it's not in the scan list #11390
szehon-ho merged 8 commits into apache:main
Conversation
also cc @flyrain
@huaxingao: I'm not an expert in the Spark codebase, but I think having a test which fails before the change and succeeds after the change would be nice. Otherwise we risk future PRs changing this behaviour without the reviewers noticing it.
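For illustration, a hedged sketch of the kind of regression test this suggests: read batches for a position-delete table with only `id` projected and assert the column count. The helper below is illustrative, not the test actually added in this PR; how the `CloseableIterable<ColumnarBatch>` is produced (the reader wiring) is assumed and elided.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.apache.iceberg.io.CloseableIterable;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Hedged sketch: `batches` would come from opening the vectorized reader over a
// file scan task that carries position deletes, with only `id` projected.
public class PosColumnAssertions {
  static void assertNoPosVector(CloseableIterable<ColumnarBatch> batches) throws Exception {
    try (CloseableIterable<ColumnarBatch> closing = batches) {
      for (ColumnarBatch batch : closing) {
        // before the change, the batch also carried an Arrow vector for `_pos`,
        // so numCols() was 2 for a `SELECT id` projection; after, it is 1
        assertEquals(1, batch.numCols());
      }
    }
  }
}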
@huaxingao it's a good find. I'm just wondering: where do we add _pos to the schema? Can we just not do it there? Just curious if it's possible.
I think it might be from here: iceberg/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java, lines 260 to 263 (at 684f7a7).
@szehon-ho I think we still need the `_pos` column there.
@pvary Thank you for your suggestion! You're correct that adding such a test would help prevent future changes from inadvertently affecting this behavior without notice. Currently, Spark doesn't check the schema when processing batch data, which is why an extra Arrow vector in the `ColumnarBatch` can go unnoticed.
// vectorization reader. Before removing _pos, we need to make sure _pos is not explicitly
// selected in the query.
if (deleteFilter != null) {
  if (deleteFilter.hasPosDeletes() && expectedSchema().findType("_pos") == null) {
Yes, we do. Changed to `MetadataColumns.ROW_POSITION.name()`.
  fileSchema ->
      VectorizedSparkParquetReaders.buildReader(
-         requiredSchema, fileSchema, idToConstant, deleteFilter))
+         vectorizationSchema, fileSchema, idToConstant, deleteFilter))
Not exactly.
If there are no deletes, it's expectedSchema.
If it's an equality delete, it's deleteFilter.requiredSchema(), because it could be expectedSchema + the equality filter column. For example:
SELECT id FROM table
Suppose the equality delete has data == 'aaa'.
Then we do need to read the data column too, so it's deleteFilter.requiredSchema(), which is id + data.
In the case of pos deletes, I can't use expectedSchema either, because there could be both pos deletes and equality deletes.
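To make the cases above concrete, here is a hedged sketch of the schema-selection logic, mirroring the quoted diff. `hasPosDeletes()`, `requiredSchema()`, and `MetadataColumns.ROW_POSITION` come from the surrounding code; wrapping it as a standalone method is an assumption for illustration.

import java.util.Collections;

import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.DeleteFilter;
import org.apache.iceberg.types.TypeUtil;

// Hedged sketch of choosing the schema passed to the vectorized reader:
// - no deletes: the expected schema (exactly the query projection)
// - deletes present: deleteFilter.requiredSchema() (projection + delete columns),
//   minus _pos when only the delete machinery, not the query, asked for it
static Schema vectorizationSchema(Schema expectedSchema, DeleteFilter<?> deleteFilter) {
  Schema schema = deleteFilter == null ? expectedSchema : deleteFilter.requiredSchema();
  if (deleteFilter != null
      && deleteFilter.hasPosDeletes()
      && expectedSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null) {
    // _pos is not needed to compute the rowIdMapping for position deletes
    schema =
        TypeUtil.selectNot(
            schema, Collections.singleton(MetadataColumns.ROW_POSITION.fieldId()));
  }
  return schema;
}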
Sorry, I still wanted to see if it can be done earlier. What do you think about https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java#L99? This is for the vectorized path, right? Can we pass in a flag to SparkDeleteFilter to not add the column? It seems a bit wasteful to remove the column after adding it, so I just wanted to explore it. Also, the current fix is specific to Parquet; how about ORC?
@szehon-ho Thanks for the comment. We actually also use the requiredSchema, that's the schema with the `_pos` column. We can pass in a flag to SparkDeleteFilter to not add it. ORC uses expectedSchema(), the schema without the _pos column, to build vectorized readers.
Hm, then can we just add _pos to requiredSchema (with a comment) at https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L83? To me, adding it earlier and then removing it later is worse for code understanding than just adding it when needed. Probably cleaner with a flag to ReadConf to trigger the generateOffsetToStartPos, but not sure if it's feasible. fyi @aokolnychyi
flyrain left a comment
Thanks @huaxingao for working on it. Sorry for the delay. Left some comments.
SparkDeleteFilter sparkDeleteFilter =
    new SparkDeleteFilter(filePath, task.deletes(), counter(), true);

SparkDeleteFilter deleteFilter = task.deletes().isEmpty() ? null : sparkDeleteFilter;
We don't need a delete filter if task.deletes().isEmpty(), but the new code always creates a filter object. How about this?

SparkDeleteFilter deleteFilter = null;
if (!task.deletes().isEmpty()) {
  deleteFilter = new SparkDeleteFilter(filePath, task.deletes(), counter(), true);
}
  Schema requestedSchema,
- DeleteCounter counter) {
+ DeleteCounter counter,
+ boolean isBatchReading) {
Does batch reading necessarily mean no _pos? I believe not, as you can explicitly project it, like SELECT _pos FROM t. We should give it an accurate name, something like needRowPosition or needRowPosCol.
Thinking about it a bit more, I'm considering whether only the non-vectorized readers actually need an implicit _pos column. If that's the case, would it make more sense to adjust this within RowDataReader by adding _pos there? This approach could simplify things by eliminating the need to check whether a reader is vectorized, especially since vectorization isn't necessarily strongly correlated with the requirement for _pos.
Here is pseudo code to add it in class RowDataReader:
LOG.debug("Opening data file {}", filePath);
expectedSchema().add("_pos");  // <-- add _pos here
SparkDeleteFilter deleteFilter =
    new SparkDeleteFilter(filePath, task.deletes(), counter(), false);
  this.eqDeletes = eqDeleteBuilder.build();
- this.requiredSchema = fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes);
+ this.requiredSchema =
+     fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes, isBatchReading);
One question I asked myself is whether it impacts the metadata column read. It seems not, but the method DeleteFilter::fileProjection is a bit hard to read; we can refactor it later. It makes more sense to make it a static utility method instead of an instance method. Plus, it's a bit weird to pass the schema to the delete filter and then get it back from the filter. This seems like something we can improve as a follow-up.
- private Map<Long, Long> generateOffsetToStartPos(Schema schema) {
-   if (schema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null) {
+ private Map<Long, Long> generateOffsetToStartPos() {
+   if (hasPositionDelete) {
I'd recommend doing it in the caller, like this:

Map<Long, Long> offsetToStartPos = hasPositionDelete ? generateOffsetToStartPos() : null;
  .filter(residual)
  .caseSensitive(caseSensitive())
  .withNameMapping(nameMapping())
+ .hasPositionDelete(readSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null)
Do we need it for the row reader? Can we use a default value here?
  // read performance as every batch read doesn't have to pay the cost of allocating memory.
  .reuseContainers()
  .withNameMapping(nameMapping())
+ .hasPositionDelete(hasPositionDelete)
Wondering if withPositionDelete is a more suitable name.
The key here is determining whether we want to compute row offsets specifically for the filtered row groups. This decision doesn’t have to be directly tied to the presence of position deletes. Perhaps a name like needRowGroupOffset would better capture this intent and improve clarity.
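For context, a hedged sketch of what such an offset map contains, using the parquet-hadoop BlockMetaData API; the actual wiring inside ReadConf may differ, and the standalone method form is an assumption for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.parquet.hadoop.metadata.BlockMetaData;

// Hedged sketch: map each row group's starting file offset to the absolute
// position of its first row, so position deletes can be matched against rows
// without reading a _pos column.
static Map<Long, Long> offsetToStartPos(List<BlockMetaData> rowGroups) {
  Map<Long, Long> result = new HashMap<>();
  long firstRowPos = 0;
  for (BlockMetaData rowGroup : rowGroups) {
    result.put(rowGroup.getStartingPos(), firstRowPos);
    firstRowPos += rowGroup.getRowCount();
  }
  return result;
}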
Thanks a lot @flyrain and @szehon-ho for your review! I've thought this over: I feel the original change is much simpler. If the original one looks OK to you, I will revert to it. Thanks a lot!
  Set<Integer> requiredIds = Sets.newLinkedHashSet();
- if (!posDeletes.isEmpty()) {
+ if (!posDeletes.isEmpty() && needRowPosCol) {
Nit: switch the two conditions to make it more readable. Not a blocker though.

if (needRowPosCol && !posDeletes.isEmpty()) {}
Merged, thanks @huaxingao, and also @flyrain for review, and @pvary @dramaticlly @viirya for other reviews.
Thanks a lot! @flyrain @szehon-ho @viirya @dramaticlly @pvary |
In Spark batch reading, Iceberg reads additional columns when there are delete files. For instance, if we have a table `test (int id, string data)` and a query `SELECT id FROM test`, the requested schema only contains the column `id`. However, to determine which rows are deleted (there is a `rowIdMapping` for this purpose), Iceberg appends `_pos` to the requested schema for position deletes, and appends the equality filter column for equality deletes (suppose the equality delete is on column `data`). As a result, Iceberg builds `ColumnarBatchReader`s for these extra columns. In the case of position deletes, we actually don't need to read `_pos` to compute the `rowIdMapping`, so this PR excludes the `_pos` column when building the `ColumnarBatchReader`. For equality deletes, while we need to read the equality filter column to compute the `rowIdMapping`, once we have the `rowIdMapping`, we should exclude the values of these extra columns from the `ColumnarBatch`. I will have a separate PR to fix equality deletes.

In summary:

For position deletes, the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `_pos`. This PR makes Iceberg not read the `_pos` column, so the returned `ColumnarBatch` contains an Arrow vector for `id` only.

For equality deletes (suppose the filter is on the `data` column), the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `data`. The goal is to return a `ColumnarBatch` that contains an Arrow vector for `id` only. I will have a separate PR for this.
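To illustrate why `_pos` need not be read for position deletes: within a file, a row's position is simply the batch's starting row position plus its index in the batch, so the mapping of live rows can be computed without materializing a `_pos` vector. A hedged sketch follows; PositionDeleteIndex is Iceberg's set-of-deleted-positions abstraction, but this helper is illustrative and not the code in this PR.

import java.util.Arrays;

import org.apache.iceberg.deletes.PositionDeleteIndex;

// Hedged sketch: build a rowIdMapping for one batch using implicit positions.
// `batchStartPos` is the file position of the batch's first row, known from
// row-group metadata rather than from a _pos column.
static int[] buildRowIdMapping(PositionDeleteIndex deleted, long batchStartPos, int numRows) {
  int[] rowIdMapping = new int[numRows];
  int live = 0;
  for (int i = 0; i < numRows; i++) {
    if (!deleted.isDeleted(batchStartPos + i)) {
      rowIdMapping[live++] = i; // keep row i; its position was never read from disk
    }
  }
  return Arrays.copyOf(rowIdMapping, live);
}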