Conversation

@marton-bod
Collaborator

Implement new FanoutPositionDeleteWriter, similar to the already-existing FanoutDataWriter class.
This new DeleteWriter implementation should provide an alternative to ClusteredPositionDeleteWriter for those users who are not inserting their data clustered by partition spec/partition values.

@aokolnychyi @openinx @pvary - can you please take a look? Thanks!

@jackye1995
Contributor

Overall looks good to me, just some nits in testing.

@marton-bod
Collaborator Author

Thanks for the review @jackye1995 !

@aokolnychyi
Contributor

@marton-bod, what is the use case for this type of writer? The spec requires the position delete records to be sorted by file_path and pos. If we are sorting the records anyway, why not include the spec_id and partition metadata columns as well? Is it because we don't support these metadata columns in some query engines?
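For context, the ordering the spec mandates for position delete files can be illustrated with a plain comparator over (file_path, pos). This is a minimal sketch: `PosDelete` is a hypothetical stand-in record, not Iceberg's actual `PositionDelete<T>` API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionDeleteOrdering {

    // Hypothetical stand-in for a position delete entry (file_path, pos).
    record PosDelete(String filePath, long pos) {}

    // Sort deletes by file_path, then pos, as the spec requires.
    static List<PosDelete> sortForSpec(List<PosDelete> deletes) {
        List<PosDelete> sorted = new ArrayList<>(deletes);
        sorted.sort(Comparator.comparing(PosDelete::filePath)
                              .thenComparingLong(PosDelete::pos));
        return sorted;
    }

    public static void main(String[] args) {
        List<PosDelete> deletes = List.of(
            new PosDelete("s3://warehouse/data/file-b.parquet", 12L),
            new PosDelete("s3://warehouse/data/file-a.parquet", 7L),
            new PosDelete("s3://warehouse/data/file-a.parquet", 3L));

        for (PosDelete d : sortForSpec(deletes)) {
            System.out.println(d.filePath() + " pos=" + d.pos());
        }
    }
}
```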


@Override
protected FileWriter<PositionDelete<T>, DeleteWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
return new RollingPositionDeleteWriter<>(writerFactory, fileFactory, io, targetFileSizeInBytes, spec, partition);
@aokolnychyi
Contributor (Oct 27, 2021)

Do we support rolling ORC writers? Other writers such as ClusteredPositionDeleteWriter have a check here.

Contributor

I think it does not fail right now as we disable ORC tests in TestPartitioningWriters. We can enable them now.

Contributor

Yes, it seems that some of these tests can be re-enabled for ORC now, whereas previously they had an initial Assume to not run against ORC.

As an example:

Assume.assumeFalse("ORC delete files are not supported", fileFormat == FileFormat.ORC);

Collaborator Author

I think since ORC: Add DeleteWriteBuilder for format v2 (#3250) went in, we don't need any special treatment for the ORC format in the DeleteWriters. That means we could remove the Assume.assumeFalse() checks as well.

Contributor

What about enabling them in a separate PR?

@aokolnychyi
Contributor

This looks mostly good to me. We need to fix ORC writers and enable appropriate tests. I'd like to also know a little bit more about the use case.

@marton-bod
Collaborator Author

Thanks for reviewing @aokolnychyi! You're right about the spec mandating that delete entries must be sorted by file_path and file_pos. That's what we are doing on the Hive side as well, but we came across the problem that, since data files can be added via the API with any arbitrary name, an alphabetical sort of the file_paths can still lead to out-of-order partition values.

As for the spec_id and partition columns, to be honest I missed that they had been added to MetadataColumns :) I haven't tried them out yet, but I'm assuming you could include them in the sort too (e.g. sort by spec_id, partition, file_path, file_pos) and have your data perfectly clustered with their help.

I think there might still be some utility in keeping this writer implementation for problems similar to the one described above, but I'll leave it up to you. What do you think?
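The clustering issue described above can be sketched with a comparator that prefixes the sort key with the spec id and partition value: ordering by file_path alone can interleave partitions when file names are arbitrary, while the extended key keeps each partition contiguous. `Keyed` is a made-up type for illustration; the real metadata columns live in Iceberg's MetadataColumns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClusteredDeleteOrdering {

    // Hypothetical row carrying the metadata-column sort key alongside
    // the position delete fields.
    record Keyed(int specId, String partition, String filePath, long pos) {}

    // Sort by (spec_id, partition, file_path, pos) so deletes stay
    // clustered by partition regardless of file naming.
    static List<Keyed> sortClustered(List<Keyed> deletes) {
        List<Keyed> sorted = new ArrayList<>(deletes);
        sorted.sort(Comparator.comparingInt(Keyed::specId)
                              .thenComparing(Keyed::partition)
                              .thenComparing(Keyed::filePath)
                              .thenComparingLong(Keyed::pos));
        return sorted;
    }

    public static void main(String[] args) {
        // Under a pure file_path ordering, "a-arbitrary.parquet" (day=2)
        // would sort before "b-arbitrary.parquet" (day=1); the extended
        // key restores partition order.
        List<Keyed> deletes = List.of(
            new Keyed(0, "day=2021-10-02", "a-arbitrary.parquet", 4L),
            new Keyed(0, "day=2021-10-01", "b-arbitrary.parquet", 1L));

        for (Keyed d : sortClustered(deletes)) {
            System.out.println(d.partition() + " " + d.filePath() + " pos=" + d.pos());
        }
    }
}
```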

@rdblue
Contributor

rdblue commented Oct 31, 2021

@marton-bod, I would rather not include this if we don't know that it is definitely needed. Otherwise it would be easy to start using it even though there are better ways to prepare the data so that it isn't needed. Does that make sense? If there's a way around adding a class, then we should avoid adding it.

@aokolnychyi
Contributor

Fanout writers may be inefficient so I'd rather use _spec_id and _partition metadata columns if possible. I'd be happy to review PRs to make those columns available in query engines other than Spark.

That being said, we should probably still enable ORC tests in TestPartitioningWriters.

@marton-bod
Collaborator Author

@rdblue @aokolnychyi Makes sense! Thanks for reviewing, I'll close this PR.
I'll open a new PR for enabling ORC tests in TestPartitioningWriters.
