Conversation

@pvary (Contributor) commented Dec 2, 2020

Separated out the HiveIcebergOutputCommitter-related changes from #1407 to make them easier to review. HiveIcebergRecordWriter is also included since the two are tightly coupled.

@pvary (Contributor, Author) commented Dec 2, 2020

@marton-bod: Could you please review?
Thanks,
Peter

@github-actions github-actions bot added the MR label Dec 2, 2020
@marton-bod (Collaborator) left a comment

Thanks @pvary! Looks great, just a few questions.


// Reading the committed files. The assumption here is that the taskIds are generated in sequential order
// starting from 0.
Tasks.range(expectedFiles)
Review comment (Contributor):
I also like to explicitly set the failure behavior. This should probably use throwFailureWhenFinished() and stopOnFailure(), because this can't continue if any task fails.

Not using stopOnFailure is for tasks like cleaning up files, where each task should at least be attempted.

@pvary (Author) replied:

Added throwFailureWhenFinished, but did not use stopOnFailure. This way I was able to reuse the code for both abort and commit. Since this only affects exception handling, I think this is an acceptable compromise.
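To illustrate the semantics agreed on above: with throwFailureWhenFinished but without stopOnFailure, every task is still attempted, and the first failure is rethrown only after all tasks have run. The following is a minimal, self-contained sketch of that behavior in plain Java; the BestEffortRunner class is hypothetical and only models what Iceberg's Tasks utility does, it is not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical, simplified model of Tasks.range(...).throwFailureWhenFinished()
// without stopOnFailure(): all tasks are attempted, the first failure is
// remembered and rethrown once every task has finished.
public class BestEffortRunner {
  public static void runAll(int numTasks, IntConsumer task) {
    RuntimeException firstFailure = null;
    for (int taskId = 0; taskId < numTasks; taskId++) {
      try {
        task.accept(taskId);
      } catch (RuntimeException e) {
        if (firstFailure == null) {
          firstFailure = e; // remember the first failure, keep going
        }
      }
    }
    if (firstFailure != null) {
      throw firstFailure; // surface the failure when finished
    }
  }

  public static void main(String[] args) {
    List<Integer> attempted = new ArrayList<>();
    try {
      runAll(3, taskId -> {
        attempted.add(taskId);
        if (taskId == 1) {
          throw new RuntimeException("task " + taskId + " failed");
        }
      });
    } catch (RuntimeException e) {
      System.out.println(attempted.size() + " tasks attempted; " + e.getMessage());
    }
  }
}
```

This is why the same loop can serve both commit (where the rethrown failure aborts the job) and cleanup (where every task still gets its attempt).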


if (dataFiles.size() > 0) {
// Appending data files to the table
AppendFiles append = table.newAppend();
Review comment (Contributor):

Not a blocker since this is scoped to appends only, but can we at least detect when the user called INSERT OVERWRITE?

@pvary (Author) replied:

I would prefer to do this in another PR. I do not see how specific write types could be handled with SerDes. It might be trivial, but since I have too many things in progress, I'd rather not delve into a new issue before closing out some already-open tasks.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class HiveIcebergRecordWriter extends org.apache.hadoop.mapreduce.RecordWriter<NullWritable, Container>
Review comment (Contributor):

Minor: Since this lives in Iceberg, we can probably remove Iceberg from the class names.

@pvary (Author) replied:

I prefer to keep it in the class names. When debugging code in Hive, it can be very helpful to see at a glance that we are looking at an Iceberg class, without having to examine the package first.

Path toDelete = new Path(file);
FileSystem fs = Util.getFs(toDelete, jobContext.getJobConf());
try {
fs.delete(toDelete, true /* recursive */);
Review comment (Contributor):

This won't fail if the path doesn't exist. But if you delete just the file directly, you'd want to check that it exists first, since some tasks may not have finished and committed, and may have aborted instead.

@pvary (Author) replied:

If there are runaway tasks, they might write into the temp directory even after the job is finished. So I feel it is ok to clean this up as a best effort. Do you agree?
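The "best effort" semantics of fs.delete(toDelete, true /* recursive */) can be sketched in plain Java. The snippet below uses java.nio.file instead of Hadoop's FileSystem API so it is self-contained; the BestEffortCleanup class name is illustrative, not from the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BestEffortCleanup {
  // Hypothetical stand-in for Hadoop's fs.delete(path, true /* recursive */).
  // A missing path is not an error: cleanup is best effort, since runaway
  // tasks may write into the temp directory even after the job finishes.
  public static boolean deleteRecursively(Path path) throws IOException {
    if (!Files.exists(path)) {
      return false; // nothing to delete, not a failure
    }
    try (Stream<Path> walk = Files.walk(path)) {
      // sort deepest-first so children are removed before their parents
      List<Path> toDelete =
          walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      for (Path p : toDelete) {
        Files.delete(p);
      }
    }
    return true;
  }
}
```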

@github-actions github-actions bot added the core label Dec 7, 2020
@pvary (Contributor, Author) commented Dec 7, 2020

Big thanks @rdblue for reviewing the PR even on a Saturday! Really appreciate it!

Your comments were really useful, as usual. I have updated the PR, so if you have time again I would love to hear your thoughts.

Thanks,
Peter

@rdblue (Contributor) commented Dec 7, 2020

Looks like the last update accidentally included a few .crc files that are temp data from tests. Could you clean up the additions?

@pvary (Contributor, Author) commented Dec 7, 2020

> Looks like the last update accidentally included a few .crc files that are temp data from tests. Could you clean up the additions?

Of course. I have removed the extra files.

@rdblue (Contributor) commented Dec 7, 2020

@pvary, it looks like this now includes commits you didn't intend to add. Can you take a look at your branch? I think it's about ready to merge (we can fix the remaining issues later).

@pvary (Contributor, Author) commented Dec 10, 2020

@rdblue: Cleaned up the commits, and rebased the patch. If you have time could you please check out this PR?
Thanks,
Peter

.onFailure((file, exc) -> LOG.debug("Failed to remove directory {} on cleanup job", file, exc))
.run(file -> {
Path toDelete = new Path(file);
FileSystem fs = Util.getFs(toDelete, jobContext.getJobConf());
Review comment (Contributor):

Nit: I would prefer cleanup to happen by removing the expected task commit files one by one rather than deleting a directory, because then it could use FileIO. I understand that this is intended to drop the folder as well, for stores that track folders. Maybe a follow-up to add a deletePrefix to FileIO would fix it.
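The per-file alternative suggested above could look roughly like this. The sketch uses java.nio.file so it is self-contained, and the file naming scheme (task-<id>.forCommit) plus the PerFileCleanup class are illustrative assumptions, not Iceberg's actual layout or API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PerFileCleanup {
  // Hypothetical sketch: instead of a recursive directory delete, remove
  // each expected task commit file one by one, the way a FileIO-based
  // cleanup could. Missing files are skipped, keeping the best-effort
  // semantics of the directory-based approach.
  public static void cleanTaskFiles(Path commitDir, int expectedTasks) throws IOException {
    for (int taskId = 0; taskId < expectedTasks; taskId++) {
      Path taskFile = commitDir.resolve("task-" + taskId + ".forCommit");
      Files.deleteIfExists(taskFile);
    }
  }
}
```

The trade-off, as the comment notes, is that this leaves the directory itself behind on stores that track folders, which is why a deletePrefix on FileIO would be the cleaner long-term fix.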

@rdblue rdblue merged commit 05aa901 into apache:master Dec 10, 2020
@rdblue (Contributor) commented Dec 10, 2020

@pvary, looks great! I merged this. Thanks for separating these into smaller PRs, it really helps keep the reviews manageable.

@pvary (Contributor, Author) commented Dec 11, 2020

Thanks for the review and the merge @rdblue! I have learned a lot during the review!

@pvary pvary deleted the outputcommitter branch December 11, 2020 13:14