Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables #1407
Conversation
@massdosage, @rdblue: You might be interested in reviewing this.
Here comes the new implementation with the mapred.OutputCommitter. What I have learned:
The basic idea behind the patch is:
What I do not like:
What I have found:
Thanks,
Can we find out in job commit how many writer tasks there were? Then we could use well-known locations and make sure each one is read.
I suspect that the JobContext contains only input-side information about the number of mappers/reducers. I have only debugged the LocalJobRunner code for now, but I did not see anything that would indicate we have up-to-date information there. Updated the PR to commit the task only in IcebergOutputCommitter.commitTask, and not in IcebergRecordWriter.close.
As @steveloughran is our resident OutputCommitter expert, I feel better about the currently proposed solution now. @rdblue, @massdosage: What are your thoughts on the currently proposed solution? Code styling and formatting questions are also fair game, since I am still learning the style while working on Iceberg. Could you please take a look at #1430 too, as it would help us get rid of the SerializableMetrics class. Thanks,
Good to see your name pop up here, @steveloughran!
```java
// Appending data files to the table
AppendFiles append = table.newAppend();
```
How do we know that the write is always an append? Does Hive pass the operation so we can determine whether to use an overwrite or replacePartitions operation?
Integrating with Hive update/delete statements would be a harder nut to crack.
Currently we are aiming only for inserts.
Does this still mean we have to think about replacePartitions or overwrite? Or was my assumption correct that we only need to use newAppend in this case?
The current patch only aims to insert new data into the table (no delete/update etc. at the moment).
@rdblue: Is it enough to add data with newAppend, or do we have to think about overwrite/replacePartitions too?
Thanks,
Peter
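For reference, a minimal sketch of the append-only path being discussed here, using the public Iceberg API. The `table` and `closedDataFiles` names are assumptions for illustration, not the PR's actual fields:

```java
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

// Sketch only: commit a set of already-written data files as a single append.
// `table` and `closedDataFiles` are assumed inputs, not code from this PR.
class AppendSketch {
  static void appendFiles(Table table, Iterable<DataFile> closedDataFiles) {
    AppendFiles append = table.newAppend();
    closedDataFiles.forEach(append::appendFile);
    // Commits atomically, producing a new snapshot that only adds files
    // (no overwrite/replacePartitions semantics).
    append.commit();
  }
}
```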
```java
public boolean needsTaskCommit(TaskAttemptContext context) {
  // We need to commit if this is the last phase of a MapReduce process
  return TaskType.REDUCE.equals(context.getTaskAttemptID().getTaskID().getTaskType()) ||
      context.getJobConf().getNumReduceTasks() == 0;
}
```
Can't this just be true? I think that every task using the output committer needs to commit.
We can define the OutputCommitter only at job level, and AFAIK Hive writes only in the last stage of the job, be it a Reducer for a MapReduce job or a Map for a map-only job.
This way we can save IO by not generating and reading those unnecessary files, but at the cost of depending on assumptions about how Hive works.
Your thoughts? Are you still voting for the more general solution?
What I mean is that I don't think Hive calls the output committer unless it is writing to the table. The map phase of a write job should call the committer only if it is a map-only job. And the reduce phase will call the committer if the job has a reduce phase. Because the MR framework takes care of calling commitTask for the right phase, we can default this to true and it should be identical.
You can probably test this assumption by changing the logic here to throw an exception if it is violated.
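One possible shape of that check, as a drop-in for the method quoted above (a throwaway diagnostic sketch, not code from the PR): always return true, but fail loudly if a map task of a map-reduce job ever reaches this point.

```java
@Override
public boolean needsTaskCommit(TaskAttemptContext context) {
  boolean isReduceTask =
      TaskType.REDUCE.equals(context.getTaskAttemptID().getTaskID().getTaskType());
  boolean isMapOnlyJob = context.getJobConf().getNumReduceTasks() == 0;
  // Diagnostic only: verify the "only the last stage writes" assumption by
  // failing if a map task of a map-reduce job asks whether it should commit.
  if (!isReduceTask && !isMapOnlyJob) {
    throw new IllegalStateException(
        "needsTaskCommit called for a map task of a map-reduce job: " + context.getTaskAttemptID());
  }
  return true;
}
```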
Some HiveRunner tests that do writes involving an MR job might help too?
I have tested this by adding the jars to Hive master from the Apache repo and running the queries in local mode. I got some exceptions and started to investigate the source.
The actual query I used for this had an order by, so it had both a Mapper and a Reducer phase:
insert into purchases select * from purchases order by id;
My debugging revealed that commitTask was called for Mappers and Reducers as well. The JobRunner infrastructure calls those and does not have any information about whether the task has actually written anything anywhere or not. I think this is the purpose of the needsTaskCommit() method, so the developer of the OutputCommitter can decide.
@massdosage: Do you have an easy example for HiveRunner write tests?
Thanks,
Peter
Thanks for testing it out. If Hive does call this even in map stages, then I agree that we should keep the logic that you have here.
@pvary I guess it would look very similar to one of the existing HiveRunner tests here, but one could do a hiveShell.execute() with an insert, update, etc. statement and then do a query to select the results back?
I'm not sure whether these examples (https://github.com/klarna/HiveRunner/blob/master/src/test/java/com/klarna/hiverunner/InsertIntoTableIntegrationTest.java and https://github.com/klarna/HiveRunner/blob/master/src/test/java/com/klarna/hiverunner/data/InsertIntoTableTest.java) from HiveRunner itself would trigger the OutputFormat for the insert operations. Presumably so, if the table under test is created using the StorageHandler?
I'd also highly recommend testing this in distributed mode; we found lots of issues with the InputFormat that only manifested themselves there. I can help with this if you want, as we already have a test environment set up for this.
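For illustration, a HiveRunner write test along these lines might look roughly like this. The DDL, the storage handler usage, and the class/table names are assumptions for the sketch, not the TestHiveRunnerWrite.java that was added later:

```java
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;

// Rough sketch of a HiveRunner insert-and-read-back test; the table DDL and
// storage handler configuration are assumed for illustration only.
@RunWith(StandaloneHiveRunner.class)
public class HiveIcebergWriteSketchTest {

  @HiveSQL(files = {})
  private HiveShell shell;

  @Test
  public void insertAndReadBack() {
    shell.execute("CREATE TABLE target (id int, name string) " +
        "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'");
    shell.execute("INSERT INTO target VALUES (1, 'a'), (2, 'b')");
    List<String> rows = shell.executeQuery("SELECT id, name FROM target ORDER BY id");
    Assert.assertEquals(2, rows.size());
  }
}
```

In practice the table would also need catalog/location properties so the storage handler can load the Iceberg table; those are omitted from this sketch.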
@massdosage: I created HiveRunner tests. Could you please take a look at them? (TestHiveRunnerWrite.java)
Trying it out on a real cluster is on my TODO list and I will definitely do it, but if you have a test cluster and some ideas it would be good if you took a stab at it too (in Hungary we say that "more eyes see more"). I tried with Hive on LLAP, but the OutputCommitter lifecycle is not yet handled correctly there, so we have to try the MR way.
@massdosage, @rdblue: I tested this out on an MR cluster by running the following queries successfully:
insert into test_table values(3,"3"),(4,"4");
insert into test_table select * from test_table;
insert into test_table select * from test_table order by id;
insert into test_table select * from test_table_2;
These included both map-only and map-reduce jobs. The results were checked with Hive select queries and everything seems OK.
So if Adrian's team does not find any issues, I would think that this PR is ready to be pushed.
Your thoughts?
Thanks, Peter
The current status of the PR:
Open questions:
Postponed:
Non-goals:
Can we push this change, @massdosage, @rdblue? I would very much like to see it in the next release, as it would make our testing much easier. All that said, we can keep working on our branch, so I do not want to push too hard 😸
@rdblue: If you have time, could you please review this one?
@marton-bod: You might be interested in this too, as I have changed some Hive3-related code as well.
@pvary, how about we target this for the 0.11.0 release? Right now, we're trying to get 0.10.0 finished and that's taking attention. After that, I should have time to start reviewing this. Does that work for you?
OK with that, but I would definitely love to see this in 0.11.0 😃
steveloughran left a comment
I've been dealing with changes related to job uniqueness (HADOOP-17318, SPARK-33402), so I worry about making sure you are getting unique values and paths everywhere.
Regarding the committer: why the "v1" API and not the later org.apache.hadoop.mapreduce package? I'd recommend implementing the committer on top of those classes...but then I'm not up to date with how Hive will use the committer.
What I will suggest: have every commit operation log its invocation and duration, along with the job id. This is what you need when trying to debug issues.
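A minimal sketch of that logging suggestion (SLF4J is assumed as the logger; this is not code from the PR):

```java
import java.io.IOException;
import org.apache.hadoop.mapred.JobContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: wrap the job commit with invocation and duration logging keyed by job id.
class CommitLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(CommitLoggingSketch.class);

  void commitJob(JobContext context) throws IOException {
    String jobId = context.getJobID().toString();
    LOG.info("Committing job {}", jobId);
    long start = System.currentTimeMillis();
    try {
      // ... the actual Iceberg append/commit would happen here ...
    } finally {
      LOG.info("Commit of job {} took {} ms", jobId, System.currentTimeMillis() - start);
    }
  }
}
```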
```java
/**
 * An Iceberg table committer for adding data files to the Iceberg tables.
 * Currently independent of the Hive ACID transactions.
 */
public final class HiveIcebergOutputCommitter extends OutputCommitter {
```
Why this and not the (newer) org.apache.hadoop.mapreduce.OutputCommitter?
First I tried to implement the v2 (mapreduce) API and was surprised that my OutputCommitter methods were not called. I had to realize that Hive still uses the old v1 API. When I downgraded to v1, the methods were called.
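For context, this is the shape of the old v1 (org.apache.hadoop.mapred) committer contract that Hive drives; a bare skeleton with the Iceberg-specific bodies omitted:

```java
import java.io.IOException;
import org.apache.hadoop.mapred.JobContext;
import org.apache.hadoop.mapred.OutputCommitter;
import org.apache.hadoop.mapred.TaskAttemptContext;

// Bare skeleton of the v1 (mapred) OutputCommitter API; method bodies omitted on purpose.
public class V1CommitterSkeleton extends OutputCommitter {
  @Override public void setupJob(JobContext context) throws IOException { }
  @Override public void setupTask(TaskAttemptContext context) throws IOException { }
  @Override public boolean needsTaskCommit(TaskAttemptContext context) throws IOException { return true; }
  @Override public void commitTask(TaskAttemptContext context) throws IOException { }
  @Override public void abortTask(TaskAttemptContext context) throws IOException { }
  @Override public void commitJob(JobContext context) throws IOException { }
  @Override public void abortJob(JobContext context, int runState) throws IOException { }
}
```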
```java
@Override
public void commitTask(TaskAttemptContext context) throws IOException {
  TaskAttemptID attemptID = context.getTaskAttemptID();
  // ...
```
I've been having fun in Spark and Hadoop making sure this is unique. What runtime are you targeting here, and where does it get job/task IDs?
The jobId is generated by Hadoop and I do not think we add any magic in Hive:
```java
JobClient jc = new JobClient(job);
rj = jc.submitJob(job);
this.jobID = rj.getJobID();
```
Is the problem mentioned in the JIRAs above (HADOOP-17318, SPARK-33402) caused by the Spark-specific jobId generation, or would it appear when we are using standard Hadoop JobIDs?
```java
// If the data is not empty add to the table
if (!closedFiles.isEmpty()) {
  closedFiles.forEach(file -> {
    DataFiles.Builder builder = DataFiles.builder(table.spec())
    // ...
```
In #1818, we're refactoring a little so that the writer produces a DataFile for you. Hopefully that will make this a bit smaller.
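For reference, the manual DataFile construction being discussed looks roughly like this; the path, size, format and metrics values below are placeholders, not the PR's actual fields:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.Table;

// Sketch only: build a DataFile by hand from a closed writer's output.
// For an unpartitioned table no partition data needs to be supplied.
class DataFileSketch {
  static DataFile toDataFile(Table table, String path, long sizeInBytes, Metrics metrics) {
    return DataFiles.builder(table.spec())
        .withPath(path)
        .withFormat(FileFormat.PARQUET)   // assumed format for the sketch
        .withFileSizeInBytes(sizeInBytes)
        .withMetrics(metrics)
        .build();
  }
}
```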
```java
private DeserializerHelper() {
}

static Record deserialize(Object data, Schema tableSchema, ObjectInspector objectInspector) throws SerDeException {
  // ...
```
Is it possible to use the approach we've taken with the other formats and build a tree based on the schema that Hive is writing to do the inspection?
For Spark, we generate this in advance. If the user writes a tuple of (string, short) to a table, we generate a struct reader that has two field readers, one to read a short and write an int, and one to read a UTF8String and write it as a CharSequence. That avoids inspecting the schema structure to find out what to do each time.
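As an illustration of the "build the readers once" idea (a sketch under assumptions, not the approach the PR takes): derive one converter per top-level field from the ObjectInspector tree up front and reuse it for every row. Only flat, primitive-typed schemas are handled here; nested types and exact Hive-to-Iceberg type coercions would need recursive converters, as described above for Spark.

```java
import java.util.List;
import java.util.function.Function;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.Record;

// Sketch only: precompute per-field converters from the ObjectInspector tree
// so the schema is not re-inspected for every row. Flat primitive schemas only.
class StructDeserializerSketch {
  private final Schema tableSchema;
  private final StructObjectInspector structInspector;
  private final List<? extends StructField> fields;
  private final Function<Object, Object>[] converters;

  @SuppressWarnings("unchecked")
  StructDeserializerSketch(Schema tableSchema, StructObjectInspector structInspector) {
    this.tableSchema = tableSchema;
    this.structInspector = structInspector;
    this.fields = structInspector.getAllStructFieldRefs();
    this.converters = new Function[fields.size()];
    for (int i = 0; i < fields.size(); i++) {
      PrimitiveObjectInspector inspector =
          (PrimitiveObjectInspector) fields.get(i).getFieldObjectInspector();
      // Note: mapping Hive primitive Java objects to the exact Iceberg
      // representation (e.g. timestamps) is glossed over in this sketch.
      converters[i] = value -> value == null ? null : inspector.getPrimitiveJavaObject(value);
    }
  }

  Record deserialize(Object data) {
    Record record = GenericRecord.create(tableSchema);
    for (int i = 0; i < fields.size(); i++) {
      record.set(i, converters[i].apply(structInspector.getStructFieldData(data, fields.get(i))));
    }
    return record;
  }
}
```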
```java
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

class DeserializerHelper {
```
This seems like a good class to separate into its own PR with tests.
```java
/**
 * An Iceberg table committer for adding data files to the Iceberg tables.
 * Currently independent of the Hive ACID transactions.
 */
public final class HiveIcebergOutputCommitter extends OutputCommitter {
```
Can we separate this class into its own PR with unit tests? I think that would help break this down into manageable chunks.
This was committed in separate PRs, so I'll close it. Thanks @pvary!
The goal of the patch is to have a PoC implementation of Hive writes to Iceberg tables.
The patch changes:
The tests ran successfully. Also, after applying this change on top of the uber jar patch, I was able to create a Hive table on top of an unpartitioned Iceberg table and write data to it and read it back. The table was created with the following command:
Findings:
Missing stuff:
Any suggestions / ideas / thoughts are welcome.
Thanks,
Peter