replace SparkDataFile with DataFile #786
Conversation
Four review comments on spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala (outdated, resolved).
@chenjunjiedada, great work! I did a quick look and had only minor comments.

@aokolnychyi, thanks for the review; just updated.

The Python build failed. @aokolnychyi, could you please help trigger CI?

@aokolnychyi, I'll take a look and let you know. Thanks!
    null,
    null,
    null)
val metrics = new Metrics(-1L, arrayToMap(null), arrayToMap(null), arrayToMap(null))
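For context, the `arrayToMap(null)` calls above suggest a null-safe conversion helper, since the `Metrics` constructor takes `java.util.Map` arguments for the per-column stats. The following is only a hypothetical sketch of such a helper, not the actual SparkTableUtil implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a null stats array becomes an empty map rather
// than a null reference, so Metrics never receives null map arguments.
class MetricsHelpers {
    static Map<Integer, Long> arrayToMap(long[][] pairs) {
        Map<Integer, Long> map = new HashMap<>();
        if (pairs != null) {
            for (long[] pair : pairs) {
                map.put((int) pair[0], pair[1]);  // pair = {column id, value}
            }
        }
        return map;
    }
}
```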
Shouldn't rowCount be a positive number?
I think the metric is not intended to be used, so it is set to an invalid value. We might need to read through the whole file to get the row count, right?
Let's try to keep the logic as close to what we had before as possible.
I think anything positive should do. Keeping it <= 0 may cause some scan-planning code to filter out this particular file; see, for example, org.apache.iceberg.expressions.InclusiveMetricsEvaluator.
@aokolnychyi, thoughts?
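To make the concern concrete, here is a simplified, hypothetical stand-in for that row-count check (the real InclusiveMetricsEvaluator is considerably more involved): if a planner keeps a file only when its record count is strictly positive, a file imported with the -1 placeholder is pruned as if it were empty.

```java
// Hypothetical simplification of the scan-planning row-count check:
// a file is kept only when recordCount > 0, so a file imported with
// recordCount = -1 is dropped even though it may contain data.
class ScanPruning {
    static boolean mightContainRows(long recordCount) {
        return recordCount > 0;
    }
}
```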
Good catch, @rdsr. This is definitely a problem. Right now, the InclusiveMetricsEvaluator will remove files with negative or 0 row counts.
I don't think that the solution is to use a positive number here. The reason why this was required is that we want good stats for job planning. Setting this to -1 causes a correctness bug, but setting it to some other constant will introduce bad behavior when using the stats that are provided by Iceberg. I think we should either count the number of records, use a heuristic (file size / est. row size?), or remove support for importing Avro tables. I'm leaning toward counting the number of records.
We should also change the check in InclusiveMetricsEvaluator to check for files with 0 rows and allow files with -1 rows through to fix the correctness bug for existing tables that used this path to import Avro data.
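The adjusted check being proposed could be sketched as follows; this is a hypothetical illustration, not the actual Iceberg code. Only an explicit zero proves the file is empty, while a negative count means the stat is unknown and the file must be kept.

```java
// Hypothetical adjusted check: 0 rows => provably empty, prune it;
// a negative count (e.g. -1 from imported Avro tables) => stats are
// unknown, so the file must pass through to the scan.
class ScanPruningFixed {
    static boolean mightContainRows(long recordCount) {
        return recordCount != 0;
    }
}
```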
I think the number of records must be correct and precise as we want to answer some data queries with metadata (e.g. give me the number of records per partition). Updating our metrics evaluators to handle -1 seems reasonable to me as well.
@rdsr, could you create follow-up issues so that we don't forget?
+1 I'll do that!
Thanks, everyone, for the detailed explanation!
Created #809 to track this.
I won't be affected by this change; we built this into our Spark version as
A further review comment on spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala (outdated, resolved).
aokolnychyi left a comment:
LGTM. I had only one super minor comment. Will test it tomorrow and merge if there are no objections.
Will take a look today.
Confirmed that this change doesn't affect us.
I am going to merge this one. Thank you, everyone, for the review, and @chenjunjiedada for the work!
This fixes #763