replace SparkDataFile with DataFile #786
Conversation
Four review comments on spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala (outdated, resolved).
@chenjunjiedada, great work! I did a quick look and had only minor comments.

@aokolnychyi, thanks for the review; just updated.

The Python build failed. @aokolnychyi, could you please help trigger CI?

@aokolnychyi, I'll take a look and let you know. Thanks!
    null,
    null,
    null)
val metrics = new Metrics(-1L, arrayToMap(null), arrayToMap(null), arrayToMap(null))
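For context, the `arrayToMap(null)` calls above suggest a null-safe conversion helper, since the `Metrics` constructor takes `java.util.Map` arguments for the per-column stats. The following is only a hypothetical sketch of such a helper, not the actual SparkTableUtil implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a null stats array becomes an empty map rather
// than a null reference, so Metrics never receives null map arguments.
class MetricsHelpers {
    static Map<Integer, Long> arrayToMap(long[][] pairs) {
        Map<Integer, Long> map = new HashMap<>();
        if (pairs != null) {
            for (long[] pair : pairs) {
                map.put((int) pair[0], pair[1]);  // pair = {column id, value}
            }
        }
        return map;
    }
}
```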
Shouldn't rowCount be a positive number?
I think the metric is not intended to be used, so it is set to an invalid value. We might need to read through the whole file to get the row count, right?
Let's try to keep the logic as close to what we had before as possible.
I think anything positive should do. Keeping it <= 0 may cause some scan-planning code to filter out this particular file; see, for example, org.apache.iceberg.expressions.InclusiveMetricsEvaluator.
@aokolnychyi, thoughts?
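To make the concern concrete, here is a simplified, hypothetical stand-in for that row-count check (the real InclusiveMetricsEvaluator is considerably more involved): if a planner keeps a file only when its record count is strictly positive, a file imported with the -1 placeholder is pruned as if it were empty.

```java
// Hypothetical simplification of the scan-planning row-count check:
// a file is kept only when recordCount > 0, so a file imported with
// recordCount = -1 is dropped even though it may contain data.
class ScanPruning {
    static boolean mightContainRows(long recordCount) {
        return recordCount > 0;
    }
}
```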
Good catch, @rdsr. This is definitely a problem. Right now, the InclusiveMetricsEvaluator will remove files with negative or 0 row counts.
I don't think that the solution is to use a positive number here. The reason why this was required is that we want good stats for job planning. Setting this to -1 causes a correctness bug, but setting it to some other constant will introduce bad behavior when using the stats that are provided by Iceberg. I think we should either count the number of records, use a heuristic (file size / est. row size?), or remove support for importing Avro tables. I'm leaning toward counting the number of records.
We should also change the check in InclusiveMetricsEvaluator to check for files with 0 rows and allow files with -1 rows through to fix the correctness bug for existing tables that used this path to import Avro data.
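The adjusted check being proposed could be sketched as follows; this is a hypothetical illustration, not the actual Iceberg code. Only an explicit zero proves the file is empty, while a negative count means the stat is unknown and the file must be kept.

```java
// Hypothetical adjusted check: 0 rows => provably empty, prune it;
// a negative count (e.g. -1 from imported Avro tables) => stats are
// unknown, so the file must pass through to the scan.
class ScanPruningFixed {
    static boolean mightContainRows(long recordCount) {
        return recordCount != 0;
    }
}
```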
I think the number of records must be correct and precise as we want to answer some data queries with metadata (e.g. give me the number of records per partition). Updating our metrics evaluators to handle -1 seems reasonable to me as well.
@rdsr, could you create follow-up issues so that we don't forget?
+1 I'll do that!
Thanks, everyone, for the detailed explanation!
Created #809 to track this.
I won't be affected by this change; we built this into our Spark version as
A further review comment on spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala (outdated, resolved).
aokolnychyi left a comment:
LGTM. I had only one super minor comment. Will test it tomorrow and merge if there are no objections.
Will take a look today.
Confirmed that this change doesn't affect us.
I am going to merge this one. Thank you, everyone, for the review, and @chenjunjiedada for the work!
This fixes #763