Conversation

@chenjunjiedada
Collaborator

This fixes #763

@aokolnychyi
Contributor

@chenjunjiedada, great work! I took a quick look and had only minor comments.

@chenjunjiedada
Collaborator Author

@aokolnychyi, thanks for the review; just updated.

@chenjunjiedada
Collaborator Author

The Python build failed. @aokolnychyi, could you please help trigger CI?

@aokolnychyi
Contributor

I have only one remaining comment and I think it should be good to go.

@rdsr, @prodeezy, @rdblue will you be affected?

@rdsr
Contributor

rdsr commented Feb 14, 2020

@aokolnychyi, I'll take a look, and let you know. Thanks!

val metrics = new Metrics(-1L, arrayToMap(null), arrayToMap(null), arrayToMap(null))
Contributor

@rdsr rdsr Feb 16, 2020


Shouldn't rowCount be a positive number?

Collaborator Author


I think this metric is not intended to be used, so it is set to an invalid value. We would need to read through the whole file to get the row count, right?

Contributor


Let's try to keep the logic as close to what we had before as possible.

Contributor

@rdsr rdsr Feb 17, 2020


I think any positive value should do. Keeping it <= 0 may cause some scan-planning code to filter out this particular file; see e.g. org.apache.iceberg.expressions.InclusiveMetricsEvaluator.
@aokolnychyi, thoughts?
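To make the concern concrete, here is a minimal sketch (hypothetical names, not the actual Iceberg evaluator code) of how a `rowCount <= 0` short-circuit in scan planning would silently drop a file imported with the `-1L` sentinel:

```java
// Hypothetical simplification of a scan-planning row-count check.
// In Iceberg the real logic lives in InclusiveMetricsEvaluator; the
// class and method names here are illustrative only.
public class RowCountFilterSketch {
    // A file is kept for scanning only if its recorded row count is positive.
    static boolean mightContainRows(long recordedRowCount) {
        return recordedRowCount > 0;
    }

    public static void main(String[] args) {
        // A file imported with the -1L sentinel is silently filtered out,
        // even though it may contain data: a correctness bug.
        System.out.println(mightContainRows(-1L));   // false
        // A file with real stats behaves as expected.
        System.out.println(mightContainRows(1000L)); // true
    }
}
```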

Contributor


Good catch, @rdsr. This is definitely a problem. Right now, the InclusiveMetricsEvaluator will remove files with negative or 0 row counts.

I don't think that the solution is to use a positive number here. The reason why this was required is that we want good stats for job planning. Setting this to -1 causes a correctness bug, but setting it to some other constant will introduce bad behavior when using the stats that are provided by Iceberg. I think we should either count the number of records, use a heuristic (file size / est. row size?), or remove support for importing Avro tables. I'm leaning toward counting the number of records.

We should also change the check in InclusiveMetricsEvaluator to check for files with 0 rows and allow files with -1 rows through to fix the correctness bug for existing tables that used this path to import Avro data.
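The second part of the proposal could look roughly like this (a sketch under the assumption that -1 is reserved as an "unknown count" sentinel; names are hypothetical, not the actual InclusiveMetricsEvaluator code):

```java
// Hypothetical sketch of the adjusted check: only an exact count of 0
// proves a file is empty; a negative count means "unknown", so the
// file must be kept and scanned.
public class AdjustedRowCountCheckSketch {
    static boolean mightContainRows(long recordedRowCount) {
        return recordedRowCount != 0;
    }

    public static void main(String[] args) {
        System.out.println(mightContainRows(-1L)); // true: unknown, keep the file
        System.out.println(mightContainRows(0L));  // false: provably empty
    }
}
```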

Contributor


I think the number of records must be correct and precise as we want to answer some data queries with metadata (e.g. give me the number of records per partition). Updating our metrics evaluators to handle -1 seems reasonable to me as well.

Contributor


@rdsr, could you create follow-up issues so that we don't forget?

Contributor


+1 I'll do that!

Collaborator Author


Thank you all for the detailed explanation!

Contributor


Created #809 to track this.

@rdblue
Contributor

rdblue commented Feb 16, 2020

I won't be affected by this change; we built this into our Spark version as SNAPSHOT TABLE and MIGRATE TABLE commands, so we have a separate path.

Contributor

@aokolnychyi aokolnychyi left a comment


LGTM. I had only one super minor comment. Will test it tomorrow and merge if there are no objections.

@prodeezy
Contributor

Will take a look today.

@prodeezy
Contributor

Confirmed that this change doesn't affect us.

@aokolnychyi
Contributor

I am going to merge this one. Thank you everyone for the review and @chenjunjiedada for the work!

@aokolnychyi aokolnychyi merged commit 4d96944 into apache:master Feb 19, 2020
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023
* Boson iceberg1.0.x preview integration

* use boson base image in apple-1.0.x-preview-scala-2.13-prb

* update to boson 0.2.16-beta

* update to boson 0.2.16-beta

* nit

* move boson version to versions.props

* go back to parquet 1.12.0.16-apple

* change to 1.12.0.22-apple


Successfully merging this pull request may close these issues.

Consider using GenericDataFile in SparkTableUtil
