
Fix truncateStringMax in UnicodeUtil. #334

Closed
vgankidi wants to merge 1 commit into apache:master from vgankidi:parquetmetricstruncate

Conversation

@vgankidi
Contributor

Index to codePointAt should be the offset calculated by code points.
I incorrectly assumed that the index in codePointAt(index) is counted in code points. It is actually a char (UTF-16 unit) index into the string.
Resolves #328, #329
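To illustrate the distinction (a standalone JDK sketch, not code from this patch): String.codePointAt takes a char index, so addressing the i-th code point requires first converting the offset with offsetByCodePoints.

```java
public class CodePointIndexDemo {
    public static void main(String[] args) {
        // U+10437 is outside the BMP, so it occupies two chars (a surrogate pair).
        String s = "a\uD801\uDC37b"; // code points: 'a', U+10437, 'b'

        System.out.println(s.length());                      // 4 chars,
        System.out.println(s.codePointCount(0, s.length())); // but 3 code points

        // codePointAt(2) does NOT return the third code point; char index 2 is
        // the low surrogate \uDC37, which is returned as-is.
        System.out.println(Integer.toHexString(s.codePointAt(2))); // dc37

        // To address code point #2 ('b'), convert the offset first.
        int charIndex = s.offsetByCodePoints(0, 2); // skips 'a' (1 char) + U+10437 (2 chars)
        System.out.println(Integer.toHexString(s.codePointAt(charIndex))); // 62
    }
}
```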

@rdblue
Contributor

rdblue commented Jul 31, 2019

Thanks for fixing this, @vgankidi!

@ikosyanenko, can you confirm that this fixes the issue as well? The fix in #329 avoids the problem, but we think that the underlying problem was the way the index was calculated.
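The index-calculation fix can be sketched like this (a hypothetical standalone helper, not Iceberg's actual UnicodeUtil code): advance the char index by Character.charCount for each code point consumed, so codePointAt always lands on the start of a code point and never splits a surrogate pair.

```java
public class TruncateSketch {
    // Hypothetical helper (names are illustrative): keep at most `length`
    // code points of the input, never splitting a surrogate pair.
    static String truncate(String input, int length) {
        StringBuilder sb = new StringBuilder();
        int charIndex = 0; // offset in chars, advanced one code point at a time
        for (int i = 0; i < length && charIndex < input.length(); i++) {
            int cp = input.codePointAt(charIndex);
            sb.appendCodePoint(cp);
            charIndex += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a" + U+10437 + "b" truncated to 2 code points keeps the pair intact.
        System.out.println(truncate("a\uD801\uDC37b", 2).equals("a\uD801\uDC37")); // true
    }
}
```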

@vgankidi
Contributor Author

Also, with the fix in #329, the output of truncateStringMax(Literal.of(test7), 2) equals test7_1_expected instead of test7_2_expected. codePointAt gives the expected result if the index points to a high surrogate and the following character is a low surrogate: it computes the code point for the surrogate pair. Otherwise it returns the code point of the character at the given index as-is.
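The codePointAt behavior described above can be checked directly (a standalone JDK sketch, unrelated to the test7 fixtures in this patch):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String pair = "\uD801\uDC37"; // U+10437 encoded as a surrogate pair

        // Index 0 is a high surrogate followed by its matching low surrogate:
        // codePointAt combines them into the full code point.
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 10437

        // Index 1 points at the low surrogate alone: returned as-is.
        System.out.println(Integer.toHexString(pair.codePointAt(1))); // dc37

        // An unpaired high surrogate is also returned as-is.
        System.out.println(Integer.toHexString("\uD801x".codePointAt(0))); // d801
    }
}
```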

@ikosyanenko Can you test your long input in #328 with this fix as well? Thanks!

@rdblue
Contributor

rdblue commented Aug 1, 2019

Looks like this needs to be rebased.

rdblue pushed a commit that referenced this pull request Aug 1, 2019
Fixes #328, fixes #329.

Index to codePointAt should be the offset calculated by code points
@rdblue
Contributor

rdblue commented Aug 1, 2019

I rebased and merged by hand, so I'm closing this. Thanks for fixing it, @vgankidi!

@rdblue rdblue closed this Aug 1, 2019
danielcweeks pushed a commit that referenced this pull request Aug 1, 2019
* Add argument validation to HadoopTables#create (#298)

* Install source JAR when running install target (#310)

* Add projectStrict for Dates and Timestamps (#283)

* Correctly publish artifacts on JitPack (#321)

The Gradle install target produces invalid POM files that are missing
the dependencyManagement section and versions for some dependencies.
Instead, we directly tell JitPack to run the correct Gradle target.

* Add build info to README.md (#304)

* Convert Iceberg time type to Hive string type (#325)

* Add overwrite option to write builders (#318)

* Fix out of order Pig partition fields (#326)

* Add mapping to Iceberg for external name-based schemas (#338)

* Site: Fix broken link to Iceberg API (#333)

* Add forTable method for Avro WriteBuilder (#322)

* Remove multiple literal strings check rule for scala (#335)

* Fix invalid javadoc url in README.md (#336)

* Use UnicodeUtil.truncateString for Truncate transform. (#340)

This truncates by unicode codepoint instead of Java chars.

* Refactor metrics tests for reuse (#331)

* Spark: Add support for write-audit-publish workflows (#342)

* Avoid write failures if metrics mode is invalid (#301)

* Fix truncateStringMax in UnicodeUtil (#334)

Fixes #328, fixes #329.

Index to codePointAt should be the offset calculated by code points

* [Vectorization] Added batch sizing, switched to BufferAllocator, other minor style fixes.
@vgankidi
Contributor Author

vgankidi commented Aug 2, 2019

Thanks @rdblue!

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 22, 2019
Fixes apache#328, fixes apache#329.

Index to codePointAt should be the offset calculated by code points
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Sep 5, 2019
Fixes apache#328, fixes apache#329.

Index to codePointAt should be the offset calculated by code points


Development

Successfully merging this pull request may close these issues.

Getting java.nio.charset.MalformedInputException: Input length = 1 while converting parquets to Iceberg
