-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-35160: [Go] Add arrayApproxEqualString to ignore null chars #35161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,7 @@ package array | |
| import ( | ||
| "fmt" | ||
| "math" | ||
| "strings" | ||
|
|
||
| "github.com/apache/arrow/go/v12/arrow" | ||
| "github.com/apache/arrow/go/v12/arrow/float16" | ||
|
|
@@ -494,13 +495,13 @@ func arrayApproxEqual(left, right arrow.Array, opt equalOption) bool { | |
| return arrayEqualBinary(l, r) | ||
| case *String: | ||
| r := right.(*String) | ||
| return arrayEqualString(l, r) | ||
| return arrayApproxEqualString(l, r) | ||
| case *LargeBinary: | ||
| r := right.(*LargeBinary) | ||
| return arrayEqualLargeBinary(l, r) | ||
| case *LargeString: | ||
| r := right.(*LargeString) | ||
| return arrayEqualLargeString(l, r) | ||
| return arrayApproxEqualLargeString(l, r) | ||
| case *Int8: | ||
| r := right.(*Int8) | ||
| return arrayEqualInt8(l, r) | ||
|
|
@@ -630,6 +631,34 @@ func validityBitmapEqual(left, right arrow.Array) bool { | |
| return true | ||
| } | ||
|
|
||
| func stripNulls(s string) string { | ||
| return strings.ReplaceAll(s, "\x00", "") | ||
| } | ||
|
|
||
|
Comment on lines
+634
to
+637
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the Arrow spec a String array should be all valid utf8. So shouldn't this only be for trimming trailing Nulls? (ie: use To this end, can you have a test with a string that contains a null inside it that isn't at the end and ensure that this approxequal only trims nulls, and doesn't strip them from within the string?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interesting, wasn't aware of the valid utf8 specification that can't contain NULLs in the middle of the string. In that case there is no check in
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So for performance reasons we don't validate the utf8 strings on Append currently and leave it up to the producer to ensure that they are passing valid utf-8 strings when constructing the array (if it's not valid utf-8 they should be using a Binary array). On my list of things to do eventually is a "Validate" method for each array type like the C++ library has. That "Validate" method would do the UTF-8 validity check on the buffer so that a consumer can choose when they take the performance hit for validating the utf-8. |
||
| func arrayApproxEqualString(left, right *String) bool { | ||
| for i := 0; i < left.Len(); i++ { | ||
| if left.IsNull(i) { | ||
| continue | ||
| } | ||
| if stripNulls(left.Value(i)) != stripNulls(right.Value(i)) { | ||
| return false | ||
| } | ||
| } | ||
| return true | ||
| } | ||
|
|
||
| func arrayApproxEqualLargeString(left, right *LargeString) bool { | ||
| for i := 0; i < left.Len(); i++ { | ||
| if left.IsNull(i) { | ||
| continue | ||
| } | ||
| if stripNulls(left.Value(i)) != stripNulls(right.Value(i)) { | ||
| return false | ||
| } | ||
| } | ||
| return true | ||
| } | ||
|
|
||
| func arrayApproxEqualFloat16(left, right *Float16, opt equalOption) bool { | ||
| for i := 0; i < left.Len(); i++ { | ||
| if left.IsNull(i) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems odd to add here. Should we also add this to
AppendValueFromStringand update the docs that you can also use hex to specify binary strings?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm....Prob it was in another branch and ended up here. will remove