ARROW-12851: [Go][Parquet] Add Golang Parquet encoding package #10379
Conversation
@emkornfield @sbinet @nickpoorman And here we go with the next chunk! Time for the implementation of the encoding types; we're in the home stretch now. I think I've only got about 2 or 3 more of these: metadata, file reading/writing, and integration with Arrow.

@zeroshade Apologies, this is likely going to be another rough week in terms of availability to review. If you don't hear anything by next Tuesday please ping again.

@emkornfield Pinging for the Tuesday update 😃

Starting to take a look.

@zeroshade Still working my way through, but at least I've started; will try to continue to review over the next few days.
one more thought: I'm not sure how you've organized the remaining code, but the handling of spaced values might more naturally fit with Arrow (not critical, and might not make sense).
For the most part, all of the code that interacts with Arrow arrays is isolated to a single package that I haven't put up yet; it uses the decoders by calling the DecodeSpaced functions to easily populate Arrow array bytes.
not an issue for now, but it seems a nicer way of doing this (I don't know about the performance implications for Go) would be some sort of callback/visitor on Arrow arrays instead of bundling it here.
I'm not sure what you mean. Are you saying that the PutSpaced/DecodeSpaced functions only really make sense in the context of being used with Arrow arrays, and that we could have a layer above the encoders that implements the *Spaced functions by performing the compression/expansion and then just calling Put, instead of having a PutSpaced/DecodeSpaced on every encoder?
Yes, that was what I was thinking. It's minor in the grand scheme of things, so refactoring/changing isn't necessary; I just thought I would call it out.
It's definitely an idea I can look into as a potential simplification refactor. But I agree it's minor in the grand scheme of things :)
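A hedged sketch of the layering being discussed; the Encoder interface and names below are hypothetical stand-ins, not the PR's actual API:

```go
package spaced

// Encoder is a hypothetical stand-in for the PR's typed encoder interfaces.
type Encoder interface {
	Put(values []int32)
}

// PutSpaced compacts a spaced batch (values interleaved with slots for
// nulls, per the validity bitmap) down to just the defined values, then
// delegates to the encoder's plain Put. Layering this once over every
// encoder would replace per-encoder PutSpaced implementations.
func PutSpaced(e Encoder, values []int32, validBits []byte, validBitsOffset int64) {
	dense := make([]int32, 0, len(values))
	for i := range values {
		bit := validBitsOffset + int64(i)
		if validBits[bit/8]&(1<<uint(bit%8)) != 0 { // bit set => value is defined
			dense = append(dense, values[i])
		}
	}
	e.Put(dense)
}
```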
another int vs uint spot.
in this case, it's equivalent to doing memcpy(out, &static_cast<uint32_t>(intvalue), 4) in C++. Even though it's writing a uint32, it's just a cast; the byte pattern isn't changed by the cast, so when read back it will still correctly be interpreted as a little-endian int32 value.
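A minimal Go sketch of the point, using the standard encoding/binary package (illustrative, not the PR's code): converting int32 to uint32 reinterprets the two's-complement bits without changing them.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	val := int32(-42)
	out := make([]byte, 4)

	// Equivalent to memcpy(out, &static_cast<uint32_t>(val), 4) in C++:
	// the int32 -> uint32 conversion reinterprets bits, it doesn't change them.
	binary.LittleEndian.PutUint32(out, uint32(val))

	// Reading the bytes back and converting to int32 recovers the value.
	got := int32(binary.LittleEndian.Uint32(out))
	fmt.Println(got) // -42
}
```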
small nit: shouldn't there be a check to make sure the total length is less than int32 max?
the total length, or just run.Length? The total length, i.e. bufferRef.Len() above, is already a 64-bit int on 64-bit architectures since it uses int. But I'd rather make idxDecode an int64 than add a comparison against int32 max in this loop: this is a low-level tight loop, so adding that comparison would hurt performance more than just changing idxDecode to an int64 to avoid the potential for int32 overflow.
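A hedged sketch of the trade-off described (hypothetical names, not the PR's actual decode loop): widening the index type removes the need for a per-iteration overflow branch in the hot loop.

```go
package main

import "fmt"

// decodeAll is a hypothetical stand-in for the tight decode loop discussed
// above. Using an int64 index means the loop body needs no
// "index < math.MaxInt32" guard, even when the buffer length (a Go int,
// 64-bit on 64-bit platforms) exceeds the int32 range.
func decodeAll(totalLen int) int64 {
	var idxDecode int64 // widened index: int32 overflow is unreachable
	for idxDecode = 0; idxDecode < int64(totalLen); idxDecode++ {
		// decode one value here; no per-iteration overflow comparison
	}
	return idxDecode
}

func main() {
	fmt.Println(decodeAll(10)) // 10
}
```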
"sets in the data" does quite make sense to me.
reworded, let me know your opinion on the new wording.
@zeroshade Still making my way through. I was wondering if you would be open to limiting this PR to the interfaces and the encoders already reviewed, and we can have follow-up PRs for the remaining decoders?

@emkornfield I'll see if I can easily remove the files for the remaining decoders. Should I consider as "reviewed" any of the encoders/decoders you've commented on? Or were there any you reviewed but didn't leave comments on?

@emkornfield So because of the way things are developed and interdependent, the most I'm able to remove while keeping this PR compilable is the unit tests, the level encoder, the memo table code, and the hashing stuff for those memo tables. All in all 13 files.

If you don't mind that would be great, that's ~1/3 of the PR.

@emkornfield Done! Hope that helps.

@emkornfield Bump.

Just bumping this again in the hopes of more reviews so I can get this merged.

Sorry, will try to finish it off this week. Please ping me again if you don't hear back anything by Wednesday morning.

@emkornfield Pinging as requested since I didn't hear anything by Wednesday :)

Thanks for the ping; sorry, this week went a bit haywire. Going to review some more now.
```go
}
d.lastVal = out[0]                    // track the most recently decoded value
out = out[1:]                         // advance the output slice past it
d.prefixLengths = d.prefixLengths[1:] // consume the prefix length just used
```
was this reverse engineered from Java (i.e. is zero actually stored for the first length)?
zero is stored as the first length; look at line 63 in this file for the encoder. When initializing the encoder, we write 0 as the first length.
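A hedged sketch of the convention described, with hypothetical names rather than the PR's actual types: the delta byte array encoder seeds the prefix-length stream with 0 because the first value shares no prefix with any predecessor.

```go
package delta

// deltaByteArrayEncoder is a hypothetical stand-in for the PR's encoder.
type deltaByteArrayEncoder struct {
	prefixLengths []int32 // shared-prefix length per value
	lastVal       []byte  // previously encoded value
}

func newDeltaByteArrayEncoder() *deltaByteArrayEncoder {
	e := &deltaByteArrayEncoder{}
	// the first value has no predecessor, so its shared-prefix length is 0
	e.prefixLengths = append(e.prefixLengths, 0)
	return e
}
```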
```go
d.lastVal = make([]byte, 0, int(prefixLen)+len(suffixHolder[0]))
d.lastVal = append([]byte{}, prefix...)           // copy the shared prefix
d.lastVal = append(d.lastVal, suffixHolder[0]...) // append this value's suffix
out[0], out = d.lastVal, out[1:]                  // emit the value and advance the output
```
how expensive is slicing?
A slice is three values:
- A pointer to where the data starts
- An integer holding the length of the slice
- An integer holding the total allocated space for that slice (the capacity)
Slicing is no more expensive than maintaining a pointer + length in C++; slices are explicitly designed to be extremely efficient. Think of them as a view on an array.
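A minimal illustration of why re-slicing is cheap: it only adjusts the (pointer, length, capacity) header and never copies the backing array.

```go
package main

import "fmt"

func main() {
	data := []byte("hello, parquet")

	view := data[7:] // new slice header into the same array; no copy
	fmt.Println(string(view), len(view), cap(view))

	// Mutating through the view is visible in the original,
	// showing both headers share one backing array.
	view[0] = 'P'
	fmt.Println(string(data)) // "hello, Parquet"
}
```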
OK. In C++, keeping references alive with something like shared_ptr is actually relatively expensive, which is why I asked.
@emkornfield Just giving another poke here for any other feedback, thanks!

I'll double check, but I think this is OK to merge; will try to do so this evening.

@emkornfield Just poking for an update to get this merged so I can put the other half up.

Sorry for the delay, merging now.
Adding the implementation of the encoding types for Parquet data, including the Plain, RLE, Dictionary, Delta Byte Array, and Delta Packing types. It also includes a hashing implementation for hash tables that are more efficient than Go's std map, as shown by the benchmarks included in the test files, which compare a Go-map-based implementation against the hash table implementation I ported from the C++ implementation.
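As a rough illustration of the comparison described (hypothetical names; the PR's actual benchmarks live in its test files), the go-map baseline memo table being benchmarked against might look like:

```go
package memo

import "testing"

// BenchmarkGoMapMemoTable sketches the baseline side of the comparison:
// deduplicating values into memo indices with Go's built-in map, the
// approach the ported hash table is benchmarked against.
func BenchmarkGoMapMemoTable(b *testing.B) {
	keys := [][]byte{[]byte("a"), []byte("bb"), []byte("a"), []byte("ccc")}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		table := make(map[string]int) // go-map-based memo table
		for _, k := range keys {
			if _, ok := table[string(k)]; !ok {
				table[string(k)] = len(table) // assign the next memo index
			}
		}
	}
}
```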
In addition, while adding some test cases I discovered that the -force-vector-width=32 argument on the asm generation was apparently causing segfaults in the encoding tests, so let's let LLVM make its own choice about the vector width and interleaving.