
Conversation

@zeroshade
Member

This adds the implementation of the encoding types for Parquet data, including the Plain, RLE, Dictionary, Delta Byte Array, and Delta Packing types. It also includes a hashing implementation that provides more efficient hash tables than Go's standard map, as shown by the benchmarks in the test files, which compare a Go-map-based implementation against the hash table implementation I ported from the C++.
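
For illustration, here is a minimal sketch of the memo-table idea with hypothetical names (not this PR's actual API): each distinct value gets the index at which it was first seen, which is exactly what dictionary encoding needs. The PR replaces the plain Go map used below with the hash table ported from the C++ implementation, which the included benchmarks show to be faster.

```go
package main

import "fmt"

// memoTable is a simplified, hypothetical illustration of a memo table:
// it assigns each distinct value the index at which it was first inserted.
type memoTable struct {
	index map[string]int
	keys  []string
}

func newMemoTable() *memoTable {
	return &memoTable{index: make(map[string]int)}
}

// getOrInsert returns the dictionary index for v, inserting v if it is new.
func (m *memoTable) getOrInsert(v string) (idx int, found bool) {
	if idx, found = m.index[v]; found {
		return idx, true
	}
	idx = len(m.keys)
	m.index[v] = idx
	m.keys = append(m.keys, v)
	return idx, false
}

func main() {
	mt := newMemoTable()
	for _, v := range []string{"a", "b", "a", "c", "b"} {
		idx, _ := mt.getOrInsert(v)
		fmt.Print(idx, " ") // dictionary indices: 0 1 0 2 1
	}
	fmt.Println("\ndictionary:", mt.keys) // dictionary: [a b c]
}
```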

In addition, while adding some test cases I discovered that the -force-vector-width=32 argument on the asm generation was apparently causing segfaults in the encoding tests, so let's let LLVM make its own choice about the vector width and interleaving.

@zeroshade
Member Author

@emkornfield @sbinet @nickpoorman And here we go with the next chunk! Time for the implementation of the encoding types. We're in the home stretch now; I think I've only got about 2 or 3 more of these: metadata, file reading/writing, and integration with Arrow.

@emkornfield
Contributor

@zeroshade Apologies, this is likely going to be another rough week in terms of availability to review. If you don't hear anything by next Tuesday, please ping again.

@zeroshade
Member Author

@emkornfield Pinging for the Tuesday update 😃

@emkornfield
Contributor

Starting to take a look.

@emkornfield
Contributor

@zeroshade Still working my way through, but at least I've started; I will try to continue reviewing over the next few days.

Contributor

One more thought: I'm not sure how you've organized the remaining code, but the spaced-values handling might fit more naturally with Arrow (not critical, and it might not make sense).

Member Author

For the most part, all of the code that interacts with Arrow arrays is isolated to a single package that I haven't put up yet; it uses the decoders by calling the DecodeSpaced functions to easily populate Arrow array bytes.

Contributor

Not an issue for now, but a nicer way of doing this (I don't know about the performance implications for Go) might be some sort of callback/visitor on Arrow arrays instead of bundling it here.

Member Author

I'm not sure what you mean. Are you saying that the PutSpaced/DecodeSpaced functions only really make sense in the context of Arrow arrays, and that there would basically be a layer above the encoders that implements the *Spaced functions by performing the compaction/expansion and then just calling Put, instead of having PutSpaced/DecodeSpaced on every encoder?

Contributor

Yes, that's what I was thinking. It's minor in the grand scheme of things, so refactoring/changing isn't necessary; I just thought I would call it out.

Member Author

It's definitely an idea I can look into as a potential simplifying refactor. But I agree it's minor in the grand scheme of things :)
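
For context, here is a rough sketch of the layering being discussed, with hypothetical names rather than this PR's actual interfaces: a wrapper compacts the valid values out of a spaced slice (driven by a validity mask) and delegates to a plain Put, so individual encoders wouldn't need their own PutSpaced.

```go
package main

import "fmt"

// int32Encoder is a hypothetical minimal encoder interface for this sketch.
type int32Encoder interface{ Put(values []int32) }

// putSpaced compacts the values whose validity flag is set, then delegates
// to the encoder's plain Put; this is the "layer above the encoders" idea.
func putSpaced(enc int32Encoder, values []int32, valid []bool) {
	compacted := make([]int32, 0, len(values))
	for i, v := range values {
		if valid[i] {
			compacted = append(compacted, v)
		}
	}
	enc.Put(compacted)
}

// printEncoder just prints what it is asked to encode.
type printEncoder struct{}

func (printEncoder) Put(values []int32) { fmt.Println("encoding:", values) }

func main() {
	vals := []int32{1, 0, 2, 0, 3}
	valid := []bool{true, false, true, false, true}
	putSpaced(printEncoder{}, vals, valid) // encoding: [1 2 3]
}
```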

Contributor

another int vs uint spot.

Member Author

In this case, it's equivalent to doing memcpy(out, &static_cast<uint32>(intvalue), 4) in C++. Even though it's writing a uint32, it's just a cast; the byte pattern isn't changed by the cast, so when read back it will still correctly be interpreted as a little-endian int32 value.
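
A minimal Go illustration of the point (not this PR's code): converting the int32 to uint32 before writing the four little-endian bytes doesn't change the bit pattern, so the value reads back unchanged.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	v := int32(-42)
	var buf [4]byte
	// The int32 -> uint32 conversion only reinterprets the bits; the bytes
	// written are the same ones a memcpy of the int32 would produce.
	binary.LittleEndian.PutUint32(buf[:], uint32(v))
	back := int32(binary.LittleEndian.Uint32(buf[:]))
	fmt.Println(back == v) // true: round-trips as a little-endian int32
}
```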

Contributor

Small nit: shouldn't there be a check that the total length is less than int32 max?

Member Author

The total length, or just run.Length? The total length, i.e. bufferRef.Len(), is a 64-bit int on 64-bit architectures since it uses int. But I'd actually rather make idxDecode an int64 than add a comparison against int32 max in this loop; this is a low-level tight loop, so adding that comparison would hurt performance more than just changing it to an int64 to avoid the potential for int32 overflow.

Contributor

"sets in the data" does quite make sense to me.

Member Author

Reworded; let me know your opinion on the new wording.

@emkornfield
Contributor

@zeroshade Still making my way through. I was wondering if you would be open to limiting this PR to the interfaces and the encoders already reviewed, with follow-up PRs for the remaining decoders?

@zeroshade
Member Author

@emkornfield I'll see if I can easily remove the files for the remaining decoders. Should I consider as "reviewed" any of the encoders/decoders you've commented on, or were there any you reviewed but didn't leave comments on?

@zeroshade
Member Author

@emkornfield Because of the way things are developed and interdependent, the most I'm able to remove while keeping this PR compilable is the unit tests, the level encoder, the memo table code, and the hashing code for those memo tables.

All in all, 13 files.

@emkornfield
Contributor

> Because of the way things are developed and interdependent, the most I'm able to remove while keeping this PR compilable is the unit tests, the level encoder, the memo table code, and the hashing code for those memo tables. All in all, 13 files.

If you don't mind, that would be great; that's ~1/3 of the PR.

@zeroshade
Member Author

@emkornfield Done! Hope that helps

@zeroshade
Member Author

@emkornfield bump

@zeroshade
Member Author

Just bumping this again in the hope of getting more reviews so I can get this merged.

@emkornfield @sbinet @fsaintjacques @nickpoorman

@emkornfield
Contributor

Sorry will try to finish it off this week. Please ping me again if you don't hear back anything by Wednesday morning.

@zeroshade
Member Author

@emkornfield Pinging as requested, since I didn't hear anything by Wednesday :)

@emkornfield
Contributor

Thanks for the ping; sorry, this week went a bit haywire. Going to review some more now.

}
d.lastVal = out[0]
out = out[1:]
d.prefixLengths = d.prefixLengths[1:]
Contributor

Was this reverse engineered from Java (i.e., is zero actually stored for the first length)?

Member Author

Zero is stored as the first length; look at line 63 in this file for the encoder. When initializing the encoder, we write 0 as the first length.
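
For illustration, a minimal sketch of the prefix/suffix split used by delta byte array encoding (hypothetical helper, not this PR's encoder): each value is stored as the length of the prefix it shares with the previous value plus the remaining suffix, and the first value, having no predecessor, gets prefix length 0.

```go
package main

import "fmt"

// prefixLen returns the length of the common prefix of a and b.
func prefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	values := []string{"apple", "apply", "banana"}
	last := ""
	for _, v := range values {
		p := prefixLen(last, v)
		fmt.Printf("prefix=%d suffix=%q\n", p, v[p:])
		last = v
	}
	// prefix=0 suffix="apple"   <- the first value always gets prefix length 0
	// prefix=4 suffix="y"
	// prefix=0 suffix="banana"
}
```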

d.lastVal = make([]byte, 0, int(prefixLen)+len(suffixHolder[0]))
d.lastVal = append([]byte{}, prefix...)
d.lastVal = append(d.lastVal, suffixHolder[0]...)
out[0], out = d.lastVal, out[1:]
Contributor

how expensive is slicing?

Member Author

A slice is three values:

  1. A pointer to where the data starts
  2. An integer holding the length of the slice
  3. An integer holding the total allocated space for that slice (the capacity)

Slicing is no more expensive than maintaining a pointer + length in C++; slices are explicitly designed to be extremely efficient. Think of them as a view on an array.
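
A quick illustration of the view semantics: reslicing copies only the small slice header, never the underlying bytes.

```go
package main

import "fmt"

func main() {
	data := []byte{1, 2, 3, 4, 5}
	view := data[1:4] // copies only the (pointer, length, capacity) header
	view[0] = 42      // writes through to the same backing array
	fmt.Println(data)                 // [1 42 3 4 5]
	fmt.Println(len(view), cap(view)) // 3 4
}
```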

Contributor

OK. In C++, keeping references alive for something like shared_ptr is actually relatively expensive, which is why I asked.

@zeroshade
Member Author

@emkornfield Just giving another poke here for any other feedback, thanks!

@emkornfield
Contributor

I'll double check but I think this is ok to merge, will try to do so this evening.

@zeroshade
Member Author

@emkornfield Just poking for an update to get this merged so I can put the other half up.

@emkornfield
Contributor

Sorry for the delay, merging now.
