
Conversation

@zeroshade
Member

This adds the implementation of the encoding types for Parquet data, including the Plain, RLE, Dictionary, Delta Byte Array, and Delta Packing types. It also includes a hashing implementation that provides more efficient hash tables than Go's standard map, as shown by the benchmarks in the test files, which compare a Go-map-based implementation against the hash table implementation I ported from the C++.
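
For illustration, here is a minimal sketch of the memo-table idea with hypothetical names (not this PR's actual API): each distinct value gets the index at which it was first seen, which is exactly what dictionary encoding needs. The PR replaces the plain Go map used below with the hash table ported from the C++ implementation, which the included benchmarks show to be faster.

```go
package main

import "fmt"

// memoTable is a simplified, hypothetical illustration of a memo table:
// it assigns each distinct value the index at which it was first inserted.
type memoTable struct {
	index map[string]int
	keys  []string
}

func newMemoTable() *memoTable {
	return &memoTable{index: make(map[string]int)}
}

// getOrInsert returns the dictionary index for v, inserting v if it is new.
func (m *memoTable) getOrInsert(v string) (idx int, found bool) {
	if idx, found = m.index[v]; found {
		return idx, true
	}
	idx = len(m.keys)
	m.index[v] = idx
	m.keys = append(m.keys, v)
	return idx, false
}

func main() {
	mt := newMemoTable()
	for _, v := range []string{"a", "b", "a", "c", "b"} {
		idx, _ := mt.getOrInsert(v)
		fmt.Print(idx, " ") // dictionary indices: 0 1 0 2 1
	}
	fmt.Println("\ndictionary:", mt.keys) // dictionary: [a b c]
}
```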

In addition, while adding some test cases I discovered that the -force-vector-width=32 argument on the asm generation was apparently causing segfaults in the encoding tests, so let's let LLVM make its own choice about the vector width and interleaving.

@zeroshade
Member Author

@emkornfield @sbinet @nickpoorman And here we go with the next chunk! Time for the implementation of the encoding types. We're in the home stretch now; I think I've only got about 2 or 3 more of these: metadata, file reading/writing, and integration with Arrow.

@emkornfield
Contributor

@zeroshade Apologies, this is likely going to be another rough week in terms of availability to review. If you don't hear anything by next Tuesday, please ping again.

@zeroshade
Member Author

@emkornfield Pinging for the Tuesday update 😃

@emkornfield
Contributor

Starting to take a look.

@emkornfield
Contributor

@zeroshade Still working my way through, but at least I've started; I will try to continue reviewing over the next few days.

Contributor

One more thought: I'm not sure how you've organized the remaining code, but the spaced-values handling might fit more naturally with Arrow (not critical, and it might not make sense).

Member Author

For the most part, all of the code that interacts with Arrow arrays is isolated to a single package that I haven't put up yet; it uses the decoders by calling the DecodeSpaced functions to easily populate Arrow array bytes.

Contributor

Not an issue for now, but a nicer way of doing this (I don't know about the performance implications for Go) might be some sort of callback/visitor on Arrow arrays instead of bundling it here.

Member Author

I'm not sure what you mean. Are you saying that the PutSpaced/DecodeSpaced functions only really make sense in the context of Arrow arrays, and that there would basically be a layer above the encoders that implements the *Spaced functions by performing the compaction/expansion and then just calling Put, instead of having PutSpaced/DecodeSpaced on every encoder?

Contributor

Yes, that's what I was thinking. It's minor in the grand scheme of things, so refactoring/changing isn't necessary; I just thought I would call it out.

Member Author

It's definitely an idea I can look into as a potential simplifying refactor. But I agree it's minor in the grand scheme of things :)
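
For context, here is a rough sketch of the layering being discussed, with hypothetical names rather than this PR's actual interfaces: a wrapper compacts the valid values out of a spaced slice (driven by a validity mask) and delegates to a plain Put, so individual encoders wouldn't need their own PutSpaced.

```go
package main

import "fmt"

// int32Encoder is a hypothetical minimal encoder interface for this sketch.
type int32Encoder interface{ Put(values []int32) }

// putSpaced compacts the values whose validity flag is set, then delegates
// to the encoder's plain Put; this is the "layer above the encoders" idea.
func putSpaced(enc int32Encoder, values []int32, valid []bool) {
	compacted := make([]int32, 0, len(values))
	for i, v := range values {
		if valid[i] {
			compacted = append(compacted, v)
		}
	}
	enc.Put(compacted)
}

// printEncoder just prints what it is asked to encode.
type printEncoder struct{}

func (printEncoder) Put(values []int32) { fmt.Println("encoding:", values) }

func main() {
	vals := []int32{1, 0, 2, 0, 3}
	valid := []bool{true, false, true, false, true}
	putSpaced(printEncoder{}, vals, valid) // encoding: [1 2 3]
}
```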

Contributor

another int vs uint spot.

Member Author

In this case, it's equivalent to doing memcpy(out, &static_cast<uint32>(intvalue), 4) in C++. Even though it's writing a uint32, it's just a cast; the byte pattern isn't changed by the cast, so when read back it will still correctly be interpreted as a little-endian int32 value.
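
A minimal Go illustration of the point (not this PR's code): converting the int32 to uint32 before writing the four little-endian bytes doesn't change the bit pattern, so the value reads back unchanged.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	v := int32(-42)
	var buf [4]byte
	// The int32 -> uint32 conversion only reinterprets the bits; the bytes
	// written are the same ones a memcpy of the int32 would produce.
	binary.LittleEndian.PutUint32(buf[:], uint32(v))
	back := int32(binary.LittleEndian.Uint32(buf[:]))
	fmt.Println(back == v) // true: round-trips as a little-endian int32
}
```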

Contributor

Small nit: shouldn't there be a check that the total length is less than int32 max?

Member Author

The total length, or just run.Length? The total length, i.e. bufferRef.Len(), is a 64-bit int on 64-bit architectures since it uses int. But I'd actually rather make idxDecode an int64 than add a comparison against int32 max in this loop; this is a low-level tight loop, so adding that comparison would hurt performance more than just changing it to an int64 to avoid the potential for int32 overflow.

Contributor

"sets in the data" does quite make sense to me.

Member Author

Reworded; let me know your opinion on the new wording.

@emkornfield
Contributor

@zeroshade Still making my way through. I was wondering if you would be open to limiting this PR to the interfaces and the encoders already reviewed, with follow-up PRs for the remaining decoders?

@zeroshade
Member Author

@emkornfield I'll see if I can easily remove the files for the remaining decoders. Should I consider as "reviewed" any of the encoders/decoders you've commented on, or were there any you reviewed but didn't leave comments on?

@zeroshade
Member Author

@emkornfield Because of the way things are developed and interdependent, the most I'm able to remove while keeping this PR compilable is the unit tests, the level encoder, the memo table code, and the hashing code for those memo tables.

All in all, 13 files.

@emkornfield
Contributor

> Because of the way things are developed and interdependent, the most I'm able to remove while keeping this PR compilable is the unit tests, the level encoder, the memo table code, and the hashing code for those memo tables. All in all, 13 files.

If you don't mind, that would be great; that's ~1/3 of the PR.

@zeroshade
Member Author

@emkornfield Done! Hope that helps

@zeroshade
Member Author

@emkornfield bump

@zeroshade
Member Author

Just bumping this again in the hope of getting more reviews so I can get this merged.

@emkornfield @sbinet @fsaintjacques @nickpoorman

@emkornfield
Contributor

Sorry will try to finish it off this week. Please ping me again if you don't hear back anything by Wednesday morning.

@zeroshade
Member Author

@emkornfield Pinging as requested, since I didn't hear anything by Wednesday :)

@emkornfield
Contributor

Thanks for the ping; sorry, this week went a bit haywire. Going to review some more now.

}
d.lastVal = out[0]
out = out[1:]
d.prefixLengths = d.prefixLengths[1:]
Contributor

Was this reverse engineered from Java (i.e., is zero actually stored for the first length)?

Member Author

Zero is stored as the first length; look at line 63 in this file for the encoder. When initializing the encoder, we write 0 as the first length.
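
For illustration, a minimal sketch of the prefix/suffix split used by delta byte array encoding (hypothetical helper, not this PR's encoder): each value is stored as the length of the prefix it shares with the previous value plus the remaining suffix, and the first value, having no predecessor, gets prefix length 0.

```go
package main

import "fmt"

// prefixLen returns the length of the common prefix of a and b.
func prefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	values := []string{"apple", "apply", "banana"}
	last := ""
	for _, v := range values {
		p := prefixLen(last, v)
		fmt.Printf("prefix=%d suffix=%q\n", p, v[p:])
		last = v
	}
	// prefix=0 suffix="apple"   <- the first value always gets prefix length 0
	// prefix=4 suffix="y"
	// prefix=0 suffix="banana"
}
```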

d.lastVal = make([]byte, 0, int(prefixLen)+len(suffixHolder[0]))
d.lastVal = append([]byte{}, prefix...)
d.lastVal = append(d.lastVal, suffixHolder[0]...)
out[0], out = d.lastVal, out[1:]
Contributor

how expensive is slicing?

Member Author

A slice is three values:

  1. A pointer to where the data starts
  2. An integer holding the length of the slice
  3. An integer holding the total allocated space for that slice (the capacity)

Slicing is no more expensive than maintaining a pointer + length in C++; slices are explicitly designed to be extremely efficient. Think of them as a view on an array.
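
A quick illustration of the view semantics: reslicing copies only the small slice header, never the underlying bytes.

```go
package main

import "fmt"

func main() {
	data := []byte{1, 2, 3, 4, 5}
	view := data[1:4] // copies only the (pointer, length, capacity) header
	view[0] = 42      // writes through to the same backing array
	fmt.Println(data)                 // [1 42 3 4 5]
	fmt.Println(len(view), cap(view)) // 3 4
}
```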

Contributor

OK. In C++, keeping references alive for something like shared_ptr is actually relatively expensive, which is why I asked.

@zeroshade
Member Author

@emkornfield Just giving another poke here for any other feedback, thanks!

@emkornfield
Contributor

I'll double check but I think this is ok to merge, will try to do so this evening.

@zeroshade
Member Author

@emkornfield Just poking for an update to get this merged so I can put the other half up.

@emkornfield
Contributor

Sorry for the delay, merging now.
