PARQUET-490: Add file of DELTA_BINARY_PACKED encoding #19
delta_binary_packed.parquet is generated with parquet-mr version 1.10.0. The file contents are in delta_binary_packed_expect.csv.
I'm curious: how was the data generated? It seems some columns are always negative, some not.
Also, what is the "int_value" column at the end?
I generated a SQL query like "insert overwrite table some_table values ..." using a Python script:

    import numpy as np

    # Assumed setup (not shown in the original comment): 33 rows x 66 columns.
    values = [[None] * 66 for _ in range(33)]

    for i in range(0, 66):
        if i == 65:  # for int_value
            for j in range(0, 33):
                values[j][i] = str(np.random.randint(-2147483648, 2147483647, dtype=np.int32))
        elif i == 0:  # for bitwidth0
            x = np.random.randint(-9223372036854775808, 9223372036854775807, dtype=np.int64)
            for j in range(0, 33):
                values[j][i] = str(x)
        else:  # for bitwidth1 to bitwidth64
            values[0][i] = str(0)  # the first row is always 0
            min_delta = -pow(2, i - 1)  # min_delta in a block is -2^(i-1)
            values[1][i] = str(min_delta)  # the second row is always -2^(i-1)
            # int() keeps the arithmetic in Python integers, so adding 2^(i-1) cannot overflow int64
            delta_in_miniblock = int(np.random.randint(0, pow(2, i - 1), dtype=np.int64))
            delta_in_miniblock += pow(2, i - 1)  # delta_in_miniblock belongs to [2^(i-1), 2^i)
            # make sure that the max bit width of delta_in_miniblock is i
            pre_val = int(values[1][i]) + min_delta + delta_in_miniblock
            values[2][i] = str(pre_val)
            for j in range(3, 33):
                delta_in_miniblock = int(np.random.randint(-pow(2, i - 1), pow(2, i - 1), dtype=np.int64))
                delta_in_miniblock += pow(2, i - 1)  # delta_in_miniblock belongs to [0, 2^i)
                new_val = pre_val + min_delta + delta_in_miniblock
                while new_val > 9223372036854775807:  # int64 overflow
                    delta_in_miniblock = int(np.random.randint(-pow(2, i - 1), pow(2, i - 1), dtype=np.int64))
                    delta_in_miniblock += pow(2, i - 1)
                    new_val = pre_val + min_delta + delta_in_miniblock
                values[j][i] = str(new_val)
                pre_val = new_val
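To check that a generated column really exercises the intended bit width, the following is a minimal sketch, not part of the PR. The helper name `delta_bit_width` is hypothetical, and it treats the whole list as a single miniblock, which matches the 33-row case (1 first value plus 32 deltas). It computes the deltas, subtracts the minimum delta, and reports how many bits the largest adjusted delta needs.

```python
def delta_bit_width(values):
    """Return (min_delta, bit_width) for a list of Python ints.

    bit_width is the number of bits needed to store the largest
    (delta - min_delta), treating all deltas as one miniblock.
    """
    # Consecutive differences; the first value itself is not a delta.
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    max_adjusted = max(d - min_delta for d in deltas)
    return min_delta, max_adjusted.bit_length()


# A hand-made column in the style of "bitwidth3": first value 0,
# second value min_delta = -2^(3-1) = -4, then adjusted deltas 5, 0, 7.
column = [0, -4, -3, -7, -4]
print(delta_bit_width(column))  # (-4, 3)
```

A constant column, as in the bitwidth0 case, gives `(0, 0)` because every delta is zero.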
Thank you. A thought: there are 33 values in each column. This seems to be exactly the size of a miniblock plus the first value encoded in the header. Shouldn't you arrange to have several miniblocks, and possibly several blocks even? (Also, perhaps better if the last miniblock isn't full.)
OK. I will generate a new file with 200 rows, including 1 first value, 2 blocks and 3 miniblocks (the last miniblock has 7 values).
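The row counts discussed above can be sanity-checked with a small sketch. It assumes a block size of 128 values split into 4 miniblocks of 32, which is consistent with the numbers quoted in this thread but is not stated in it. The function name `delta_layout` is hypothetical, and the sketch only counts logical deltas per miniblock; it does not model any padding an encoder may write.

```python
def delta_layout(num_values, block_size=128, miniblocks_per_block=4):
    """Return, per block, the number of deltas held by each non-empty miniblock."""
    miniblock_size = block_size // miniblocks_per_block
    remaining = max(num_values - 1, 0)  # the first value is stored in the header
    blocks = []
    while remaining > 0:
        in_block = min(remaining, block_size)
        remaining -= in_block
        miniblocks = []
        while in_block > 0:
            n = min(in_block, miniblock_size)
            miniblocks.append(n)
            in_block -= n
        blocks.append(miniblocks)
    return blocks


print(delta_layout(33))   # [[32]]
print(delta_layout(200))  # [[32, 32, 32, 32], [32, 32, 7]]
```

Under these assumptions, 33 rows fill exactly one miniblock, while 200 rows give one full block followed by a second block with 3 miniblocks, the last holding 7 values.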
data/delta_binary_packed.parquet is generated with parquet-mr version 1.10.0. The file contents are in data/delta_binary_packed_expect.csv.