@shanhuuang
Contributor

@shanhuuang shanhuuang commented Jun 30, 2021

data/delta_binary_packed.parquet was generated with parquet-mr version 1.10.0.
The expected file contents are in data/delta_binary_packed_expect.csv.

@shanhuuang force-pushed the ARROW-13206 branch 2 times, most recently from fdee996 to 6fd469d, on July 6, 2021 04:13
delta_binary_packed.parquet is generated with parquet-mr version 1.10.0
The file contents are in delta_binary_packed_expect.csv
@pitrou
Member

pitrou commented Aug 17, 2021

I'm curious: how was the data generated? It seems some columns are always negative, some not.

@pitrou
Member

pitrou commented Aug 17, 2021

Also, what is the "int_value" column at the end?

@shanhuuang
Contributor Author

> I'm curious: how was the data generated? It seems some columns are always negative, some not.

I used a Python script to generate a SQL query of the form "insert overwrite table some_table values ...".
The first 65 columns (bitwidth0 through bitwidth64) are of type bigint, and the "int_value" column is of type int.
The script is roughly as follows:

import numpy as np

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
NUM_ROWS, NUM_COLS = 33, 66
values = [[None] * NUM_COLS for _ in range(NUM_ROWS)]

for i in range(NUM_COLS):
    if i == 65:  # int_value: random int32 values
        for j in range(NUM_ROWS):
            values[j][i] = str(np.random.randint(-2147483648, 2147483647, dtype=np.int32))
    elif i == 0:  # bitwidth0: one random int64 repeated, so every delta is 0
        x = np.random.randint(-9223372036854775808, 9223372036854775807, dtype=np.int64)
        for j in range(NUM_ROWS):
            values[j][i] = str(x)
    else:  # bitwidth1 to bitwidth64
        half = 2 ** (i - 1)
        min_delta = -half  # minimum delta in the block: -2^(i-1)
        values[0][i] = str(0)  # the first row is always 0
        values[1][i] = str(min_delta)  # the second row is always -2^(i-1)
        # Convert to Python ints so that adding 2^(i-1) cannot overflow np.int64.
        delta_in_miniblock = int(np.random.randint(0, half, dtype=np.int64)) + half  # in [2^(i-1), 2^i)
        # Keep the running value as an int; this delta makes the max bit width of delta_in_miniblock equal to i.
        pre_val = min_delta + min_delta + delta_in_miniblock
        values[2][i] = str(pre_val)
        for j in range(3, NUM_ROWS):
            while True:
                delta_in_miniblock = int(np.random.randint(-half, half, dtype=np.int64)) + half  # in [0, 2^i)
                new_val = pre_val + min_delta + delta_in_miniblock
                if INT64_MIN <= new_val <= INT64_MAX:  # retry if outside the int64 range
                    break
            values[j][i] = str(new_val)
            pre_val = new_val
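
To illustrate why this construction pins the bit width of column i to i, here is a minimal sketch of the check a reader could run on one column. It is a simplification, not the parquet-mr implementation: `delta_bit_width` is a hypothetical helper, it treats all deltas as a single block with a single miniblock, and it uses unbounded Python integers rather than wrapping int64 arithmetic. The sample column is made up to follow the scheme for i = 3.

```python
def delta_bit_width(values):
    """Return (min_delta, bit_width) for a short run of integer values.

    Simplified sketch: one block, one miniblock, no int64 wraparound.
    """
    # Deltas between consecutive values; the first value itself is not a delta.
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # Each delta is stored relative to min_delta, so it is never negative.
    max_offset = max(d - min_delta for d in deltas)
    # Bits needed to bit-pack the largest offset (0 when all deltas are equal).
    return min_delta, max_offset.bit_length()


# Hypothetical column for i = 3: row 0 is 0, row 1 is -2^(i-1) = -4,
# and row 2 uses delta_in_miniblock = 7, the largest value below 2^i.
column = [0, -4, -1, -3, 0]
print(delta_bit_width(column))  # (-4, 3)
```

A constant column, as in bitwidth0, gives `(0, 0)` with the same helper, because every delta equals the minimum delta.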

@pitrou
Member

pitrou commented Aug 17, 2021

Thank you. A thought: there are 33 values in each column. This seems to be exactly the size of a miniblock plus the first value encoded in the header. Shouldn't you arrange to have several miniblocks, and possibly even several blocks? (Also, it would perhaps be better if the last miniblock isn't full.)

@shanhuuang
Contributor Author

> Thank you. A thought: there are 33 values in each column. This seems to be exactly the size of a miniblock plus the first value encoded in the header. Shouldn't you arrange to have several miniblocks, and possibly even several blocks? (Also, it would perhaps be better if the last miniblock isn't full.)

OK. I will generate a new file with 200 rows per column: 1 first value, then 2 blocks, with 3 miniblocks in the last block (the last miniblock has 7 values).
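
The arithmetic behind those numbers can be sketched as follows. This is only an illustration: `delta_layout` is a hypothetical helper, and the block size of 128 values split into 4 miniblocks of 32 is an assumption (it matches the 32-value miniblock implied earlier in this thread, but the writer settings used for the file are not stated here).

```python
def delta_layout(num_values, block_size=128, miniblocks_per_block=4):
    """Return (num_blocks, miniblocks_in_last_block, values_in_last_miniblock).

    Assumes num_values >= 2. The first value is stored in the header,
    so only num_values - 1 deltas are packed into blocks.
    """
    miniblock_size = block_size // miniblocks_per_block
    num_deltas = num_values - 1
    # Ceiling division: a partially filled block still counts as a block.
    num_blocks = -(-num_deltas // block_size)
    deltas_in_last_block = num_deltas - (num_blocks - 1) * block_size
    miniblocks_in_last_block = -(-deltas_in_last_block // miniblock_size)
    values_in_last_miniblock = (
        deltas_in_last_block - (miniblocks_in_last_block - 1) * miniblock_size
    )
    return num_blocks, miniblocks_in_last_block, values_in_last_miniblock


print(delta_layout(200))  # (2, 3, 7)
print(delta_layout(33))   # (1, 1, 32)
```

Under these assumptions, 200 rows leave 199 deltas: one full block of 128, then 71 deltas spread over two full miniblocks and a third holding 7 values. The original 33 rows fill exactly one miniblock.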

@pitrou changed the title from "ARROW-13206: Add file of DELTA_BINARY_PACKED encoding" to "PARQUET-490: Add file of DELTA_BINARY_PACKED encoding" on Aug 19, 2021
@pitrou pitrou merged commit 600d437 into apache:master Aug 19, 2021