Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions cpp/src/arrow/util/rle_encoding.h
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,9 @@ class RleDecoder {
int GetBatchWithDict(const T* dictionary, T* values, int batch_size);

/// Like GetBatchWithDict but add spacing for null entries
///
/// Null entries will be zero-initialized in `values` to avoid leaking
/// private data.
template <typename T>
int GetBatchWithDictSpaced(const T* dictionary, T* values, int batch_size,
int null_count, const uint8_t* valid_bits,
Expand Down Expand Up @@ -433,6 +436,8 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* out, int b
DCHECK_GE(bit_width_, 0);
int values_read = 0;
int remaining_nulls = null_count;
T zero;
memset(&zero, 0, sizeof(T));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes class-memaccess warning with g++ 9.2.1:

[1/8] Building CXX object src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o
FAILED: src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o 
/usr/bin/c++  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H -DPARQUET_EXPORTING -Isrc -I../src -isystem jemalloc_ep-prefix/src -isystem flatbuffers_ep-prefix/src/flatbuffers_ep-install/include -isystem ../thirdparty/cares_ep-install/include -isystem ../thirdparty/hadoop/include -isystem orc_ep-install/include -Wno-noexcept-type  -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2  -g -fPIC   -std=gnu++11 -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -MF src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o.d -o src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o -c ../src/parquet/encoding.cc
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20:   required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::FIXED_LEN_BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::FixedLenByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7:   required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::FixedLenByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
  440 |   memset(&zero, 0, sizeof(T));
      |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
                 from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:515:8: note: 'struct parquet::FixedLenByteArray' declared here
  515 | struct FixedLenByteArray {
      |        ^~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.cc:33:
../src/arrow/util/rle_encoding.h: In instantiation of 'int arrow::util::RleDecoder::GetBatchWithDictSpaced(const T*, T*, int, int, const uint8_t*, int64_t) [with T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]':
../src/parquet/encoding.cc:1079:20:   required from 'int parquet::DictDecoderImpl<Type>::DecodeSpaced(parquet::DictDecoderImpl<Type>::T*, int, int, const uint8_t*, int64_t) [with Type = parquet::PhysicalType<parquet::Type::BYTE_ARRAY>; parquet::DictDecoderImpl<Type>::T = parquet::ByteArray; uint8_t = unsigned char; int64_t = long int]'
../src/parquet/encoding.cc:1076:7:   required from here
../src/arrow/util/rle_encoding.h:440:9: error: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct parquet::ByteArray'; use assignment or value-initialization instead [-Werror=class-memaccess]
  440 |   memset(&zero, 0, sizeof(T));
      |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~
In file included from ../src/parquet/encoding.h:27,
                 from ../src/parquet/encoding.cc:18:
../src/parquet/types.h:495:8: note: 'struct parquet::ByteArray' declared here
  495 | struct ByteArray {
      |        ^~~~~~~~~
cc1plus: all warnings being treated as errors

Can we use T zero = {} instead of memset() here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is memset(static_cast<void*>(&zero), 0, sizeof(T)) better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't know that T zero = {} caused zero initialization. We can use that then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou Do you want to submit a PR or should I do it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do because I can confirm the fix locally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created: #5414


arrow::internal::BitmapReader bit_reader(valid_bits, valid_bits_offset, batch_size);

Expand Down Expand Up @@ -484,6 +489,7 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* out, int b
*out = dictionary[indices[literals_read]];
literals_read++;
} else {
*out = zero;
skipped++;
}
++out;
Expand All @@ -494,6 +500,7 @@ inline int RleDecoder::GetBatchWithDictSpaced(const T* dictionary, T* out, int b
remaining_nulls -= skipped;
}
} else {
*out = zero;
++out;
values_read++;
remaining_nulls--;
Expand Down
8 changes: 7 additions & 1 deletion cpp/src/parquet/arrow/reader_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -738,7 +738,13 @@ Status TransferInt96(RecordReader* reader, MemoryPool* pool,
RETURN_NOT_OK(::arrow::AllocateBuffer(pool, length * sizeof(int64_t), &data));
auto data_ptr = reinterpret_cast<int64_t*>(data->mutable_data());
for (int64_t i = 0; i < length; i++) {
*data_ptr++ = Int96GetNanoSeconds(values[i]);
if (values[i].value[2] == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why 0 is an invalid value (I guess this is specific semantics of int96 type)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not invalid per se, it just doesn't convert to a valid int64_t timestamp.
A better fix might to be to detect all non-convertible values, and raise an error or use a placeholder, but that's overkill for a deprecated feature IMHO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, OK.

// Happens for null entries: avoid triggering UBSAN as that Int96 timestamp
// isn't representable as a 64-bit Unix timestamp.
*data_ptr++ = 0;
} else {
*data_ptr++ = Int96GetNanoSeconds(values[i]);
}
}
*out = std::make_shared<TimestampArray>(type, length, data, reader->ReleaseIsValid(),
reader->null_count());
Expand Down