[C++][Parquet] Page sizes for repeated columns can overflow int32 with page index enabled #47027

@adamreeve

Describe the bug, including details regarding any error messages, version, and platform.

I noticed a regression when upgrading from Arrow 19.0.1 to 20.0.0 and writing Parquet files with a repeated column. It appears that enabling the page index by default (#45249) changed the logic for starting new data pages, so pages can grow very large and their size can overflow int32.

Repro code, from commit adamreeve@8444bf6:

TEST(TestColumnWriter, WriteLargeLists) {
  auto sink = CreateOutputStream();
  auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
      "schema", Repetition::REQUIRED,
      {
          GroupNode::Make(
              "x", Repetition::OPTIONAL,
              {
                  GroupNode::Make("list", Repetition::REPEATED,
                                  {
                                      schema::Float("element", Repetition::REQUIRED),
                                  },
                                  nullptr),
              },
              LogicalType::List()),
      }));
  auto properties = WriterProperties::Builder()
                        .disable_dictionary()
                        // Uncommenting the next line avoids the overflow:
                        //->disable_write_page_index()
                        ->build();
  auto file_writer = ParquetFileWriter::Open(sink, schema, properties);
  auto rg_writer = file_writer->AppendRowGroup();

  constexpr int64_t num_rows = 1000 * 1000;
  constexpr int64_t num_list_elements = 1000;

  std::vector<int16_t> def_levels(num_list_elements);
  std::vector<int16_t> rep_levels(num_list_elements);
  std::vector<float> values(num_list_elements);

  for (int32_t i = 0; i < num_list_elements; ++i) {
    def_levels[i] = 2;
    rep_levels[i] = i == 0 ? 0 : 1;
  }

  auto col_writer = dynamic_cast<FloatWriter*>(rg_writer->NextColumn());
  for (int32_t i = 0; i < num_rows; i++) {
    random_numbers(num_list_elements, i, -100.0f, 100.0f, values.data());
    col_writer->WriteBatch(num_list_elements, def_levels.data(), rep_levels.data(),
                           values.data());
  }
  col_writer->Close();

  rg_writer->Close();
  file_writer->Close();
  ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());
}

The test passes when disable_write_page_index() is uncommented, but otherwise fails with:

C++ exception with description "Uncompressed data page size overflows INT32_MAX. Size:4005000014" thrown in the test body.
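
For reference, the reported size is consistent with every row of the repro ending up in a single data page. The sketch below redoes that arithmetic using the numbers from the test and the exception message. The helper names (`PlainFloatBytes`, `FitsInInt32`, `VerifyReproNumbers`) are made up for illustration and are not part of the Parquet API, and it assumes the values are PLAIN-encoded at 4 bytes per float because the dictionary is disabled. The remaining ~5 MB is presumably level data and other overhead, which I haven't verified.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Figures taken from the repro test and the exception message above.
constexpr int64_t kNumRows = 1000 * 1000;
constexpr int64_t kNumListElements = 1000;
constexpr int64_t kReportedPageSize = 4005000014LL;

// Hypothetical helper: bytes of float values if all rows land in one page,
// assuming PLAIN encoding (4 bytes per value, no dictionary).
constexpr int64_t PlainFloatBytes(int64_t rows, int64_t elements_per_row) {
  return rows * elements_per_row * static_cast<int64_t>(sizeof(float));
}

// Hypothetical helper: true if a page size can be stored in an int32 field.
constexpr bool FitsInInt32(int64_t size) {
  return size <= std::numeric_limits<int32_t>::max();
}

// Checks that the values alone already exceed INT32_MAX and account for
// almost all of the reported page size.
inline void VerifyReproNumbers() {
  constexpr int64_t value_bytes = PlainFloatBytes(kNumRows, kNumListElements);
  assert(value_bytes == 4000000000LL);
  assert(!FitsInInt32(value_bytes));
  assert(!FitsInInt32(kReportedPageSize));
  assert(kReportedPageSize - value_bytes == 5000014);
}
```

So the float values alone are roughly 1.86 times INT32_MAX (2147483647), which suggests no new data page was started for the whole column chunk.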

Component(s)

C++, Parquet
