-
Notifications
You must be signed in to change notification settings - Fork 25
feat(format): optimize avro read performance and support reading various nested types #76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Please enhance the PR description with more details, such as:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 63 out of 91 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
test/inte/scan_and_read_inte_test.cpp:2716
- The new Avro integration tests (TestAvroWithAppendTable and TestAvroWithPkTable) check for file_format == "avro" at the beginning, but the GetTestValuesForScanAndReadInteTest function doesn't include avro in the returned test values. This means these tests will never execute. Consider adding avro test values similar to how ORC is conditionally added, for example:
#ifdef PAIMON_ENABLE_AVRO
values.emplace_back("avro", false);
values.emplace_back("avro", true);
#endifsrc/paimon/format/avro/avro_format_writer.cpp:123
- Inconsistent error message formatting: The error messages at lines 72, 84, 99, and 123 use "{}" placeholders but don't call fmt::format(). These should either use fmt::format() like line 70, 82, 97, and 121, or should use simple string concatenation with "+". For example, line 72 should be:
return Status::Invalid(fmt::format("avro format writer create failed: {}", e.what()));The same fix applies to lines 84, 99, and 123.
return Status::Invalid("avro format writer create failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro format writer create failed: unknown exception");
}
}
Status AvroFormatWriter::Flush() {
try {
writer_->flush();
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer flush failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer flush failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro writer flush failed: unknown exception");
}
return Status::OK();
}
Status AvroFormatWriter::Finish() {
try {
avro_output_stream_->FlushBuffer(); // we need flush buffer before close writer
writer_->close();
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer close failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer close failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro writer close failed: unknown exception");
}
return Status::OK();
}
Result<bool> AvroFormatWriter::ReachTargetSize(bool suggested_check, int64_t target_size) const {
return Status::NotImplemented("not support yet");
}
Status AvroFormatWriter::AddBatch(ArrowArray* batch) {
assert(batch);
PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::shared_ptr<arrow::Array> arrow_array,
arrow::ImportArray(batch, data_type_));
PAIMON_ASSIGN_OR_RAISE(std::vector<::avro::GenericDatum> datums,
adaptor_->ConvertArrayToGenericDatums(arrow_array, avro_schema_));
try {
for (const auto& datum : datums) {
writer_->write(datum);
}
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer add batch failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer add batch failed: {}", e.what());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Purpose
Linked issue: #31
Performance implications of the current approach:
In avro's default impl
GenericDatum, it usestd::anyfor data store. When read and write, it usestd::any_castto cast to destination data type. It accounts for 40% to 50% of the performance hotspots.Detailed optimize strategy:
Refer to the implementation of Apache Iceberg C++: Directly decode Avro data to Arrow array builders without GenericDatum. Eliminates the GenericDatum intermediate layer by directly calling Avro decoder methods and immediately appending to Arrow builders.
More TODO:
Support avro write in next PR.
Tests
API and Format
Documentation