Skip to content

Conversation

@zjw1111
Copy link
Collaborator

@zjw1111 zjw1111 commented Jan 26, 2026

Purpose

Linked issue: #31

Performance implications of the current approach:

In avro's default impl GenericDatum, it use std::any for data store. When read and write, it use std::any_cast to cast to destination data type. It accounts for 40% to 50% of the performance hotspots.

Detailed optimize strategy:

Refer to the implementation of Apache Iceberg C++: Directly decode Avro data to Arrow array builders without GenericDatum. Eliminates the GenericDatum intermediate layer by directly calling Avro decoder methods and immediately appending to Arrow builders.

More TODO:

Support avro write in next PR.

Tests

API and Format

Documentation

@zjw1111 zjw1111 changed the title Support avro read feat(format): optimize avro read performance and support reading various nested types Jan 26, 2026
@lxy-9602
Copy link
Collaborator

Please enhance the PR description with more details, such as:

  • Performance implications of the current approach
  • Detailed optimize strategy
  • More TODO?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 63 out of 91 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

test/inte/scan_and_read_inte_test.cpp:2716

  • The new Avro integration tests (TestAvroWithAppendTable and TestAvroWithPkTable) check for file_format == "avro" at the beginning, but the GetTestValuesForScanAndReadInteTest function doesn't include avro in the returned test values. This means these tests will never execute. Consider adding avro test values similar to how ORC is conditionally added, for example:
#ifdef PAIMON_ENABLE_AVRO
    values.emplace_back("avro", false);
    values.emplace_back("avro", true);
#endif

src/paimon/format/avro/avro_format_writer.cpp:123

  • Inconsistent error message formatting: The error messages at lines 72, 84, 99, and 123 use "{}" placeholders but don't call fmt::format(). These should either use fmt::format() like line 70, 82, 97, and 121, or should use simple string concatenation with "+". For example, line 72 should be:
return Status::Invalid(fmt::format("avro format writer create failed: {}", e.what()));

The same fix applies to lines 84, 99, and 123.

        return Status::Invalid("avro format writer create failed: {}", e.what());
    } catch (...) {
        return Status::Invalid("avro format writer create failed: unknown exception");
    }
}

Status AvroFormatWriter::Flush() {
    try {
        writer_->flush();
    } catch (const ::avro::Exception& e) {
        return Status::Invalid(fmt::format("avro writer flush failed. {}", e.what()));
    } catch (const std::exception& e) {
        return Status::Invalid("avro writer flush failed: {}", e.what());
    } catch (...) {
        return Status::Invalid("avro writer flush failed: unknown exception");
    }

    return Status::OK();
}

Status AvroFormatWriter::Finish() {
    try {
        avro_output_stream_->FlushBuffer();  // we need flush buffer before close writer
        writer_->close();
    } catch (const ::avro::Exception& e) {
        return Status::Invalid(fmt::format("avro writer close failed. {}", e.what()));
    } catch (const std::exception& e) {
        return Status::Invalid("avro writer close failed: {}", e.what());
    } catch (...) {
        return Status::Invalid("avro writer close failed: unknown exception");
    }
    return Status::OK();
}

Result<bool> AvroFormatWriter::ReachTargetSize(bool suggested_check, int64_t target_size) const {
    return Status::NotImplemented("not support yet");
}

Status AvroFormatWriter::AddBatch(ArrowArray* batch) {
    assert(batch);
    PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::shared_ptr<arrow::Array> arrow_array,
                                      arrow::ImportArray(batch, data_type_));
    PAIMON_ASSIGN_OR_RAISE(std::vector<::avro::GenericDatum> datums,
                           adaptor_->ConvertArrayToGenericDatums(arrow_array, avro_schema_));
    try {
        for (const auto& datum : datums) {
            writer_->write(datum);
        }
    } catch (const ::avro::Exception& e) {
        return Status::Invalid(fmt::format("avro writer add batch failed. {}", e.what()));
    } catch (const std::exception& e) {
        return Status::Invalid("avro writer add batch failed: {}", e.what());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants