Low efficiency in parsing Protobuf and a possible optimization #9984

@xhl0726

Affected Version

0.12+ (all versions that support protobuf-extension)

Description

Protobuf (Protocol Buffers) is known as a fast mechanism for serializing structured data. To make ingestion more efficient, we tried protobuf-extension and wrote a simple benchmark to compare it with JSON. However, Protobuf turned out to be much slower.
[image: pb-json-original]

After investigating the function parseBatch in class ProtobufInputRowParser, we found that the parser first converts the protobuf message to JSON (specifically, a String) and then uses jsonParser to parse it. Despite protobuf's transmission advantage, this extra conversion step makes ingestion slower.

In order to achieve faster ingestion, we optimized the function parseBatch by transforming the protobuf to a map directly:

DynamicMessage message = DynamicMessage.parseFrom(descriptor, ByteString.copyFrom(input));

Map<String, Object> record = CollectionUtils.mapKeys(message.getAllFields(), k -> k.getJsonName());
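To illustrate the re-keying step in the snippet above without pulling in protobuf, here is a minimal stdlib-only sketch. It assumes the `mapKeys` call simply applies a function to every key of a map and keeps the values unchanged. The class `MapKeysSketch`, its `mapKeys` helper, and `toJsonName` are hypothetical stand-ins written for this example; `toJsonName` only approximates protobuf's default lowerCamelCase JSON naming and is not the library's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not the extension's actual code.
class MapKeysSketch {

    // Returns a new map whose keys are keyMapper(originalKey); values are untouched.
    // Fails fast if two original keys map to the same new key.
    static <K, K2, V> Map<K2, V> mapKeys(Map<K, V> input, Function<K, K2> keyMapper) {
        Map<K2, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : input.entrySet()) {
            K2 newKey = keyMapper.apply(entry.getKey());
            if (result.containsKey(newKey)) {
                throw new IllegalStateException("Conflicting key after mapping: " + newKey);
            }
            result.put(newKey, entry.getValue());
        }
        return result;
    }

    // Simplified approximation of a default JSON field name:
    // drop each underscore and upper-case the character that follows it.
    static String toJsonName(String fieldName) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : fieldName.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the field map of a parsed message (assumed field names).
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("event_time", 1L);
        fields.put("user_id", "abc");

        Map<String, Object> record = mapKeys(fields, MapKeysSketch::toJsonName);
        System.out.println(record); // {eventTime=1, userId=abc}
    }
}
```

The point of the sketch is that re-keying is a single pass over the fields with no intermediate String, which is where the direct approach would save work compared with serializing to JSON and parsing it again.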

We then wrote a benchmark to compare the two implementations. The optimized version reduces ingestion time by about 80%. The result is shown below:
[image: protobuf_optimized]

We also ran ProtobufInputRowParserTest to check whether the parsing result is correct. It shows that the result is correct as long as no JsonPathSpec is needed (to rename a key or extract a subset of the values). We think users can decide whether they have that need and then choose the appropriate parsing method for higher efficiency.

  • Machine info:
    1.7GHz Intel Core i7
    16 GB 2133 MHz LPDDR3
