optimize InputRowSerde #2047
Conversation
The type can be encoded as a number instead of a string; you could probably use ValueType.XXX.ordinal() as the encoding. io.druid.segment.column.ValueType is an enum that defines the types.
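The ordinal-based encoding suggested above can be sketched roughly as follows. This is an illustrative sketch, not the actual Druid code: the `TypeCodec` class and its methods are hypothetical, and the enum here is a stand-in for `io.druid.segment.column.ValueType`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in for io.druid.segment.column.ValueType.
enum ValueType { FLOAT, LONG, COMPLEX }

class TypeCodec {
    // Write the type as a single byte (its ordinal) instead of a string.
    static void writeType(DataOutput out, ValueType type) throws IOException {
        out.writeByte(type.ordinal());
    }

    // Recover the enum constant from its ordinal.
    static ValueType readType(DataInput in) throws IOException {
        return ValueType.values()[in.readByte()];
    }
}
```

One caveat with ordinal-based encodings: reordering or removing enum constants silently changes the wire format, so the enum order effectively becomes part of the serialization contract.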
I just changed it to not write the type at all.
looks good overall, faster is better
@binlijin for the sake of knowledge, why is there a performance gain?
@b-slim The performance gain comes from: (1) far fewer new Text(String) allocations, each of which re-encodes the String to UTF-8; (2) writing each dimension name only once.
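The two savings described above can be sketched as follows. This is a minimal illustration, not the patch itself: `RowWriter` and its method names are hypothetical, and the actual `InputRowSerde` layout differs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class RowWriter {
    // Encode the String to UTF-8 exactly once and write it directly,
    // instead of allocating a Hadoop Text wrapper per value.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Write the dimension name a single time, followed by all its values,
    // rather than repeating the name per value.
    static void writeDimension(DataOutput out, String name, List<String> values)
            throws IOException {
        writeString(out, name);
        out.writeInt(values.size());
        for (String v : values) {
            writeString(out, v);
        }
    }
}
```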
@binlijin @himanshug does this mean that the InputRowSerde introduced in #1472 caused a performance degradation? How does this new ser/de compare to just passing Text between the mapper and reducer (ignoring the combiner for a moment)?
@xvrl this is used in batch ingestion only. Batch ingestion runtime is mostly dominated by the actual indexing and by trips of data to/from HDFS, so serde time plays little role here. We did not notice any significant difference in our batch ingestion times after that patch. Also, this was really added so that we could support batch delta ingestion (and other arbitrary InputFormats that produce non-Text data); the combiner optimization was an added benefit.
Please add a log.warn(..) here so that we know if, in a real use case, this falls back to the for loop.
Also, it would be nice to have a UT that verifies the for loop is not run when the ordering of the AggregatorFactory[] stays the same between serialization and deserialization.
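The fast path / for-loop fallback under discussion can be sketched like this. The method and names below are hypothetical, not the actual patch code: if the metric at the expected position matches, it is used directly; only on a mismatch (ordering changed between ser and de) does the code warn and fall back to a linear scan.

```java
class MetricLookup {
    // Return the index of `name` in `serializedNames`, trying the expected
    // position first and scanning only when the ordering has changed.
    static int findMetricIndex(String[] serializedNames, String name, int expectedIndex) {
        if (expectedIndex < serializedNames.length
                && serializedNames[expectedIndex].equals(name)) {
            return expectedIndex; // ordering preserved: O(1) fast path
        }
        // In the real code this is where a log.warn(..) would fire,
        // signalling that the fallback loop is being used.
        for (int i = 0; i < serializedNames.length; i++) {
            if (serializedNames[i].equals(name)) {
                return i; // fallback for loop by name
            }
        }
        return -1; // metric not found
    }
}
```

A UT for this could serialize with one AggregatorFactory[] ordering, deserialize with the same ordering, and assert (e.g. via a counter or a spy on the warn log) that the scan never runs; a second case with a shuffled ordering would assert the fallback still resolves every metric correctly.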
Sigh, I don't know how to add a UT to verify it...
👍
I tested it with some data (1 million records, 100 dimensions, 20 metrics).
Before the patch:
serTime 68122 ms (InputRowSerde.toBytes)
deTime 76761 ms (InputRowSerde.fromBytes)
serde size 2721126345 bytes
With the patch:
serTime 34231 ms (InputRowSerde.toBytes)
deTime 27868 ms (InputRowSerde.fromBytes)
serde size 2048581349 bytes