
optimize InputRowSerde #2047

Merged
himanshug merged 1 commit into apache:master from binlijin:master on Dec 9, 2015

Conversation

@binlijin
Contributor

@binlijin binlijin commented Dec 5, 2015

I tested it with some data (1 million records, 100 dimensions, 20 metrics):
serTime 68122 ms (InputRowSerde.toBytes)
deTime 76761 ms (InputRowSerde.fromBytes)
serde size 2721126345 bytes
With the patch:
serTime 34231 ms (InputRowSerde.toBytes)
deTime 27868 ms (InputRowSerde.fromBytes)
serde size 2048581349 bytes
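Numbers like these can be gathered with a plain wall-clock loop. Below is a minimal sketch of the measurement shape only; the `toBytes`/`fromBytes` stubs are hypothetical stand-ins (the real `InputRowSerde` methods operate on Druid `InputRow` objects with an `AggregatorFactory[]`, not on strings):

```java
import java.nio.charset.StandardCharsets;

public class SerdeTimingSketch {
    // Hypothetical stand-in for InputRowSerde.toBytes: just UTF-8 encoding here.
    public static byte[] toBytes(String row) {
        return row.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical stand-in for InputRowSerde.fromBytes.
    public static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        long serNanos = 0, deNanos = 0, totalBytes = 0;
        for (int i = 0; i < 1_000_000; i++) {
            String row = "row-" + i;
            long t0 = System.nanoTime();
            byte[] bytes = toBytes(row);       // accumulate serialization time
            long t1 = System.nanoTime();
            String back = fromBytes(bytes);    // accumulate deserialization time
            long t2 = System.nanoTime();
            serNanos += t1 - t0;
            deNanos += t2 - t1;
            totalBytes += bytes.length;
            if (!back.equals(row)) {
                throw new IllegalStateException("roundtrip mismatch");
            }
        }
        System.out.println("serTime " + serNanos / 1_000_000 + " ms");
        System.out.println("deTime " + deNanos / 1_000_000 + " ms");
        System.out.println("serde size " + totalBytes + " bytes");
    }
}
```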

Contributor


The type can be encoded as a number instead of a string; you can probably use ValueType.XXX.ordinal() as the encoding. io.druid.segment.column.ValueType is an enum that defines the types.
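The suggestion can be sketched with a stand-in enum (the real `io.druid.segment.column.ValueType` lives in Druid): write the ordinal as a single byte and index back into `values()` on read. Note that ordinal encoding is only stable as long as the enum's declaration order never changes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class OrdinalEncodingSketch {
    // Stand-in for io.druid.segment.column.ValueType; the real enum lives in Druid.
    public enum ValueType { FLOAT, LONG, STRING, COMPLEX }

    // One byte instead of a length-prefixed UTF-8 string like "long".
    public static void writeType(DataOutput out, ValueType type) throws IOException {
        out.writeByte(type.ordinal());
    }

    // Decode by indexing back into values(); only safe while declaration order is frozen.
    public static ValueType readType(DataInput in) throws IOException {
        return ValueType.values()[in.readByte()];
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeType(new DataOutputStream(buf), ValueType.LONG);
        ValueType decoded = readType(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(decoded + " encoded in " + buf.size() + " byte(s)"); // LONG encoded in 1 byte(s)
    }
}
```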

Contributor Author


I just changed it to not write the type at all.

@himanshug
Contributor

Looks good overall; faster is better.

@b-slim
Contributor

b-slim commented Dec 5, 2015

@binlijin for the sake of knowledge, why is there a performance gain?

@binlijin
Contributor Author

binlijin commented Dec 7, 2015

@b-slim The performance gain comes from: (1) calling new Text(String), which encodes the String to UTF-8, as rarely as possible; (2) writing each dimension name only once.
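The second point can be illustrated with plain `DataOutputStream` calls (the actual patch deals with Hadoop `Text`/`Writable` types; this is only a size-comparison sketch): write the dimension name once, followed by a value count, instead of repeating the name next to every value.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class DimSerdeSketch {
    // Naive layout: the dimension name is re-encoded next to every value.
    public static byte[] naive(String dim, List<String> values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (String v : values) {
            out.writeUTF(dim); // name encoded once per value
            out.writeUTF(v);
        }
        return buf.toByteArray();
    }

    // Patched layout: name once, then a value count, then the values.
    public static byte[] nameOnce(String dim, List<String> values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(dim);           // name encoded exactly once
        out.writeInt(values.size()); // so the reader knows how many values follow
        for (String v : values) {
            out.writeUTF(v);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> vals = Arrays.asList("a", "b", "c", "d");
        System.out.println(naive("country", vals).length);    // 48
        System.out.println(nameOnce("country", vals).length); // 25
    }
}
```

With many values per dimension, the savings grow linearly with the number of values, which is consistent with the ~25% size reduction reported above.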

@binlijin binlijin closed this Dec 8, 2015
@binlijin binlijin reopened this Dec 8, 2015
@xvrl
Member

xvrl commented Dec 8, 2015

@binlijin @himanshug does this mean that the InputRowSerde introduced in #1472 caused a performance degradation? How does this new ser/de compare to just passing Text between the mapper and reducer (ignoring the combiner for a moment)?

@himanshug
Contributor

@xvrl this is used in batch ingestion only. Batch ingestion runtime mostly depends on the actual indexing and on trips of data to/from HDFS, so ser/de time plays little role here. We did not notice any significant difference in our batch ingestion times after that patch. Also, this was really added so that we could support batch delta ingestion (and other arbitrary InputFormats that produce non-Text data); the combiner optimization was an added benefit.

Contributor


Please add a log.warn(..) here so that we know if, in a real use case, this falls back to the for loop.
It would also be nice to have a UT verifying that when the ordering of AggregatorFactory[] stays the same between serialization and deserialization, the for loop is not run.
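A hedged sketch of the fallback being described, using a `String[]` of metric names in place of the actual `AggregatorFactory[]` and `java.util.logging` in place of Druid's logger:

```java
import java.util.logging.Logger;

public class MetricLookupSketch {
    private static final Logger log = Logger.getLogger(MetricLookupSketch.class.getName());

    // Fast path: assume the factory at expectedIndex matches the serialized name.
    // Otherwise warn (so real-world fallbacks show up in logs) and scan linearly.
    public static int findMetricIndex(String[] factoryNames, String serializedName, int expectedIndex) {
        if (expectedIndex < factoryNames.length
            && factoryNames[expectedIndex].equals(serializedName)) {
            return expectedIndex;
        }
        log.warning("AggregatorFactory ordering changed between ser and deser; "
            + "falling back to linear search for [" + serializedName + "]");
        for (int i = 0; i < factoryNames.length; i++) {
            if (factoryNames[i].equals(serializedName)) {
                return i;
            }
        }
        return -1; // metric unknown to this AggregatorFactory set
    }

    public static void main(String[] args) {
        String[] names = {"count", "sum_x", "max_y"};
        System.out.println(findMetricIndex(names, "sum_x", 1)); // 1 (fast path, no warning)
        System.out.println(findMetricIndex(names, "max_y", 0)); // 2 (warns, fallback loop)
    }
}
```

A UT along these lines could assert the fast path with matching order and the warning/linear scan with shuffled order.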

Contributor Author


Sigh, I don't know how to add a UT to verify it...

@binlijin binlijin closed this Dec 9, 2015
@binlijin binlijin reopened this Dec 9, 2015
@fjy
Contributor

fjy commented Dec 9, 2015

👍

himanshug added a commit that referenced this pull request Dec 9, 2015
@himanshug himanshug merged commit f29c25b into apache:master Dec 9, 2015
@fjy fjy modified the milestone: 0.9.0 Feb 4, 2016
@fjy fjy mentioned this pull request Feb 5, 2016