Improve performance of hive path parsing by using extractKeyValuePairs instead of regex #734
I have not built it locally (only with the upstream version). Please read the comments in the upstream PR before reviewing this one.

Linking fails because the library bridge seems to depend on VirtualColumnUtils but does not link KeyValuePairExtractor. I suppose this does not happen upstream because of ClickHouse#76225.
Please check comments
src/Storages/VirtualColumnUtils.cpp
Outdated
    .withItemDelimiters({'/'})
    .withKeyValueDelimiter('=')
    .build();
I don't recall whether a KeyValuePairExtractor instance has any state (it looks like CHKeyValuePairExtractor does), but if it does, that state will be implicitly shared between multiple threads here without any synchronization.
So you have to do one of the following:
- show that there is really no sharing of extractor between threads
- prove that there is no harm in sharing extractor
- modify your code to make such sharing either impossible or safe (mutex?)
- create a new instance of extractor for each function call
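The last three options in the list above can be sketched in isolation. This is a minimal illustration only: StatefulExtractor is a hypothetical stand-in that keeps a scratch buffer between calls, not the real CHKeyValuePairExtractor, and the function names are made up for this example.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using KeyValues = std::vector<std::pair<std::string, std::string>>;

// Hypothetical stateful extractor: the scratch buffer is reused between
// calls, so two threads calling extract() on ONE instance would race.
class StatefulExtractor
{
public:
    KeyValues extract(std::string_view path)
    {
        scratch.clear(); // mutable state touched by every call
        std::size_t begin = 0;
        for (std::size_t slash = path.find('/'); slash != std::string_view::npos;
             slash = path.find('/', begin))
        {
            const std::string_view item = path.substr(begin, slash - begin);
            const std::size_t eq = item.find('=');
            if (eq != std::string_view::npos && eq > 0 && eq + 1 < item.size())
                scratch.emplace_back(std::string(item.substr(0, eq)), std::string(item.substr(eq + 1)));
            begin = slash + 1;
        }
        return scratch; // returns a copy, the buffer stays inside the instance
    }

private:
    KeyValues scratch;
};

// Option: a new instance for each function call (nothing is shared).
KeyValues parseWithFreshInstance(std::string_view path)
{
    StatefulExtractor extractor;
    return extractor.extract(path);
}

// Option: make sharing impossible by giving every thread its own instance.
KeyValues parseWithThreadLocalInstance(std::string_view path)
{
    thread_local StatefulExtractor extractor;
    return extractor.extract(path);
}

// Option: keep one shared instance but make the sharing safe with a mutex.
KeyValues parseWithLockedSharedInstance(std::string_view path)
{
    static StatefulExtractor extractor;
    static std::mutex extractor_mutex;
    const std::lock_guard<std::mutex> lock(extractor_mutex);
    return extractor.extract(path);
}
```

The per-call variant pays for construction on every call, the thread_local variant pays once per thread, and the mutex variant serializes all callers, which would work against the goal of this PR if the function is hot.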
We discussed this offline. State does not seem to be a problem, but let's wait for CI/CD.
    "/yet/another/path/k1=v1/k2=v2/k3=v3/k4=v4/k5=v5/"
};

TEST(VirtualColumnUtils, BenchmarkRegexParser)
Results:
[BenchmarkExtractkvParser] 1000000 iterations across 5 paths took 131 ms
[BenchmarkRegexParser] 1000000 iterations across 5 paths took 729 ms
Process finished with exit code 0
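For readers who want to reproduce the shape of this comparison without building ClickHouse, here is a rough standalone sketch using only the standard library. It is not the benchmark from this PR: parseWithRegex, parseWithSplit and timeMs are hypothetical names, std::regex stands in for the real regex implementation, and a hand-written splitter stands in for extractKeyValuePairs, so absolute timings will differ from the numbers above.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using KeyValues = std::vector<std::pair<std::string, std::string>>;

// Hypothetical regex parser: every "key=value" directory followed by '/'.
KeyValues parseWithRegex(const std::string & path)
{
    static const std::regex pattern("([^/=]+)=([^/]+)/");
    KeyValues result;
    for (std::sregex_iterator it(path.begin(), path.end(), pattern), end; it != end; ++it)
        result.emplace_back((*it)[1].str(), (*it)[2].str());
    return result;
}

// Hypothetical delimiter-based parser: items are separated by '/',
// key and value by the first '=' inside an item.
KeyValues parseWithSplit(std::string_view path)
{
    KeyValues result;
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t slash = path.find('/', begin);
        if (slash == std::string_view::npos)
            break; // whatever follows the last '/' is the file name
        const std::string_view item = path.substr(begin, slash - begin);
        const std::size_t eq = item.find('=');
        if (eq != std::string_view::npos && eq > 0 && eq + 1 < item.size())
            result.emplace_back(std::string(item.substr(0, eq)), std::string(item.substr(eq + 1)));
        begin = slash + 1;
    }
    return result;
}

// Runs the parser over all paths `iterations` times and returns elapsed milliseconds.
template <typename Parser>
long long timeMs(Parser parser, const std::vector<std::string> & paths, int iterations)
{
    volatile std::size_t sink = 0; // keeps the compiler from dropping the work
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        for (const auto & path : paths)
            sink = sink + parser(path).size();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling timeMs(parseWithRegex, paths, n) and timeMs(parseWithSplit, paths, n) on the same list of paths gives a like-for-like comparison. Both parsers should return identical pairs for well-formed hive paths, which is worth asserting before trusting any timing.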
Regex impl from #735
…_hive Improve performance of hive path parsing by using extractKeyValuePairs instead of regex
…_hive_25.3 25.3 Antalya port of #734 - Improve performance of hive path parsing
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve performance of hive path parsing by using extractKeyValuePairs instead of regex (ClickHouse#79067 by @arthurpassos )