Skip to content

Conversation

@arthurpassos
Copy link
Collaborator

@arthurpassos arthurpassos commented Apr 11, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improve performance of hive path parsing by using extractKeyValuePairs instead of regex (ClickHouse#79067 by @arthurpassos )

@arthurpassos
Copy link
Collaborator Author

I have not build it locally (only with the upstream version). Please read the comments in upstream PR before reviewing this one

@svb-alt svb-alt added the antalya-25.2.2 Planned for 25.2.2 release label Apr 11, 2025
@arthurpassos
Copy link
Collaborator Author

Linking is failing because library bridge seems to depend on VirtualColumnUtils, but is not linking KeyValuePairExtractor. In upstream this does not happen because of ClickHouse#76225 I suppose

ianton-ru
ianton-ru previously approved these changes Apr 13, 2025
Copy link
Member

@Enmk Enmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check comments

Comment on lines 154 to 156
.withItemDelimiters({'/'})
.withKeyValueDelimiter('=')
.build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if KeyValuePairExtractor instance has any state (and it looks like CHKeyValuePairExtractor does), but potentially here that state is going to be implicitly shared between multiple threads without any synchronization.

So either you have to do one of those:

  • there is really no sharing of extractor between threads
  • prove that there is no harm of sharing extractor
  • modify your code to make such sharing either impossible or safe (mutex ? )
  • create a new instance of extractor for each function call

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed this offline, state does not seem to be a problem, but let's wait for CICD

"/yet/another/path/k1=v1/k2=v2/k3=v3/k4=v4/k5=v5/"
};

TEST(VirtualColumnUtils, BenchmarkRegexParser)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results:

[BenchmarkExtractkvParser] 1000000 iterations across 5 paths took 131 ms
[BenchmarkRegexParser] 1000000 iterations across 5 paths took 729 ms
Process finished with exit code 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex impl from #735

@Enmk Enmk merged commit 2e344cb into antalya Apr 17, 2025
227 of 322 checks passed
@ianton-ru ianton-ru mentioned this pull request Apr 23, 2025
arthurpassos pushed a commit that referenced this pull request May 23, 2025
…_hive

Improve performance of hive path parsing by using extractKeyValuePairs instead of regex
arthurpassos pushed a commit that referenced this pull request May 23, 2025
…_hive

Improve performance of hive path parsing by using extractKeyValuePairs instead of regex
Enmk added a commit that referenced this pull request May 29, 2025
…_hive_25.3

25.3 Antalya port of #734 - Improve performance of hive path parsing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants