-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](functions) impl scalar functions trim_in、ltrim_in and rtrim_in #41681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
TeamCity be ut coverage result: |
| order_qt_52 "SELECT rtrim_in(' hello ', 'lo ');" | ||
| order_qt_53 "SELECT rtrim_in('hello ', ' ');" | ||
| order_qt_54 "SELECT rtrim_in('hello ', 'l o');" | ||
| order_qt_55 "SELECT rtrim_in('hello ', 'l');" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add more test
- create a table and then insert the values to a table, and use select func(col) from table
- the column in the table, could be a nullable column and not null column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| }; | ||
| template <bool is_ltrim_in, bool is_rtrim_in, bool trim_single> | ||
| struct TrimInUtil { | ||
| static Status vector(const ColumnString::Chars& str_data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么不直接使用column string,而是把chars和offsets 分开
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to maintain the same processing method as the TrimUtil method, so no changes are made here.
struct TrimUtil { static Status vector(const ColumnString::Chars& str_data, const ColumnString::Offsets& str_offsets, const StringRef& remove_str, ColumnString::Chars& res_data, ColumnString::Offsets& res_offsets)
|
clang-tidy review says "All clean, LGTM! 👍" |
| const size_t offset_size = str_offsets.size(); | ||
| res_offsets.resize(offset_size); | ||
| res_data.reserve(str_data.size()); | ||
| std::bitset<256> char_lookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment to explain magic number 256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, there are two processing logics.
- The
simd::VStringFunctions::is_asciimethod is used to judge that when the string is all ascii, bitset<128> will be used for processing. The character range of the standard ASCII table is 0 to 127. Bitset<128> has exactly 128 bits, which is enough to represent all standard ASCII characters. - When the string is not all standard ascii, the utf-8 logic will be used for processing.It is especially important to note that when trimming on the right, according to the rules of UTF-8, the format of the UTF-8 trailing byte is 10xxxxxx. Use
(*prev_char_pos & 0xC0) == 0x80to determine whether the current byte is a trailing byte. When a byte that is not a trailing byte is found, this byte is the starting byte of the current character.
| res_data.reserve(str_data.size()); | ||
| std::bitset<256> char_lookup; | ||
| for (size_t i = 0; i < remove_str.size; ++i) { | ||
| char_lookup[static_cast<unsigned char>(remove_str.data[i])] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if chinese
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added utf-8 processing logic
|
clang-tidy review says "All clean, LGTM! 👍" |
zhiqiang-hhhh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
|
||
| for (size_t i = 0; i < offset_size; ++i) { | ||
| const char* str_begin = reinterpret_cast<const char*>( | ||
| str_data.data() + (i == 0 ? 0 : str_offsets[i - 1])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str_data.data() + str_offsets[i] is better, since PODArray guarantees access to -1 is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| res_offsets.resize(offset_size); | ||
| res_data.reserve(str_data.size()); | ||
|
|
||
| std::unordered_set<std::string> char_lookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential large momory comsuption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to string_view
|
PR approved by anyone and no changes requested. |
|
run buildall |
|
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 41467 ms |
TPC-DS: Total hot run time: 191690 ms |
ClickBench: Total hot run time: 33.23 s |
|
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
|
clang-tidy review says "All clean, LGTM! 👍" |
|
run buildall |
|
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 41520 ms |
TPC-DS: Total hot run time: 192329 ms |
ClickBench: Total hot run time: 32.76 s |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
PR approved by at least one committer and no changes requested. |
|
run buildall |
|
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 41398 ms |
TPC-DS: Total hot run time: 192384 ms |
ClickBench: Total hot run time: 32.53 s |
|
run buildall |
|
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 41202 ms |
TPC-DS: Total hot run time: 191866 ms |
ClickBench: Total hot run time: 33.81 s |
|
run external |
…_in (apache#41681) trim_in is different from trim Find and remove any characters in a set of characters at both ends of a string (regardless of order) mysql> SELECT TRIM('abcd', 'cde'); +---------------------+ | trim('abcd', 'cde') | +---------------------+ | abcd | +---------------------+ 1 row in set (0.02 sec) mysql> SELECT TRIM_IN('abcd', 'cde'); +------------------------+ | trim_in('abcd', 'cde') | +------------------------+ | ab | +------------------------+ 1 row in set (0.02 sec)
…_in (apache#41681) trim_in is different from trim Find and remove any characters in a set of characters at both ends of a string (regardless of order) mysql> SELECT TRIM('abcd', 'cde'); +---------------------+ | trim('abcd', 'cde') | +---------------------+ | abcd | +---------------------+ 1 row in set (0.02 sec) mysql> SELECT TRIM_IN('abcd', 'cde'); +------------------------+ | trim_in('abcd', 'cde') | +------------------------+ | ab | +------------------------+ 1 row in set (0.02 sec)
apache/doris#41681 # Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 # Languages - [x] Chinese - [x] English --------- Co-authored-by: KassieZ <139741991+KassieZ@users.noreply.github.com>
trim_in is different from trim
Find and remove any characters in a set of characters at both ends of a string (regardless of order)
mysql> SELECT TRIM('abcd', 'cde');
+---------------------+
| trim('abcd', 'cde') |
+---------------------+
| abcd |
+---------------------+
1 row in set (0.02 sec)
mysql> SELECT TRIM_IN('abcd', 'cde');
+------------------------+
| trim_in('abcd', 'cde') |
+------------------------+
| ab |
+------------------------+
1 row in set (0.02 sec)