Skip to content

Conversation

@beryllw
Copy link

@beryllw beryllw commented Jun 9, 2023

What changes were proposed in this pull request?

The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
image
And hive
image

Why are the changes needed?

The behavior described above doesn't consistent with the behavior of Hive

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

add ut

@github-actions github-actions bot added the SQL label Jun 9, 2023
@beryllw
Copy link
Author

beryllw commented Jun 9, 2023

@cloud-fan @yaooqinn @gengliangwang @WangGuangxin Could you please help review this?

@beryllw beryllw changed the title Fix the trim logic did't handle ASCII control characters correctly [SPARK-32559]Fix the trim logic did't handle ASCII control characters correctly Jun 9, 2023
@srowen
Copy link
Member

srowen commented Jun 9, 2023

I'm not sure this is correct. SQL TRIM from other DBs seems to default to removing whitespace or the space char, not control codes.

@beryllw
Copy link
Author

beryllw commented Jun 10, 2023

I'm not sure this is correct. SQL TRIM from other DBs seems to default to removing whitespace or the space char, not control codes.

I conducted tests on both MySQL and Hive, and found that Hive supports trimming control characters and spaces but does not support trimming characters within the range 0x7F to 0x9F. MySQL, on the other hand, supports trimming specific control characters.
image

@yaooqinn yaooqinn changed the title [SPARK-32559]Fix the trim logic did't handle ASCII control characters correctly [SPARK-32559][SQL] Fix the trim logic did't handle ASCII control characters correctly Jun 12, 2023
Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Spark is a general engine supporting various kinds of data formats and sources, IMO, it's necessary to trim controls that are very likely to be involved during data exchange.

So, LGTM.

…/DateTimeUtilsSuite.scala

Co-authored-by: Kent Yao <yao@apache.org>
@yaooqinn yaooqinn closed this in 80588e4 Jun 13, 2023
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Copy link
Member

Thanks, merged to master and 3.4/3.3

@beryllw beryllw deleted the trim_bugfix branch June 14, 2023 02:06
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @Kwafoor , @yaooqinn and all.
SPARK-32559 is delivered as Apache Spark 3.0.1 patch.
We need a new JIRA issue because this will land at Apache Spark 3.5.0/3.4.1/3.3.3 newly.

@yaooqinn
Copy link
Member

yaooqinn commented Jun 20, 2023

@dongjoon-hyun Oops, My bad. For the current status, is it OK to link this pull request to a new jira?

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
@beryllw
Copy link
Author

beryllw commented Jul 12, 2023

Hi, @Kwafoor , @yaooqinn and all. SPARK-32559 is delivered as Apache Spark 3.0.1 patch. We need a new JIRA issue because this will land at Apache Spark 3.5.0/3.4.1/3.3.3 newly.

My mistake, should I raise a new issue to link this pr?

@yaooqinn
Copy link
Member

@Kwafoor Yes

@beryllw
Copy link
Author

beryllw commented Jul 12, 2023

@Kwafoor Yes

I created an issue https://issues.apache.org/jira/browse/SPARK-44383, but how to link this pull request to a new jira?

I saw that spark 3.4.1 was released, with this pr and the wrong issue name SPARK-32559.I’m very sorry about this.

image

@yaooqinn
Copy link
Member

I have manually resolved and re-targeted the links.

We probably need to push a new PR to the spark-website repo to fix the release note @Kwafoor

@beryllw
Copy link
Author

beryllw commented Jul 13, 2023

I have manually resolved and re-targeted the links.

We probably need to push a new PR to the spark-website repo to fix the release note @Kwafoor

I created PR-466 to the spark-website repo to fix the release note.

@yaooqinn
Copy link
Member

thank you @Kwafoor

GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants