Skip to content

[GLUTEN-6867][CH] Fix Bug that cann't read file on minio#9332

Merged
zzcclp merged 6 commits intoapache:mainfrom
baibaichen:feature/issue-6867
Apr 16, 2025
Merged

[GLUTEN-6867][CH] Fix Bug that cann't read file on minio#9332
zzcclp merged 6 commits intoapache:mainfrom
baibaichen:feature/issue-6867

Conversation

@baibaichen
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

The endpoint of Minio may not start with https://, so let's support http:// endpoint

(Fixes: #6867)

How was this patch tested?

Introduce GlutenClickHouseCacheBaseTestSuite and let GlutenClickHouseMINIOSuite and GlutenClickHouseHDFSSuite inherit from it.

@github-actions
Copy link
Copy Markdown

#6867

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 14 changed files in this pull request and generated no comments.

Files not reviewed (13)
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickHouseExcelFormatSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickHouseWholeStageTransformerSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/cache/GlutenClickHouseCacheBaseTestSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/cache/GlutenClickHouseHDFSSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/cache/GlutenClickHouseMINIOSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/mergetree/GlutenClickHouseMergeTreeCacheDataSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/mergetree/GlutenClickHouseMergeTreeWriteOnHDFSSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/mergetree/GlutenClickHouseMergeTreeWriteOnHDFSWithRocksDBMetaSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/mergetree/GlutenClickHouseMergeTreeWriteOnS3Suite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/execution/mergetree/GlutenClickHouseMergetreeWriteStatsSuite.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/utils/CacheTestHelper.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/utils/HDFSTestHelper.scala: Language not supported
  • backends-clickhouse/src/test/scala/org/apache/gluten/utils/MinioTestHelper.scala: Language not supported
Comments suppressed due to low confidence (1)

cpp-ch/local-engine/Storages/SubstraitSource/ReadBufferBuilder.cpp:528

  • [nitpick] The variable name 'end_point_start_with_http_or_https' is quite long; consider renaming it to a more concise and readable identifier such as 'validEndpointPrefix'.
bool end_point_start_with_http_or_https = endpoint.starts_with("https://") || endpoint.starts_with("http://");

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@baibaichen
Copy link
Copy Markdown
Contributor Author

@CodiumAI-Agent /review

@QodoAI-Agent
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

6867 - Partially compliant

Compliant requirements:

  • Support parquet table creation on minio by accepting http:// endpoints
  • Integration of helper classes (HDFSTestHelper, MinioTestHelper, CacheTestHelper) for reset and cleanup in tests

Non-compliant requirements:

  • None identified

Requires further human verification:

  • Validate that all caching and file system configurations are fully tested in various test suites
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Endpoint Format Check
The new code in the S3 read buffer builder now checks if the endpoint starts with "http://" or "https://". Verify that this logic correctly handles all expected endpoint formats without breaking existing S3 configurations.

Macro and Buffer Creation

The modifications under the USE_HDFS conditional and changes to the build methods introduce alternate paths in buffer creation. Ensure that the conditional compilation correctly covers both HDFS and non-HDFS scenarios and that the new lambda signatures meet the intended behaviors.

        start_end.second);
#if USE_HDFS
    /// If read buffer doesn't support right bounded reads, wrap it with BoundedReadBuffer to enable right bounded reads.
    if (dynamic_cast<DB::ReadBufferFromHDFS *>(read_buffer.get()) || dynamic_cast<DB::AsynchronousReadBufferFromHDFS *>(read_buffer.get())
        || dynamic_cast<DB::ReadBufferFromFile *>(read_buffer.get()))
        read_buffer = std::make_unique<DB::BoundedReadBuffer>(std::move(read_buffer));
#else
    if (dynamic_cast<DB::ReadBufferFromFile *>(read_buffer.get()))
        read_buffer = std::make_unique<DB::BoundedReadBuffer>(std::move(read_buffer));
#endif

@baibaichen
Copy link
Copy Markdown
Contributor Author

@CodiumAI-Agent /improve

@QodoAI-Agent
Copy link
Copy Markdown

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
General
Ensure correct trailing slash

Verify that the URL returned by hdfsHelper.getHdfsUrl includes a trailing slash when
needed, so that subsequent file path concatenations work as expected.

backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickHouseExcelFormatSuite.scala [1475]

-val tablePath = hdfsHelper.getHdfsUrl(s"$SPARK_DIR_NAME/write_into_hdfs/")
+val path = hdfsHelper.getHdfsUrl(s"$SPARK_DIR_NAME/write_into_hdfs")
+val tablePath = if (path.endsWith("/")) path else s"$path/"
Suggestion importance[1-10]: 6

__

Why: The suggestion improves robustness by ensuring the trailing slash is present for proper path concatenation. Although the original code already includes a trailing slash, the added check can prevent potential issues if the helper's return value ever varies.

Low
Normalize Minio endpoint URL

Adjust URL concatenation to avoid duplicate slashes by trimming the trailing slash
from MINIO_ENDPOINT before forming wholePath.

backends-clickhouse/src/test/scala/org/apache/gluten/utils/MinioTestHelper.scala [96]

-val wholePath: String = MINIO_ENDPOINT + BUCKET_NAME + "/"
+val wholePath: String = s"${MINIO_ENDPOINT.stripSuffix("/")}/$BUCKET_NAME/"
Suggestion importance[1-10]: 6

__

Why: This suggestion refines the URL formation by removing potential duplicate slashes, making the code more resilient to different endpoint formats. The change is minor but can help avoid errors in environments where MINIO_ENDPOINT might not already include a trailing slash.

Low

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@QodoAI-Agent
Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp merged commit c275b1e into apache:main Apr 16, 2025
8 checks passed
@baibaichen baibaichen deleted the feature/issue-6867 branch April 16, 2025 14:34
@GlutenPerfBot
Copy link
Copy Markdown
Contributor

===== Performance report for TPCDS SF2000 with Velox backend, for reference only ====

query log/native_master_04_16_2025_time.csv log/native_master_04_15_2025_c1d12e0c1f_time.csv difference percentage
q1 11.64 11.88 0.240 102.06%
q2 12.22 12.83 0.617 105.05%
q3 2.85 2.79 -0.055 98.07%
q4 52.37 50.67 -1.702 96.75%
q5 8.11 9.49 1.376 116.96%
q6 6.55 5.24 -1.314 79.94%
q7 6.07 5.57 -0.505 91.69%
q8 3.85 4.29 0.435 111.28%
q9 15.86 15.93 0.070 100.44%
q10 13.21 11.52 -1.689 87.22%
q11 27.56 28.02 0.456 101.66%
q12 2.09 2.10 0.008 100.38%
q13 6.34 5.89 -0.451 92.88%
q14a 45.29 43.07 -2.214 95.11%
q14b 39.64 40.17 0.524 101.32%
q15 2.58 3.09 0.512 119.85%
q16 5.49 6.17 0.680 112.39%
q17 5.38 4.97 -0.411 92.36%
q18 7.29 7.43 0.137 101.88%
q19 4.05 3.58 -0.470 88.39%
q20 1.67 3.08 1.409 184.41%
q21 1.56 0.77 -0.794 49.19%
q22 3.27 2.92 -0.349 89.32%
q23a 61.55 62.04 0.486 100.79%
q23b 71.28 71.81 0.530 100.74%
q24a 69.30 72.92 3.621 105.22%
q24b 65.27 65.05 -0.222 99.66%
q25 6.72 5.77 -0.945 85.93%
q26 2.47 2.39 -0.078 96.85%
q27 3.11 3.26 0.144 104.63%
q28 17.65 17.53 -0.120 99.32%
q29 7.51 7.79 0.285 103.79%
q30 6.14 5.11 -1.037 83.12%
q31 8.12 8.24 0.120 101.48%
q32 1.23 1.74 0.508 141.32%
q33 3.85 3.63 -0.212 94.47%
q34 4.08 3.87 -0.210 94.85%
q35 7.78 7.41 -0.377 95.16%
q36 2.10 2.65 0.548 126.13%
q37 3.08 2.94 -0.146 95.28%
q38 12.03 11.82 -0.211 98.25%
q39a 4.46 4.47 0.015 100.33%
q39b 3.79 3.52 -0.267 92.96%
q40 3.94 4.19 0.251 106.38%
q41 0.84 0.67 -0.165 80.30%
q42 0.59 0.93 0.332 155.93%
q43 1.94 1.94 0.003 100.16%
q44 6.05 5.64 -0.412 93.19%
q45 4.07 3.77 -0.304 92.54%
q46 3.83 3.96 0.130 103.40%
q47 10.53 12.42 1.889 117.93%
q48 3.24 3.49 0.246 107.58%
q49 5.49 5.96 0.466 108.48%
q50 18.11 17.26 -0.851 95.30%
q51 7.14 7.13 -0.010 99.87%
q52 1.11 0.86 -0.255 77.14%
q53 1.63 1.37 -0.261 83.97%
q54 5.76 5.54 -0.225 96.10%
q55 0.93 0.60 -0.326 64.91%
q56 3.92 3.80 -0.119 96.97%
q57 6.24 6.56 0.316 105.06%
q58 3.41 3.50 0.091 102.67%
q59 4.54 3.84 -0.700 84.59%
q60 4.27 4.33 0.055 101.29%
q61 4.47 4.19 -0.285 93.63%
q62 2.60 2.79 0.190 107.30%
q63 1.56 1.45 -0.108 93.05%
q64 36.86 36.94 0.081 100.22%
q65 11.14 10.98 -0.153 98.62%
q66 3.51 3.78 0.261 107.42%
q67 57.53 59.27 1.734 103.01%
q68 2.84 3.22 0.375 113.21%
q69 5.01 5.12 0.108 102.15%
q70 6.42 5.55 -0.877 86.34%
q71 3.67 5.41 1.740 147.41%
q72 20.58 20.04 -0.537 97.39%
q73 3.04 2.24 -0.796 73.78%
q74 17.38 16.99 -0.390 97.76%
q75 23.05 22.91 -0.137 99.41%
q76 6.56 6.48 -0.076 98.84%
q77 2.00 2.39 0.389 119.50%
q78 33.61 33.58 -0.026 99.92%
q79 3.12 3.32 0.198 106.35%
q80 12.72 10.36 -2.361 81.44%
q81 7.23 6.26 -0.974 86.54%
q82 5.41 5.38 -0.028 99.48%
q83 1.98 1.58 -0.394 80.05%
q84 2.41 3.00 0.595 124.71%
q85 6.83 5.89 -0.942 86.20%
q86 1.99 1.95 -0.031 98.44%
q87 11.37 11.81 0.442 103.89%
q88 15.71 15.53 -0.181 98.85%
q89 1.92 1.98 0.053 102.77%
q90 2.18 2.26 0.084 103.83%
q91 5.64 3.48 -2.166 61.61%
q92 1.44 1.92 0.477 133.13%
q93 24.45 23.76 -0.691 97.18%
q94 8.48 8.65 0.164 101.93%
q9 56.40 55.10 -1.294 97.71%
q5 2.11 1.97 -0.141 93.33%
q96 10.54 10.57 0.034 100.32%
q97 3.18 2.28 -0.892 71.91%
q98 5.39 5.27 -0.121 97.75%
q99 0.48 0.26 -0.222 53.83%
total 1176.87 1169.07 -7.804 99.34%

@GlutenPerfBot
Copy link
Copy Markdown
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_master_04_16_2025_time.csv log/native_master_04_15_2025_c1d12e0c1f_time.csv difference percentage
q1 25.59 25.58 -0.001 100.00%
q2 26.27 26.04 -0.229 99.13%
q3 33.00 33.26 0.267 100.81%
q4 27.81 27.07 -0.747 97.32%
q5 59.07 59.94 0.877 101.49%
q6 9.81 8.87 -0.937 90.45%
q7 37.72 38.10 0.375 101.00%
q8 63.91 64.94 1.037 101.62%
q9 88.66 90.65 1.990 102.24%
q10 40.91 42.95 2.046 105.00%
q11 17.00 16.06 -0.938 94.48%
q12 17.22 16.23 -0.991 94.25%
q13 24.20 24.82 0.621 102.56%
q14 12.54 12.48 -0.055 99.56%
q15 26.09 26.72 0.634 102.43%
q16 12.97 12.93 -0.048 99.63%
q17 73.49 74.14 0.646 100.88%
q18 113.97 112.23 -1.739 98.47%
q19 22.77 15.70 -7.066 68.96%
q20 23.47 25.84 2.371 110.10%
q21 170.29 170.94 0.650 100.38%
q22 10.56 10.21 -0.350 96.69%
total 937.30 935.71 -1.587 99.83%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH] Parquet table on minio not support

5 participants