[improvement](hdfs) support hedged read #22634

Merged

yiguolei merged 7 commits into apache:master from morningman:hedged_read

Aug 6, 2023

Contributor

morningman commented Aug 5, 2023

Proposed changes

When HDFS is under high load, reading data from it can take a long time,
which slows down the whole query. The HDFS client provides a hedged read feature:
if a read request has not returned within a configured threshold, the client starts
another read thread for the same data and uses whichever result returns first.

For example, hedged read can be enabled through catalog properties:

create catalog regression properties (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.16.47:7004',
    'dfs.client.hedged.read.threadpool.size' = '128',
    'dfs.client.hedged.read.threshold.millis' = '500'
);
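
To illustrate the mechanism described above, here is a minimal Python sketch of the hedging pattern. It is not the HDFS client's or Doris's implementation: `hedged_read` and `make_replica` are hypothetical names, and replicas are simulated with `time.sleep`. In this sketch the pool's `max_workers` plays roughly the role of `dfs.client.hedged.read.threadpool.size`, and `threshold_s` the role of `dfs.client.hedged.read.threshold.millis`.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_replica(delay_s, payload=b"block-data"):
    """Build a fake replica reader that returns `payload` after `delay_s` seconds."""
    def read():
        time.sleep(delay_s)
        return payload
    return read


def hedged_read(read_fns, pool, threshold_s):
    """Return (replica_index, data) from whichever read finishes first.

    The read on replica 0 starts immediately. Only if it has not finished
    within `threshold_s` is a second read started on replica 1.
    """
    futures = {pool.submit(read_fns[0]): 0}
    done, _ = wait(futures, timeout=threshold_s, return_when=FIRST_COMPLETED)
    if not done:
        if len(read_fns) > 1:
            # Threshold exceeded: hedge with a second read of the same data.
            futures[pool.submit(read_fns[1])] = 1
        # Block until either outstanding read completes.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    winner = next(iter(done))
    # Simplification: if the winner raised, the exception propagates here
    # instead of falling back to the other read.
    return futures[winner], winner.result()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        slow, fast = make_replica(0.5), make_replica(0.01)
        index, data = hedged_read([slow, fast], pool, threshold_s=0.05)
        print(f"replica {index} answered first with {len(data)} bytes")
```

The losing read is not cancelled in this sketch; it simply runs to completion in the pool, which is why the pool size bounds how many hedged reads can be in flight at once.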


morningman added 2 commits

August 5, 2023 15:19


[improvement](hdfs) support hdfs hedged read

c4f4061

615c605

morningman added the dev/2.0.1 label

morningman added 2 commits

August 5, 2023 15:28

75c5d84

Contributor

github-actions bot commented Aug 5, 2023

clang-tidy review says "All clean, LGTM! 👍"

90e134d

50f67c2

Contributor Author

morningman commented Aug 5, 2023

run buildall

6125cf2

Contributor Author

morningman commented Aug 5, 2023

run buildall

Contributor

hello-stephen commented Aug 5, 2023

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 45.23 seconds
stream load tsv: 514 seconds loaded 74807831229 Bytes, about 138 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162094491 Bytes

AshinGau approved these changes

github-actions bot added the approved label

Contributor

github-actions bot commented Aug 5, 2023

PR approved by at least one committer and no changes requested.

github-actions bot added the reviewed label

Contributor

github-actions bot commented Aug 5, 2023

PR approved by anyone and no changes requested.

yiguolei approved these changes

Contributor

yiguolei left a comment

LGTM

yiguolei merged commit d628bab into apache:master

xiaokang added merge_conflict dev/2.0.1-merged and removed dev/2.0.1 labels

xiaokang pushed a commit to xiaokang/doris that referenced this pull request


[improvement](hdfs) support hedged read (apache#22634)

5c82dd1

xiaokang pushed a commit that referenced this pull request


[improvement](hdfs) support hedged read (#22634)

f601afa

morningman added a commit to morningman/doris that referenced this pull request


[improvement](hdfs) support hedged read (apache#22634)

b9eecbb

Labels

approved dev/2.0.1-merged merge_conflict reviewed