Skip to content

Conversation

@dutyu
Copy link
Contributor

@dutyu dutyu commented Sep 16, 2023

Proposed changes

Relevant pr #23391 #21873

This pr mainly has these changes:

  • rename ExternalTable.getLatestUpdateTime() to ExternalTable.getUpdateTime(), because there is already a method called getUpdateTime() existed at TableIf, and the meaning is the same, better to merge these two methods to avoid ambiguity.

  • rename the field ExternalTable.lastestUpdateTime to schemaUpdateTime, and the default impl of ExternalTable.getUpdateTime() is just get the value of schemaUpdateTime, bacause schemaUpdateTime is the timestamp after scheme loading of external tables.

  • add a field named partitionUpdateTime at HMSExternalTable, update partitionUpdateTime when processing hms partition events, override getUpdateTime() of HMSExternalTable, return the max value between schemaUpdateTime and partitionUpdateTime. The partitionUpdateTime will be refreshed when (1. add partitions 2. delete partitions 3. alter partitions) with hms event listener enabled.

Now FE does not record the update time of hms tbl's partitons, so the sql cache may be hit even the hive table's partitions have changed. This pr add a field to record the partition update time, and use it when enable sql-cache.
The cache will be missed if any partition has changed at hive side.

Use System.currentTimeMillis() but not the event time of hms event because we would better keep the same measurement with the schemaUpdateTime of external table. Add this value to ExternalObjectLog and let slave FEs replay it because it is better to keep the same value with all FEs, so the sql-cache can be hit by the querys through different FEs.

I have test with following steps:

  1. Enable hms event listener, enable sql cache and query profile, submit a query with hms catalog two times (lh_test_p is a hive partition-table):
select count(0) from hive_safe_lycc.test.lh_test_p;

image

The second time will hit the sql cache:
image

The last update time of this hive table:
image

  1. Add a partition as hive side by :
alter table lh_test_p add partition (pday='20230920');

image

Wait some time for processing hms events, execute select count(0) from hive_safe_lycc.test.lh_test_p; again and the cache will be missed:
image

And the last update time has changed too:
image

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@dutyu
Copy link
Contributor Author

dutyu commented Sep 16, 2023

run buildall

1 similar comment
@dutyu
Copy link
Contributor Author

dutyu commented Sep 16, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.87 seconds
stream load tsv: 610 seconds loaded 74807831229 Bytes, about 116 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162251255 Bytes

@dutyu
Copy link
Contributor Author

dutyu commented Sep 17, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 51.36 seconds
stream load tsv: 615 seconds loaded 74807831229 Bytes, about 116 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162179919 Bytes

@dutyu dutyu force-pushed the add-partition-updatetime-for-hmstbl branch from ced8fb8 to 3aa5afb Compare September 19, 2023 02:56
@dutyu
Copy link
Contributor Author

dutyu commented Sep 19, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 49.31 seconds
stream load tsv: 603 seconds loaded 74807831229 Bytes, about 118 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162429709 Bytes

@dutyu dutyu force-pushed the add-partition-updatetime-for-hmstbl branch from 3aa5afb to 8420019 Compare September 26, 2023 11:51
@dutyu
Copy link
Contributor Author

dutyu commented Sep 26, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.41 seconds
stream load tsv: 557 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17161868593 Bytes

morningman
morningman previously approved these changes Oct 5, 2023
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 5, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

PR approved by anyone and no changes requested.

xinyiZzz
xinyiZzz previously approved these changes Oct 10, 2023
Copy link
Contributor

@xinyiZzz xinyiZzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xinyiZzz xinyiZzz closed this Oct 10, 2023
@xinyiZzz xinyiZzz reopened this Oct 10, 2023
@xinyiZzz
Copy link
Contributor

run buildall

@dutyu dutyu force-pushed the add-partition-updatetime-for-hmstbl branch from 0ad7aae to 72885f3 Compare October 10, 2023 10:05
@xinyiZzz
Copy link
Contributor

run buildall

@dutyu dutyu dismissed stale reviews from xinyiZzz and morningman via 103e980 October 10, 2023 11:07
@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Oct 10, 2023
@dutyu
Copy link
Contributor Author

dutyu commented Oct 10, 2023

run buildall

@dutyu dutyu requested review from morningman and xinyiZzz October 10, 2023 11:22
@dutyu
Copy link
Contributor Author

dutyu commented Oct 10, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45 seconds
stream load tsv: 560 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162326076 Bytes

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.12 seconds
stream load tsv: 555 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162391773 Bytes

Copy link
Contributor

@xinyiZzz xinyiZzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 10, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xinyiZzz xinyiZzz merged commit 1e6d34d into apache:master Oct 11, 2023
@xiaokang xiaokang added dev/2.0.3 usercase Important user case type label merge_conflict labels Oct 11, 2023
morningman pushed a commit to morningman/doris that referenced this pull request Oct 12, 2023
…use it at sql-cache. (apache#24491)

Now FE does not record the update time of hms tbl's partitons, so the sql cache may be hit even the hive table's partitions have changed. This pr add a field to record the partition update time, and use it when enable sql-cache.
The cache will be missed if any partition has changed at hive side.

Use System.currentTimeMillis() but not the event time of hms event because we would better keep the same measurement with the schemaUpdateTime of external table. Add this value to ExternalObjectLog and let slave FEs replay it because it is better to keep the same value with all FEs, so the sql-cache can be hit by the querys through different FEs.
xiaokang pushed a commit that referenced this pull request Oct 13, 2023
…use it at sql-cache. (#24491) (#25382)

Now FE does not record the update time of hms tbl's partitons, so the sql cache may be hit even the hive table's partitions have changed. This pr add a field to record the partition update time, and use it when enable sql-cache.
The cache will be missed if any partition has changed at hive side.

Use System.currentTimeMillis() but not the event time of hms event because we would better keep the same measurement with the schemaUpdateTime of external table. Add this value to ExternalObjectLog and let slave FEs replay it because it is better to keep the same value with all FEs, so the sql-cache can be hit by the querys through different FEs.

Co-authored-by: Xiangyu Wang <dut.xiangyu@gmail.com>
dutyu added a commit to dutyu/doris that referenced this pull request Oct 28, 2023
…use it at sql-cache. (apache#24491)

Now FE does not record the update time of hms tbl's partitons, so the sql cache may be hit even the hive table's partitions have changed. This pr add a field to record the partition update time, and use it when enable sql-cache.
The cache will be missed if any partition has changed at hive side.

Use System.currentTimeMillis() but not the event time of hms event because we would better keep the same measurement with the schemaUpdateTime of external table. Add this value to ExternalObjectLog and let slave FEs replay it because it is better to keep the same value with all FEs, so the sql-cache can be hit by the querys through different FEs.
@xiaokang xiaokang mentioned this pull request Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.3-merged merge_conflict reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants