TAJO-2069: Implement finding the total size of all objects in a bucket with AWS SDK. #1024
blrunner wants to merge 30 commits into apache:master
|
Why did you close the PR? The conversation is lost. |
|
@blrunner OK |
…into TAJO-2069
|
I generated partitioned tables on HDFS, uploaded the output files to S3 with the AWS SDK, and finally created an external table on EC2. Here is my test environment.
|
|
I will test. Thanks |
…into TAJO-2069
|
I updated this PR as follows:
I found that it ran as expected on a local cluster and on EMR. It also calculated the volume of a multi-level partitioned table successfully with the following table: Additionally, I added code for comparing this PR and |
|
Guys, I found the reason for the improved performance. If the delimiter is not set, listObjects returns a list of summary information about the objects, which reduces the number of requests to AWS. Please see the comments below. |
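The effect described here can be reproduced with a toy model. The following is a minimal sketch, not the code in this PR and not the AWS SDK: `FakeBucket`, `listDir`, `listFlat`, and `ListingDemo` are hypothetical names, and the only S3 behavior assumed is that one list request returns at most 1000 keys. It compares a directory-by-directory walk (delimiter `/`) with a single prefix listing (no delimiter) and counts the simulated requests.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for an S3 bucket; this is NOT the AWS SDK. */
final class FakeBucket {
    static final int PAGE_SIZE = 1000; // assumption: at most 1000 keys per list request
    final TreeMap<String, Long> objects = new TreeMap<>(); // key -> size in bytes
    int requests = 0; // number of simulated list calls

    /** Result of one simulated list call. */
    static final class Page {
        final List<Long> sizes = new ArrayList<>();        // sizes of the returned objects
        final Set<String> subDirs = new LinkedHashSet<>(); // common prefixes (delimiter mode only)
        String nextMarker;                                 // null when the listing is complete
    }

    /** Delimiter "/" set: returns only the direct children of prefix.
     *  Simplification: assumes a directory holds fewer than PAGE_SIZE entries. */
    Page listDir(String prefix) {
        requests++;
        Page page = new Page();
        for (Map.Entry<String, Long> e : objects.tailMap(prefix, true).entrySet()) {
            String key = e.getKey();
            if (!key.startsWith(prefix)) break;
            int slash = key.indexOf('/', prefix.length());
            if (slash < 0) {
                page.sizes.add(e.getValue());               // object directly under prefix
            } else {
                page.subDirs.add(key.substring(0, slash + 1)); // roll up into a sub-directory
            }
        }
        return page;
    }

    /** No delimiter: returns every key under prefix, one page at a time. */
    Page listFlat(String prefix, String marker) {
        requests++;
        Page page = new Page();
        NavigableMap<String, Long> tail = (marker == null)
                ? objects.tailMap(prefix, true)
                : objects.tailMap(marker, false);
        String last = null;
        for (Map.Entry<String, Long> e : tail.entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            if (page.sizes.size() == PAGE_SIZE) { // more keys remain: hand back a marker
                page.nextMarker = last;
                break;
            }
            page.sizes.add(e.getValue());
            last = e.getKey();
        }
        return page;
    }
}

final class ListingDemo {
    /** One list call per directory, descending into every sub-directory. */
    static long sumRecursive(FakeBucket bucket, String prefix) {
        FakeBucket.Page page = bucket.listDir(prefix);
        long total = 0;
        for (long size : page.sizes) total += size;
        for (String sub : page.subDirs) total += sumRecursive(bucket, sub);
        return total;
    }

    /** One paginated listing over the whole prefix. */
    static long sumFlat(FakeBucket bucket, String prefix) {
        long total = 0;
        String marker = null;
        do {
            FakeBucket.Page page = bucket.listFlat(prefix, marker);
            for (long size : page.sizes) total += size;
            marker = page.nextMarker;
        } while (marker != null);
        return total;
    }

    /** Builds keys such as table/dt=0001/part-0 (layout invented for the demo). */
    static FakeBucket partitionedTable(int partitions, int filesPerPartition, long fileSize) {
        FakeBucket bucket = new FakeBucket();
        for (int p = 0; p < partitions; p++) {
            for (int f = 0; f < filesPerPartition; f++) {
                bucket.objects.put(String.format("table/dt=%04d/part-%d", p, f), fileSize);
            }
        }
        return bucket;
    }

    public static void main(String[] args) {
        FakeBucket bucket = partitionedTable(365, 2, 10L);
        long recursiveTotal = sumRecursive(bucket, "table/");
        int recursiveRequests = bucket.requests;
        bucket.requests = 0;
        long flatTotal = sumFlat(bucket, "table/");
        System.out.println("recursive: " + recursiveTotal + " bytes, " + recursiveRequests + " requests");
        System.out.println("flat:      " + flatTotal + " bytes, " + bucket.requests + " requests");
    }
}
```

In this model the recursive walk issues one list call per directory plus one for the root, while the flat listing needs one call per 1000 keys. With the AWS SDK for Java, the flat variant would roughly correspond to a `ListObjectsRequest` with a prefix and no delimiter, summing `S3ObjectSummary.getSize()` over each page of the `ObjectListing`.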
|
Thanks for sharing. It sounds reasonable. |
|
rebase please |
…into TAJO-2069 Conflicts: tajo-project/pom.xml
|
Rebased. :-) |
|
+1 LGTM! |
|
Thanks for your review. |
See following issues
When creating an external table, Tajo calls FileSystem::getContentSummary to get the table volume in TableSpace::createTable. This API calls the S3 client API to loop recursively over all subdirectories of the specified path, which becomes a huge bottleneck with a large partitioned table. We need to improve it for AWS Tajo users. Here are my benchmark results.
Configuration
Contents summary time
# of directories | S3AFileSystem | S3FileTableSpace | Improvement
-------------------|----------------------|--------------------------|-------------------
5 | 1056.5 ms | 136.2 ms | 7.8x
365 | 56549 ms | 153.8 ms | 367.7x
730 | 113007.5 ms | 193.2 ms | 585x
1095 | 168567 ms | 215.7 ms | 781.5x
1460 | 228129.5 ms | 234.2 ms | 974.1x
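
As a quick cross-check of the Improvement column, the ratios can be recomputed from the two timing columns. This is a small sketch; the class and method names (`BenchmarkRatios`, `speedup`, `msPerDirectory`) are made up for illustration, and the numbers are copied from the table above.

```java
import java.util.Locale;

final class BenchmarkRatios {
    // Rows copied from the table: {directories, S3AFileSystem ms, S3FileTableSpace ms}
    static final double[][] ROWS = {
        {5, 1056.5, 136.2},
        {365, 56549, 153.8},
        {730, 113007.5, 193.2},
        {1095, 168567, 215.7},
        {1460, 228129.5, 234.2},
    };

    /** S3AFileSystem time divided by S3FileTableSpace time. */
    static double speedup(double[] row) {
        return row[1] / row[2];
    }

    /** S3AFileSystem time spent per directory, in milliseconds. */
    static double msPerDirectory(double[] row) {
        return row[1] / row[0];
    }

    public static void main(String[] args) {
        for (double[] row : ROWS) {
            System.out.printf(Locale.ROOT,
                    "%4.0f directories: %.1fx improvement, %.1f ms per directory with S3AFileSystem%n",
                    row[0], speedup(row), msPerDirectory(row));
        }
    }
}
```

Recomputed this way, the S3AFileSystem time for the four larger runs works out to roughly 154 to 156 ms per directory, which would fit a cost of about one round trip per subdirectory, while the S3FileTableSpace time stays in the low hundreds of milliseconds.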