This repository was archived by the owner on May 12, 2021. It is now read-only.

TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A. #932

Closed
blrunner wants to merge 4 commits into apache:master from blrunner:TAJO-2030
@blrunner commented Jan 7, 2016

S3 bulk listing is fully implemented in TajoS3FileSystem. To be honest, my code is heavily based on PrestoS3FileSystem. TajoS3FileSystem extends S3AFileSystem because PrestoS3FileSystem doesn't support some of the methods needed for writing files, for example FileSystem::mkdir.
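To illustrate why bulk listing can reduce round trips, here is a minimal sketch that models a flat S3 key space in memory and counts listing requests. It is not the TajoS3FileSystem code and does not call the AWS SDK; the class `BulkListingSketch`, its methods, the key layout, and the page size of 1000 keys per request are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of a flat S3 key space; not Tajo or AWS SDK code.
class BulkListingSketch {
    static final int PAGE_SIZE = 1000; // assumed maximum keys returned per listing request

    final TreeSet<String> keys = new TreeSet<>(); // object keys in sorted order
    int requests = 0;                             // number of simulated listing requests

    // One simulated request: up to PAGE_SIZE keys under prefix, strictly after marker.
    List<String> listPage(String prefix, String marker) {
        requests++;
        List<String> page = new ArrayList<>();
        String start = (marker == null) ? prefix : marker;
        boolean inclusive = (marker == null);
        for (String key : keys.tailSet(start, inclusive)) {
            if (!key.startsWith(prefix) || page.size() == PAGE_SIZE) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    // Bulk listing: page through every key under one prefix, with no delimiter.
    List<String> listAll(String prefix) {
        List<String> all = new ArrayList<>();
        String marker = null;
        while (true) {
            List<String> page = listPage(prefix, marker);
            all.addAll(page);
            if (page.size() < PAGE_SIZE) {
                return all;
            }
            marker = page.get(page.size() - 1);
        }
    }

    // Directory-style listing: at least one request per partition prefix.
    List<String> listPerPartition(List<String> partitionPrefixes) {
        List<String> all = new ArrayList<>();
        for (String prefix : partitionPrefixes) {
            all.addAll(listAll(prefix));
        }
        return all;
    }

    public static void main(String[] args) {
        BulkListingSketch store = new BulkListingSketch();
        List<String> partitions = new ArrayList<>();
        for (int i = 0; i < 334; i++) {
            String prefix = String.format("lineitem/p=%03d/", i); // hypothetical key layout
            partitions.add(prefix);
            store.keys.add(prefix + "data-0");
        }
        store.listPerPartition(partitions);
        System.out.println("per-partition requests: " + store.requests);
        store.requests = 0;
        store.listAll("lineitem/");
        System.out.println("bulk requests: " + store.requests);
    }
}
```

In this model, 334 single-file partitions need 334 requests when listed one directory at a time and a single request when listed in bulk. The model ignores latency, and a bulk listing of the table prefix also returns keys from partitions the query does not need.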

Here are my benchmark results.

Configuration

  • EC2 instance type: c3.xlarge
  • Tajo version: 0.12.0-SNAPSHOT
  • Cluster: 1 master, 1 worker
  • Partitions: generated by Hive

Queries

1 partition: select count(*) from lineitem where l_shipdate = '1992-01-02';
30 partitions: select count(*) from lineitem where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
90 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
334 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-12-01';
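The partition counts above can be cross-checked with a little date arithmetic. The sketch below assumes one partition per l_shipdate day and assumes the earliest partition is 1992-01-02, which is consistent with the first two queries but is not stated in this thread. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for checking the partition counts quoted for each query.
class PartitionCountSketch {
    // Assumption: the earliest day that has a partition.
    static final LocalDate FIRST_PARTITION = LocalDate.of(1992, 1, 2);

    // Counts daily partitions in [fromInclusive, toExclusive), ignoring days before FIRST_PARTITION.
    static long dailyPartitions(LocalDate fromInclusive, LocalDate toExclusive) {
        LocalDate start = fromInclusive.isBefore(FIRST_PARTITION) ? FIRST_PARTITION : fromInclusive;
        return Math.max(0, ChronoUnit.DAYS.between(start, toExclusive));
    }

    public static void main(String[] args) {
        LocalDate jan1 = LocalDate.of(1992, 1, 1);
        // Upper bounds of the range queries; each is exclusive (l_shipdate < bound).
        int[] endMonths = {4, 6, 12};
        for (int month : endMonths) {
            LocalDate end = LocalDate.of(1992, month, 1);
            System.out.println("< " + end + ": " + dailyPartitions(jan1, end) + " partitions");
        }
    }
}
```

Under these assumptions the ranges ending at 1992-04-01, 1992-06-01, and 1992-12-01 cover 90, 151, and 334 daily partitions. Note that 1992 is a leap year, so February contributes 29 days.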

Results: Partition Pruning

| Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement |
|---|---|---|---|
| 1 | 1088 ms | 607 ms | 1.79x |
| 30 | 5421 ms | 3414 ms | 1.58x |
| 90 | 15776 ms | 7927 ms | 1.99x |
| 151 | 24060 ms | 14912 ms | 1.61x |
| 334 | 45397 ms | 32247 ms | 1.40x |

Results: Query Finished Time

| Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement |
|---|---|---|---|
| 1 | 3.99 sec | 2.726 sec | 1.46x |
| 30 | 15.447 sec | 12.416 sec | 1.24x |
| 90 | 40.153 sec | 31.593 sec | 1.27x |
| 151 | 66.038 sec | 44.604 sec | 1.48x |
| 334 | 137.137 sec | 90.419 sec | 1.51x |
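The Improvement figures appear to be the S3AFileSystem time divided by the TajoS3FileSystem time, truncated (not rounded) to two decimals. The following sketch recomputes them on that assumption; the class and method names are hypothetical and not part of the patch.

```java
// Hypothetical helper that recomputes the Improvement column from the measured times.
class SpeedupSketch {
    // Baseline time divided by new time, truncated to two decimal places.
    static double truncatedSpeedup(double baseline, double improved) {
        return Math.floor(baseline / improved * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // Partition pruning times in milliseconds: {S3AFileSystem, TajoS3FileSystem}
        double[][] pruningMs = {
            {1088, 607}, {5421, 3414}, {15776, 7927}, {24060, 14912}, {45397, 32247}
        };
        for (double[] row : pruningMs) {
            System.out.println(truncatedSpeedup(row[0], row[1]) + "x");
        }
    }
}
```

Rounding instead of truncating would change a few entries, for example 5421 / 3414 is about 1.588, which rounds to 1.59x.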

@blrunner

This patch depends on hadoop-aws. I'm going to implement it afresh after resolving #953.

@blrunner closed this on Feb 12, 2016
@blrunner deleted the TAJO-2030 branch on February 17, 2016 15:21