[branch-4.0](hive-insert) backport hive insert related PRs to branch-4.0 #34371
Merged: dataroaring merged 13 commits into apache:branch-4.0-preview from morningman:branch-4.0-preview-writeback on May 1, 2024
Conversation
…tioned tables (apache#33338): support partition by:

```
create table tb1 (c1 string, ts datetime)
engine = iceberg
partition by (c1, day(ts)) ()
properties ("a"="b")
```
…ms. (apache#33397) Issue Number: apache#31442

Change the table sink exchange rebalancer params to node level and adjust them to improve write performance through better balancing. Rebalancer params:

```
DEFINE_mInt64(table_sink_partition_write_min_data_processed_rebalance_threshold, "26214400"); // 25MB
// Minimum partition data processed to rebalance writers in exchange when partition writing
DEFINE_mInt64(table_sink_partition_write_min_partition_data_processed_rebalance_threshold, "15728640"); // 15MB
```
Add transaction statistics for profile (apache#33488):
1. commit total time
2. fs operator total time
   - rename file count
   - rename dir count
   - delete dir count
3. add partition total time
   - add partition count
4. update partition total time
   - update partition count

For example:
```
- Transaction Commit Time: 906ms
- FileSystem Operator Time: 833ms
- Rename File Count: 4
- Rename Dir Count: 0
- Delete Dir Count: 0
- HMS Add Partition Time: 0ms
- HMS Add Partition Count: 0
- HMS Update Partition Time: 68ms
- HMS Update Partition Count: 4
```
Issue apache#31442: add iceberg transaction implementation (apache#33629)
…#33666) Issue Number: apache#31442. Hive3 supports creating a table with column default values; when using Hive3, we can write default values to the table.
Refactor the filesystem interface (apache#33361):
1. Rename `list` to `globList`. The path passed to this method must contain a wildcard, and the corresponding HDFS interface is `globStatus`, so the new name is `globList`.
2. If you only need to list files under a path, use the `listFiles` operation.
3. Merge the `listLocatedFiles` function into the `listFiles` function.
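The distinction behind the rename can be sketched as follows. This is an illustrative Python sketch, not the Doris Java code: `glob_list` and `list_files` are hypothetical names standing in for the `globList`/`listFiles` split described above (wildcard matching like HDFS `globStatus` vs. plain recursive listing).

```python
import fnmatch
import os
import tempfile

def glob_list(root, pattern):
    """List entries whose names match a wildcard pattern (globStatus-like)."""
    return sorted(
        name for name in os.listdir(root) if fnmatch.fnmatch(name, pattern)
    )

def list_files(root):
    """Plain listing of all files under a directory, recursively."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for f in files:
            found.append(os.path.join(dirpath, f))
    return sorted(found)

# Demo on a throwaway directory layout.
with tempfile.TemporaryDirectory() as root:
    for name in ("part-0.parquet", "part-1.parquet", "_SUCCESS"):
        open(os.path.join(root, name), "w").close()
    print(glob_list(root, "part-*.parquet"))  # only the wildcard matches
    print(len(list_files(root)))              # all 3 files
```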
Refine the meta cache (apache#33449):
1. Use `caffeine` instead of `guava cache` to get better performance.
2. Add a new class `CacheFactory`. All (Async)LoadingCache instances should be built from `CacheFactory`.
3. Use separate executors for different caches:
   1. `rowCountRefreshExecutor`: for the row count cache. The row count cache is an async loading cache, and we can ignore the result on a cache miss or when the thread pool is full, so it gets its own executor.
   2. `commonRefreshExecutor`: for the other caches, which are sync loading caches. `commonRefreshExecutor` is used for async refresh. That is, if a cache entry is missing, the value is loaded synchronously in the caller thread; if an entry needs refresh, it is reloaded in `commonRefreshExecutor`.
   3. `fileListingExecutor`: file listing is a heavy operation, so it gets a separate executor. For the file cache, the refresh operation still uses `commonRefreshExecutor` to trigger the refresh, and `fileListingExecutor` does the actual file listing.
4. Change the refresh and expire logic of the caches. For most caches, set the `refreshAfterWrite` strategy, so that even if a cache entry is stale, the old entry can still be served while the new entry is being loaded.
5. Add a new global variable `enable_get_row_count_from_file_list`. Default is true; if false, getting the row count from the file list is disabled.
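The refresh-after-write behavior described in point 4 can be modeled with a toy sketch. This is an illustrative Python model, not Caffeine and not the Doris Java code: a miss loads synchronously in the caller, while a stale hit still returns the old value for the current call and triggers a reload (Caffeine would hand that reload to an executor such as `commonRefreshExecutor`; here it runs inline for determinism).

```python
import time

class RefreshAfterWriteCache:
    """Toy model of refresh-after-write semantics (hypothetical class)."""

    def __init__(self, loader, refresh_after_s, clock=time.monotonic):
        self.loader = loader
        self.refresh_after_s = refresh_after_s
        self.clock = clock
        self.entries = {}  # key -> (value, write_time)

    def get(self, key):
        now = self.clock()
        if key not in self.entries:
            # Cache miss: load synchronously in the caller thread.
            value = self.loader(key)
            self.entries[key] = (value, now)
            return value
        value, written = self.entries[key]
        if now - written >= self.refresh_after_s:
            # Stale: serve the old value now, refresh for future calls.
            self.entries[key] = (self.loader(key), now)
        return value

# Demo with a fake clock so the timing is deterministic.
ticks = [0.0]
loads = []
def loader(key):
    loads.append(key)
    return f"v{len(loads)}"

cache = RefreshAfterWriteCache(loader, refresh_after_s=10, clock=lambda: ticks[0])
print(cache.get("t"))   # miss: loads and returns v1
ticks[0] = 5
print(cache.get("t"))   # fresh hit: v1, no reload
ticks[0] = 15
print(cache.get("t"))   # stale hit: old v1 is still served while reloading
print(cache.get("t"))   # refreshed value v2 is now visible
```

The point of the strategy is the third call: callers never block on a stale entry, they keep getting the old value until the reload lands.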
Issue apache#31442 (apache#33798): 1. delete files according to the query id; 2. delete the write path after insert.
… bucket mechanism and support different uri styles by flags. (apache#33858)

Many domestic cloud vendors are compatible with the S3 protocol. However, early versions of the S3 client would only generate path-style HTTP requests (aws/aws-sdk-java-v2#763) when given endpoints that do not start with `s3`, while some cloud vendors only support virtual-host-style HTTP requests. Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the URI into a virtual-hosted path, implemented through path style. For example, the s3 uri `s3://my-bucket/data/file.txt` was eventually parsed into:

- virtualBucket: `my-bucket`
- Bucket: `data` (the bucket must be set, otherwise the s3 client reports an error; this step is particularly tricky because of the limitations of the s3 client)
- Key: `file.txt`

Path-style mode was then used to generate an HTTP request similar to virtual-host style, by setting the endpoint to virtualBucket + original endpoint and setting the bucket and key. **The bucket and key here are inconsistent with the original S3 concepts, but the AWS client happens to generate a virtual-host-like HTTP request through path-style mode.**

However, after apache#30799 we upgraded the aws sdk version from 2.17.257 to 2.20.131. The current AWS S3 client can already generate virtual-host-style HTTP requests for third-party endpoints by default, so apache#31111 had to set the path-style option to let the S3 client keep working with Doris' virtual bucket mechanism. **Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of the S3 client.**

### Resolution:

Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support different URI styles via flags. This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.
#### For AWS S3, common URI styles:

- AWS Client Style (Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
- Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

For the common styles above, `isPathStyle` controls whether to use path style or virtual-host style. Virtual-host style is the currently mainstream and recommended approach, so the default value of `isPathStyle` is false.

#### Other Styles:

- Virtual Host AWS Client (Hadoop S3) Mixed Style: `s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path AWS Client (Hadoop S3) Mixed Style: `s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`

For these two styles, `isPathStyle` and `forceParsingByStandardUri` together control which one is used:

- Virtual Host AWS Client (Hadoop S3) Mixed Style: `isPathStyle = false && forceParsingByStandardUri = true`
- Path AWS Client (Hadoop S3) Mixed Style: `isPathStyle = true && forceParsingByStandardUri = true`

When the incoming location is URL encoded, the encoded string will be returned: `getKey()` and `getQueryParams()` return the encoded string.
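How the flag changes the bucket/key split can be sketched as follows. This is an illustrative Python sketch, not the Java `S3URI` class: `parse_s3_location` is a hypothetical helper, and it only covers the three common styles above (no `forceParsingByStandardUri` handling, no query-parameter parsing).

```python
from urllib.parse import urlparse

def parse_s3_location(uri, is_path_style=False):
    """Split an S3 location into (bucket, key) for the common URI styles."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # AWS client style: s3://bucket/key
        return parsed.netloc, parsed.path.lstrip("/")
    if is_path_style:
        # Path style: the bucket is the first path segment.
        bucket, _, key = parsed.path.lstrip("/").partition("/")
        return bucket, key
    # Virtual-host style: the bucket is the first host label.
    bucket = parsed.netloc.split(".", 1)[0]
    return bucket, parsed.path.lstrip("/")

print(parse_s3_location("s3://my-bucket/path/to/file"))
# ('my-bucket', 'path/to/file')
print(parse_s3_location("https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt"))
# ('my-bucket', 'resources/doc.txt')
print(parse_s3_location("https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt",
                        is_path_style=True))
# ('my-bucket', 'resources/doc.txt')
```

Note that all three inputs resolve to the same bucket and key; the flag only tells the parser where to look for the bucket.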
…he#34278) Rename `_temp_<table_name>` to `_temp_<queryid>_<table_name>`, to prevent a collision with a user table actually named `_temp_<table_name>` and to partition the temp dir by query.
Issue Number: apache#31442 [Feature] (hive-writer) Implements s3 file committer. (apache#33937)

The S3 committer starts multipart uploads of all files on the BE side, and then completes those multipart uploads on the FE side. If the multipart upload of a file is never completed, the file is not visible. In this way the atomicity of a single file is guaranteed, but the atomicity of multiple files still is not. Because hive committers have best-effort semantics, this shortens the inconsistency time window.

## ChangeList:
- Add `used_by_s3_committer` in `FileWriterOptions` on the BE side to start multipart uploads of files, then complete them on the FE side.
- `cosn://` uses the s3 client on the FE side, because it needs to complete multipart uploads on the FE side.
- Add `Status directoryExists(String dir)` and `Status deleteDirectory` in `FileSystem`.
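The single-file atomicity the committer relies on can be illustrated with a toy model. This is an illustrative Python sketch, not the AWS SDK and not Doris' committer: `ToyObjectStore` is a hypothetical class modeling the one property that matters here, that uploaded parts are invisible until the multipart upload is completed.

```python
class ToyObjectStore:
    """Toy object store: parts stay invisible until complete_multipart()."""

    def __init__(self):
        self.objects = {}   # visible objects: key -> bytes
        self.pending = {}   # upload_id -> (key, list of parts)
        self._next_id = 0

    def create_multipart(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self.pending[upload_id] = (key, [])
        return upload_id

    def upload_part(self, upload_id, data):
        self.pending[upload_id][1].append(data)

    def complete_multipart(self, upload_id):
        key, parts = self.pending.pop(upload_id)
        # The object becomes visible in a single step.
        self.objects[key] = b"".join(parts)

# "BE side": start the upload and push all parts.
store = ToyObjectStore()
uid = store.create_multipart("warehouse/tb1/part-0.parquet")
store.upload_part(uid, b"row-group-1")
store.upload_part(uid, b"row-group-2")
print("warehouse/tb1/part-0.parquet" in store.objects)  # False: not visible yet

# "FE side": complete the upload at commit time; only now the file appears.
store.complete_multipart(uid)
print("warehouse/tb1/part-0.parquet" in store.objects)  # True
```

Each file flips from invisible to visible atomically at completion time, but nothing coordinates the completion of several files, which is why multi-file atomicity still cannot be guaranteed.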
backport:
- Feature: Implements s3 file committer. (#33937)
- feature: Load index data into index cache when writing data (#34046)
- improvement: add the `queryid` to the temporary file path (#34278)
- Enhancement: Rewrite `S3URI` to remove tricky virtual bucket mechanism and support different uri styles by flags. (#33858)
- bugfix: delete write path after hive insert (#33798)
- opt: refine the meta cache (#33449)
- refactor: refactor `filesystem` interface (#33361)
- feature: support default value when create hive table (#33666)
- feature: add iceberg transaction implement (#33629)
- feature: add transaction statistics for profile (#33488)
- Enhancement: Adjust table sink exchange rebalancer params. (#33397)
- feature: The new DDL syntax is added to create iceberg partitioned tables (#33338)
- feature: use optional location and add hive regression test (#33153)