Description
Behavior changes
- The default connection pool size for JDBC Catalog has been adjusted from 10 to 30. ([branch-2.1][improvement](jdbc catalog) Modify the maximum number of connections in the connection pool to 30 by default #37023). When creating a JDBC Catalog, the default value of the connection_pool_max_size parameter has been changed to 30 to avoid connection pool exhaustion in high-concurrency scenarios.
- The minimum value of the system's reserved memory, also known as the low water mark, has been adjusted to min(6.4G, MemTotal * 5%) to better prevent BE OOM (Out-Of-Memory) issues.
- The processing logic for multiple statements in a single request has been modified. When the client does not set the CLIENT_MULTI_STATEMENTS flag, only the result of the last statement will be returned instead of all statements.
- Direct modification of data in asynchronous materialized views is no longer allowed. ([enhance](mtmv) not allow modify data of MTMV (#35870) #37129)
- A session variable use_max_length_of_varchar_in_ctas has been added to control the behavior of varchar and char type length generation during CTAS (Create Table As Select). The default value is true. When set to false, the derived varchar length is used instead of the maximum length. ([opt](ctas) add a variable to control varchar length in ctas (#37069) #37284)
- Statistics collection now defaults to enabling the functionality of estimating the number of rows in Hive tables based on file size. ([improvement](statistics)Enable estimate hive table row count using file size. (#37218) #37694)
- The transparent rewrite mechanism for asynchronous materialized views is now enabled by default. ([opt](mtmv) Set query rewrite by materialized view default enable #35897)
- Transparent rewrite can use partitioned materialized views. If some partitions of the partitioned materialized view are invalid, the default behavior is to UNION ALL the base tables with the materialized view to ensure the correctness of query data. ([opt](mtmv) Set query rewrite by materialized view default enable #35897)
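The low water mark formula above, min(6.4G, MemTotal * 5%), can be worked through with a small sketch. This is only an illustration of the arithmetic, not the BE implementation: the function name `low_water_mark_bytes` is hypothetical, and reading "G" as GiB is an assumption.

```python
GIB = 1024 ** 3  # assumption: "G" in the release note is read as GiB

# Cap from the release note (6.4G), kept as an integer byte count.
CAP_BYTES = 64 * GIB // 10


def low_water_mark_bytes(mem_total_bytes: int) -> int:
    """Return min(6.4G, MemTotal * 5%) in bytes, using integer arithmetic."""
    five_percent = mem_total_bytes * 5 // 100
    return min(CAP_BYTES, five_percent)


if __name__ == "__main__":
    # Hypothetical machine sizes, chosen to land on both sides of the cap.
    for total_gib in (32, 64, 128, 256):
        reserve = low_water_mark_bytes(total_gib * GIB)
        print(f"MemTotal={total_gib} GiB -> low water mark={reserve / GIB:.2f} GiB")
```

Under these assumptions the 5% term applies to machines with up to 128 GiB of memory, and the 6.4G cap takes over above that.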
New features
Lakehouse
- The session variable read_csv_empty_line_as_null can be used to control whether empty lines are ignored when reading CSV format files. ([Fix](csv_reader) Add a session variable to control whether empty rows in CSV files are read as NULL values #37153) By default, empty lines are ignored. When set to true, empty lines will be read as rows where all columns are null.
- Added compatibility with Presto's complex type output format ([feature](serde) support presto compatible output format (#37039) #37253)
You can control the output format of complex types to be consistent with Presto by setting set serde_dialect="presto". This is useful for smoothly migrating Presto business operations.
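The effect of read_csv_empty_line_as_null described above can be modeled in a few lines. The sketch below is not the Doris CSV reader; `read_csv_rows` and its parameters are hypothetical names used only to show the two behaviors (skip empty lines by default, or emit an all-NULL row).

```python
import csv
import io
from typing import List, Optional


def read_csv_rows(
    text: str, num_columns: int, empty_line_as_null: bool = False
) -> List[List[Optional[str]]]:
    """Parse CSV text, treating empty lines per the flag.

    empty_line_as_null=False: empty lines are ignored (the default).
    empty_line_as_null=True:  each empty line becomes a row of all None.
    """
    rows: List[List[Optional[str]]] = []
    for raw in io.StringIO(text):
        line = raw.rstrip("\r\n")
        if line == "":
            if empty_line_as_null:
                rows.append([None] * num_columns)
            continue
        # Parse a single non-empty line with the standard csv module.
        rows.append(next(csv.reader([line])))
    return rows


if __name__ == "__main__":
    sample = "1,a\n\n2,b\n"
    print(read_csv_rows(sample, 2))
    print(read_csv_rows(sample, 2, empty_line_as_null=True))
```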
Multi Table Materialized View
- Support for using non-deterministic functions in building materialized views: [feature](mtmv) pick some mtmv pr from master #37651
- Support for atomically replacing the definition of asynchronous materialized views: [enhance](mtmv)support replace materialized view (#36749) #37147
- Support for viewing the creation statement of asynchronous materialized views via show create materialized view: [enhance](mtmv)show create materialized view (#36188) #37125
- Support for transparent rewriting of multi-dimensional aggregation queries: [feat](mtmv) Support grouping_sets rewrite when query rewrite by materialized view (#36056) #37436
- Support for transparent rewriting of aggregation queries using non-aggregate materialized views: [feature](mtmv) Support query rewrite by materialized view when query is aggregate and materialized view has no aggregate (#36278) #37497
- Support for transparent rewriting of DISTINCT aggregations in queries using key columns: [feature](mtmv) pick some mtmv pr from master #37651
- Support for partitioning materialized views to roll up partitions using date_trunc: [enhance](mtmv)Mtmv rollup #31812, [improvement](mtmv) Materialized view partition track supports date_trunc and optimize the fail reason #35562
- Support for partition TVF (Table-Valued Functions): [enhance](mtmv)support partition tvf #36479
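The date_trunc partition roll-up mentioned above amounts to mapping each fine-grained base-table partition to a coarser materialized view partition. The sketch below shows that idea for a day-to-month roll-up; it is a conceptual model only, and `date_trunc_month` and `roll_up_partitions` are hypothetical helpers, not Doris functions.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List


def date_trunc_month(d: date) -> date:
    """Truncate a date to the first day of its month."""
    return d.replace(day=1)


def roll_up_partitions(base_partitions: Iterable[date]) -> Dict[date, List[date]]:
    """Group daily base-table partitions under their monthly roll-up partition."""
    groups: Dict[date, List[date]] = defaultdict(list)
    for day in base_partitions:
        groups[date_trunc_month(day)].append(day)
    return dict(groups)


if __name__ == "__main__":
    days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 2, 1)]
    for month, members in roll_up_partitions(days).items():
        print(month, "<-", members)
```

With a mapping like this, a change to one daily partition only invalidates the single monthly partition it rolls up into.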
Semi-Structured Data Management
- Tables using the VARIANT type now support partial column updates: [Refactor](Variant) refactor flush logic to support partial update #34925
- PreparedStatement support is now enabled by default: [Feature](Prepared Statement) fix and enable enable_server_side_prepared_statement by default #36581
- The VARIANT type now supports export to CSV format: [Feature](Variant) support export csv format #37857
- Support for the explode_json_object function to transpose JSON Object rows into columns: [feature](json)support explode_json_object func #36887
- The ES Catalog now maps ES nested or object types to the Doris JSON type: [feature](ES Catalog) map nested/object type in ES to JSON type in Doris #37101
- By default, support_phrase is enabled for inverted indexes with specified analyzers to improve the performance of match_phrase series queries: [opt](inverted index) set support_phrase default true if parser is set #37949
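As a rough picture of what explode_json_object does, the sketch below expands a JSON object into one (key, value) row per top-level entry. This is an assumption about the output shape made for illustration; the Python function only borrows the name, and its handling of non-object input (returning no rows) is likewise assumed rather than taken from the release note.

```python
import json
from typing import List, Tuple


def explode_json_object(doc: str) -> List[Tuple[str, str]]:
    """Return one (key, json_value) row per top-level key of a JSON object."""
    parsed = json.loads(doc)
    if not isinstance(parsed, dict):
        # Assumption: anything that is not a JSON object yields no rows.
        return []
    # Values are re-serialized so each row carries a JSON-encoded value.
    return [(key, json.dumps(value)) for key, value in parsed.items()]


if __name__ == "__main__":
    for key, value in explode_json_object('{"a": 1, "b": "x"}'):
        print(key, value)
```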
Query optimizer
- Support for explaining DELETE FROM statements: [feat](nereids) support explain delete from clause #36782 #37100
- Support for hint form of constant expression parameters: [opt](Nereids) support no-key hint parameter (#37720) #37988
Memory Management
- Added an HTTP API to clear the cache. [api](cache) Add HTTP API to clear data cache #36599
Permissions
- Support for authorization of resources within Table-Valued Functions (TVFs): [fix](auth)support check priv when tvf use resource (#36928) #37132
Improvements
Lakehouse
- Upgraded Paimon to version 0.8.1
- Fixed an issue where querying Paimon tables sometimes resulted in a ClassNotFound error for org.apache.commons.lang.StringUtils ([bugfix](paimon)adding dependencies for clang #37512)
- Added support for Tencent Cloud LakeFS: [Improvement](multicatalog) support read tencent dlc table on lakefs #36891
- Optimized the timeout duration when fetching file lists for external table queries ([opt](split) add max wait time of getting splits #36842)
- Configurable via the session variable fetch_splits_max_wait_time_ms
- Improved default connection logic for SQLServer JDBC Catalog ([branch-2.1][improvement](sqlserver catalog) Configurable whether to use encrypt when connecting to SQL Server using the catalog #36971)
By default, Doris does not alter the connection encryption settings. Only when force_sqlserver_jdbc_encrypt_false is set to true is encrypt=false forcibly added to the JDBC URL to reduce authentication errors. This allows more flexible control over encryption behavior, so it can be turned on or off as needed.
- Added serde properties to the show create table statement for Hive tables ([chore](multi catalog) Print serde properties when show create hive-external-table (#34966) #37096)
- Changed the default cache time for Hive table lists on the FE from 1 day to 4 hours
- Data export (Export/Outfile) now supports specifying compression formats for Parquet and ORC
- When creating a table using CTAS+TVF, partition columns in the TVF are automatically mapped to Varchar(65533) instead of String, allowing them to be used as partition columns for internal tables ([fix](tvf) Partition columns in CTAS need to be compatible with the STRING type of external tables/TVF #37161)
- Optimized the number of metadata accesses for Hive write operations ([opt](hive) save hive table schema in transaction for 2.1 #37127)
- ES Catalog now supports mapping nested/object types to Doris's Json type ([feature](ES Catalog) map nested/object type in ES to JSON type in Doris (#37101) #37182)
- Improved error messages when connecting to Oracle using older versions of the ojdbc driver ([branch-2.1][improvement](jdbc catalog) Catch AbstractMethodError in getColumnValue Method and Suggest Updating to ojdbc8+ #37634)
- When Hudi tables return an empty set during Incremental Read, Doris now also returns an empty set instead of an error ([fix](split) remove retry when fetch split batch failed #37636)
- Fixed an issue where join queries between internal and external tables could lead to FE timeouts in some cases ([fix](fe) fix several blocking bugs #37756 #37757)
- Fixed an issue with FE metadata replay errors during upgrades from older versions to newer versions when the Hive metastore event listener is enabled ([fix](fe) fix several blocking bugs #37756 #37757)
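The SQL Server encryption behavior described in this list (leave the URL untouched unless force_sqlserver_jdbc_encrypt_false is true, then force encrypt=false) can be sketched as follows. This is an illustration, not the catalog's code: `apply_encrypt_override` is a hypothetical helper, and dropping a pre-existing encrypt property before appending the new one is an assumption made to avoid duplicates.

```python
def apply_encrypt_override(jdbc_url: str, force_encrypt_false: bool) -> str:
    """Return the JDBC URL, forcing encrypt=false only when the flag is set."""
    if not force_encrypt_false:
        # Default: the encryption settings in the URL are left as written.
        return jdbc_url
    # SQL Server JDBC URLs carry properties as ';'-separated key=value pairs.
    head, *props = jdbc_url.split(";")
    kept = [p for p in props if p and not p.lower().startswith("encrypt=")]
    return ";".join([head, *kept, "encrypt=false"])


if __name__ == "__main__":
    url = "jdbc:sqlserver://host:1433;databaseName=db"  # hypothetical URL
    print(apply_encrypt_override(url, False))
    print(apply_encrypt_override(url, True))
```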
Multi Table Materialized View
- Support for automatically selecting key columns when creating asynchronous materialized views: [fix](mtmv)mtmv support default key (#36221) #36601
- Asynchronous materialized view partition refresh now supports using the date_trunc function in definitions: [improvement](mtmv) Materialized view partition track supports date_trunc and optimize the fail reason #35562
- In nested materialized views, when the lower level hits a roll-up rewrite for aggregation, the upper level can now continue with transparent rewrites: [feature](mtmv) pick some mtmv pr from master #37651
- Asynchronous materialized views remain available when schema changes do not affect the correctness of their data: [enhance](mtmv)reduce the behavior of triggering the mtmv state to change to schema_change (#36513) #37122
- Improved planning speed for transparent rewrites: [improvement](mtmv) improve mv rewrite performance by reuse the shuttled expression (#37197) #37935
- When calculating the availability of asynchronous materialized views, the current refresh status is no longer taken into account: [enhance](mtmv)when calculating the availability of MTMV, no longer c… #36617
Semi-Structured Data Management
- Optimize DESC performance for viewing VARIANT sub-columns through sampling: [Optimize] Add session variable max_fetch_remote_schema_tablet_count to limit tablets size for remote schema fetch #37217
- Support for special JSON data with empty keys in the JSON type: [improve](json)improve json support empty keys #36762
Inverted Index
- Reduce latency by minimizing the invocation of inverted index exists to avoid delays in accessing object storage: [Improvement](inverted index) Remove the check for inverted index file exists #36945
- Optimize the overhead of the inverted index query process: [opt](inverted index) reduce generation of the rowid_result if not necessary #35357
- Do not create inverted indices in materialized views: [fix](inverted index)Make build index operation only affect base index #36869
Query optimizer
- When both sides of a comparison expression are literals, the string literal will attempt to convert to the type of the other side: [fix](Nereids) processCharacterLiteral even if both side are literal (#36729) #36921
- Refactored the sub-path pushdown functionality for the variant type, now better supporting complex pushdown scenarios: [refactor](variant) refactor sub path push down on variant type (#36478) #36923
- Optimized the logic for calculating the cost of materialized views, enabling more accurate selection of lower-cost materialized views: [pick](nereids) using mv's derived stats (#35721) #37098
- Improved the SQL cache planning speed when using user variables in SQL: [enhancement](nereids) speedup sql cache with variable (#37090) #37119
- Optimized the row estimation logic for NOT NULL expressions, resulting in better performance when NOT NULL is present in queries: [fix](nereids) derive column stats for 'expr and A is not null' (#37235) #37498
- Optimized the null rejection derivation logic for LIKE expressions: [fix](Nereids) fix fe fold constant failed when using like function #37864
- Improved error messages when querying a specific partition fails, making it clearer which table is causing the issue: [chore](Nereids) opt part not exists error msg in bind relation (#36792)(#37160) #37280
Query Execution
- Improved the performance of the bitmap_union operator by up to 3 times in certain scenarios.
- Enhanced the reading performance of Arrow Flight in ARM environments.
- Optimized the execution performance of the explode, explode_map, and explode_json functions.
Data Loading
- Support setting max_filter_ratio for INSERT INTO ... FROM TABLE VALUE FUNCTION
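The sketch below shows how a max_filter_ratio threshold is typically evaluated, assuming the ratio is filtered rows divided by total rows and that the load fails once the ratio exceeds the limit. `load_succeeds` is a hypothetical helper, and the default of 0.0 is an assumption rather than something stated in this note.

```python
def load_succeeds(total_rows: int, filtered_rows: int, max_filter_ratio: float = 0.0) -> bool:
    """Return True when the share of filtered rows stays within the limit."""
    if total_rows == 0:
        # Nothing was read, so nothing could be filtered.
        return True
    return filtered_rows / total_rows <= max_filter_ratio


if __name__ == "__main__":
    print(load_succeeds(1000, 5, max_filter_ratio=0.01))   # 0.5% filtered
    print(load_succeeds(1000, 20, max_filter_ratio=0.01))  # 2% filtered
```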
Bug fixes
Lakehouse
- Fixed an issue that caused BE crashes in some cases when querying Parquet format ([fix](multi-catalog) Revert #36575 and check nullptr of data column #37086)
- Fixed an issue where BE printed excessive logs when querying Parquet format ([fix](parquet) prevent parquet page reader print much warning logs #37012)
- Fixed an issue where the FE side created a large number of duplicate FileSystem objects in some cases ([bugfix](hive)Prevent multiple fs from being generated for 2.1 #37142)
- Fixed an issue where transaction information was not cleaned up after writing to Hive in some cases ([Fix](multi-catalog) Fix the transaction is not removed in abnormal situations by removing transaction in finally block. #37172)
- Fixed a thread leak issue caused by Hive table write operations in some cases ([bugfix]thread pool resource leak for 2.1 #36990 #37247)
- Fixed an issue where Hive Text format row and column delimiters could not be correctly obtained in some cases ([fix](hive) support find serde info from both tbl properties and serde properties (#37043) #37188)
- Fixed a concurrency issue when reading lz4 compressed blocks in some cases ([fix](HadoopLz4BlockCompression)Fixed the bug that HadoopLz4BlockCompression creates _decompressor every time it decompresses. #37187)
- Fixed an issue where count(*) on Iceberg tables returned incorrect results in some cases ([opt](iceberg)Add a new appearance to display the pushDown count for 2.1 (#37046) (#34928) #37810)
- Fixed an issue where creating a Paimon catalog based on MinIO caused FE metadata replay errors in some cases ([fix](multi-catalog)fix paimon meta properties convert #37249)
- Fixed an issue where using Ranger to create a catalog caused the client to hang in some cases ([fix](catalog)Fix internal program error causing client to get stuck #37551)
Multi Table Materialized View
- Fixed an issue where adding new partitions to the base table could lead to incorrect results after partition aggregation roll-up rewrites. [feature](mtmv) pick some mtmv pr from master #37651
- Fixed an issue where the materialized view partition status was not set to out-of-sync after deleting associated base table partitions. [fix](mtmv)fix when related table drop partition,mv partition is sync… #36602
- Fixed an occasional deadlock issue during asynchronous materialized view builds. [fix](mtmv)fix mtmv dead lock (#37009) #37133
- Fixed an occasional "nereids cost too much time" error when refreshing a large number of partitions in a single asynchronous materialized view refresh. [fix](mtmv)fix mtmv task nereids cost too much time #37589
- Fixed an issue where an asynchronous materialized view could not be created if the final select list contained a null literal. [fix](Nereids) null type in result set will be cast to tinyint (#37019) #37281
- Fixed an issue with single-table materialized views where, even though the aggregation materialized view was successfully rewritten, the CBO did not select it. [opt](nereids)using mv's derived stats #35721, [fix](mtmv) Mapping materialization statistics's expressionToColumnStats to mv scan plan based #36058
- Fixed an issue where partition derivation failed when building a partitioned materialized view with both join inputs being aggregations. [fix](mtmv) Fix getting related partition table wrongly when multi base partition table exists #34781
Semi-Structured Data Management
- Fixed issues with VARIANT in special cases such as concurrency and abnormal data. [Fix](Variant) fix potential heap use after free when concurrently flush segments on one tablet #37976, [Refactor](Variant) should not call finalize in const functions #37839, [Fix](Variant) handle scalar variant with none string root #37794, [Fix](Variant) ensure variant column finalized before reading the root column #37674, [Fix](variant) ignore serialization of nothing type #36997
- Fixed coredump issues when using VARIANT in unsupported SQL. [Refactor](Variant) make many insterfaces exception safe #37640
- Fixed coredump issues related to MAP data type when upgrading from 1.x to 2.x or higher versions. [fix](map)fix upgrade behavior from 1.2 version #36937
- Improved ES Catalog support for Array types. [fix](ES Catalog)Add array types support in esquery function #36936
Inverted Index
- Fixed an issue where DROP INDEX for Inverted Index v2 did not delete metadata. [fix](build index)Remove index_meta in tablet schema when the index is dropped. #37646
- Fixed query accuracy issues when string length exceeded the "ignore above" threshold. [fix] (inverted index) fix query errors caused by ignore_above #37679
- Fixed issues with index size statistics. [fix] (inverted index ) Fix the incorrect index size during compaction #37232, [fix] (index compaction) Fix inverted index file size #37564
Query optimizer
- Fixed an issue that prevented import operations from executing due to the use of reserved keywords. [fix](keyword) let some keyword be non-reserved between old parser and new parser #35938
- Fixed a type error where char(255) was incorrectly recorded as char(1) when creating a table. [Fix](planner) fix bug of char(255) toSql (#37340) #37671
- Fixed incorrect results when the join expression in a correlated subquery was a complex expression. [fix](nereids)subquery unnesting get wrong result if correlated conjuncts is not slot_a = slot_b #37683
- Fixed a potential issue with incorrect bucket pruning for decimal types. [fix](Nereids) tablet prune wrong when decimal value scale is nagtive (#37889) #38013
- Fixed incorrect aggregation operator results when pipeline local shuffle was enabled in certain scenarios. [fix](nereids) fix aggr node colocate flag local shuffle depends on #38016
- Fixed planning errors that could occur when equal expressions existed in aggregation operators. [Fix](nereids) fix NormalizeAgg, change the upper project projections rewrite logic (#36161) #36622
- Fixed planning errors that could occur when lambda expressions were present in aggregation operators. [fix](Nereids) normalize aggregate should not push down lambda's param (#37109) #37285
- Fixed an issue where the literal produced when a window function was simplified to a constant had the wrong data type, preventing execution. [fix](Nereids) simplify window expression should inherit data type (#37061) #37283
- Fixed an issue with the null attribute being incorrectly output by the aggregate function foreach combinator. [fix](nereids)fix nullable property of ForEachCombinator #37980
- Fixed an issue where the acos function could not be planned when its parameter was a literal out of range. [fix](nereids)acos function should return null literal instead of NaN value #37996
- Fixed planning errors when specifying partitions for a query on a synchronized materialized view. [fix](statistics)Fix select mv with specified partitions bug. (#36817) #36982
- Fixed occasional Null Pointer Exceptions (NPEs) during planning. [fix](nereids) bug: after is-null stats derive, other column stats are dropped (#37809) #38024
Query Execution
- Fixed an error in delete where statements when using decimal data types as conditions. [fix](delete) Incorrect precision detection for the decimal type in condition. #37801
- Fixed an issue where BE memory was not released after query execution ended. [Bug](join) fix broadcast join running when hash table build not finished #37792, [pipeline](fix) Set upstream operators always runnable once source operator closed #37297
- Fixed a problem where audit logs occupied too much FE memory under high QPS scenarios. [Fix]Add audit log event queue size limit #37786
- Fixed BE core dumps when the sleep function received illegal input values. [fix](sleep) sleep with character const make be crash #37681
- Fixed timeout errors encountered during the runtime filter sync filter size RPC. [Chore](runtime-filter) enlarge sync filter size rpc timeout limit #37103
- Fixed incorrect results when using time zones during execution. [Refactor](timezone) refactor tzdata load to accelerate and unify timezone parsing #37062
- Fixed incorrect results when casting strings to integers. [Bug](cast) fix cast string to int return wrong result #36788
- Fixed query errors when using the Arrow Flight protocol with pipelinex enabled. [fix](arrow-flight-sql) Fix pipelineX Unknown result sink type #35804
- Fixed errors when casting strings to dates/datetimes. [Bug](function) Fix function for cast string as date/datetime #35637
- Fixed BE core dumps during large table join queries using <=>. [fix](null safe equal join) fix coredump if both sides of the conjunct is not nullable #36263
Storage Management
- Fixed the issue of invisible DELETE SIGN data encountered during column update and write operations ([branch-2.1](cherry-pick) partial update should not read old fileds from rows with delete sign (#36210) #36755)
- Optimized FE's memory usage during schema changes ([fix](schema change) reduce memory usage in schema change process #30231 #36285 #33073 #36756)
- Fixed the issue where BE would hang during restart due to transactions not being aborted ([fix](txn) Fix coordidator be restart not abort txn #35342 #36437)
- Fixed occasional errors when changing from NOT NULL to NULL data types ([fix](schema-change) Fix schema-change from non-null to null #36389)
- Optimized replica repair scheduling when BE goes down ([improvement](clone) dead be will abort sched task #36795 #36897)
- Supported round-robin disk selection for tablet creation on a single BE ([improvement](balance) partition rebalance chose disk by rr #36826 #36900)
- Fixed query error -230 caused by slow publishing ([improvement](compaction) be do not compact invisible version to avoid query error -230 #28082 #36222)
- Improved the speed of partition balancing ([improvement](partition rebalance) improve partition rebalance choose candidate speed #36509 #36976)
- Controlled segment cache using the number of file descriptors (FDs) and memory to avoid FD exhaustion ([improvement](segmentcache) limit segment cache by memory or segment … #37035)
- Fixed potential replica loss caused by concurrent clone and alter operations ([fix](clone) Fix clone and alter tablet use same tablet path #34889 #36858)
- Fixed the issue of not being able to adjust column order ([branch-2.1] Picks "[Fix](schema change) Fix can't do reorder column schema change for MOW table and duplicate key table #37067" #37226)
- Prohibited certain schema change operations on auto-increment columns ([branch-2.1] Picks "[opt](autoinc) Forbid some schema change when the table has auto-increment column #37186" #37331)
- Fixed inaccurate error reporting for DELETE operations ([branch-2.1] Picks "[Fix](delete) Fix delete job timeout when executing delete from ... #37363" #37374)
- Adjusted the trash expiration time on BE side to one day ([enhancement](trash) support skip trash, update trash default expire time (#37170) #37409)
- Optimized compaction memory usage and scheduling ([enhancement](compaction) adjust compaction concurrency based on compaction score and workload #37491, #37496)
- Checked for potential oversized backups causing FE restarts ([fix](fe) Add check editlog size mechanism for backupJob (#35653) #37466)
- Restored dynamic partition deletion policies and cross-partition behaviors to 2.1.3 ([fix](dynamic partition) drop partition exclude history_partition_num #37539 #37570, [fix](create table) create table fail not write drop table editlog #37488 #37506, 37964)
- Fixed errors related to decimal types in DELETE predicates ([fix](delete) fix the error message for valid decimal data for 2.1 #37710)
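Round-robin disk selection, as mentioned for tablet creation on a single BE, can be pictured with the minimal sketch below. It only illustrates the rotation idea; `RoundRobinDiskPicker` and the disk paths are hypothetical, and the real selection logic may weigh other factors.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinDiskPicker:
    """Hand out disks in a fixed rotation so new tablets spread evenly."""

    def __init__(self, disks: Iterable[str]) -> None:
        disk_list = list(disks)
        if not disk_list:
            raise ValueError("at least one disk is required")
        self._rotation = cycle(disk_list)

    def next_disk(self) -> str:
        """Return the next disk in the rotation."""
        return next(self._rotation)


if __name__ == "__main__":
    picker = RoundRobinDiskPicker(["/data1", "/data2", "/data3"])  # hypothetical paths
    print([picker.next_disk() for _ in range(5)])
```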
Data Loading
- Fixed data invisibility issues caused by race conditions in error handling during imports ([Pick 2.1] "Fix data loss when node channel been cancelled before close wait (#36662)" #36744, 37527, 37536)
- Added support for hll_from_base64 in Stream Load imports ([improvement](stream load)(cherry-pick) support hll_from_base64 for stream load column mapping #36819)
- Fixed potential FE OOM issues when importing very large numbers of tablets for a single table ([fix](oom) avoid oom when a lot of tablets fail on load #36944)
- Fixed possible auto-increment column duplication during FE master-slave switchovers ([fix](autoinc) avoid duplicated auto inc when role of fe changes #36961)
- Fixed errors in INSERT INTO ... SELECT statements involving auto-increment columns ([branch-2.1] PIck "[Fix](autoinc) Hanlde the processing of auto_increment column on exchange node rather than on TabletWriter when using TABLET_SINK_SHUFFLE_PARTITIONED #36836" #37029)
- Reduced the number of data flush threads to optimize memory usage ([pick]reset memtable flush thread num #37092)
- Improved automatic recovery and error messaging for routineload tasks ([branch-2.1](routine-load) add retry when get Kafka meta info #37371, 37372, 37373, 37391)
- Increased the default batch size for routineload ([branch-2.1](routine-load) increase routine load job default max batch size and rows #37388)
- Fixed routineload task stoppage due to Kafka EOF expiration ([fix](routine-load) fix routine load pause when Kafka data deleted after TTL (#37288) #37983)
- Fixed coredump issues in multi-table streaming ([branch-2.1](move-memtable) fix move memtable core when use multi table load #37370)
- Fixed premature backpressure caused by inaccurate memory estimation in groupcommit ([chery-pick](branch-2.1) Pick "[Fix](group commit) Fix group commit block queue mem estimate fault" #37379)
- Optimized BE-side thread usage in groupcommit ([cherry-pick](branch-2.1) Pick "Use async group commit rpc call (#36499)" #37380)
- Fixed the issue of no error URL when data was not partitioned ([branch-2.1](load) fix no error url if no partition can be found (#36831) #37401)
- Fixed potential memory misoperations during imports ([fix](load) fix memtable agg functions (#38017) #38021, 37939)
Merge on Write Unique Key
- Reduced memory usage during compaction for primary key tables ([pick21][opt](mow) reduce memory usage for mow table compaction #36968)
- Fixed potential duplicate data issues when primary key replica cloning fails ([fix](merge-on-write) when full clone failed, duplicate key might occur (#37001) #37229)
Permissions
- Fixed the issue of missing authorization when a table-valued function references a resource. ([fix](auth)support check priv when tvf use resource (#36928) #37132)
- Fixed the issue where the SHOW ROLE statement did not include workload group permissions. [Fix]Fix show role stmt missing grouo info #36032
- Fixed the issue where executing two statements simultaneously when creating a row policy could cause FE to fail to restart. ([fix](auth)fix fe can not restart when replay create row policy log #37342)
- Fixed the issue where, in some cases, upgrading from an older version could result in FE metadata replay failures due to row policies. ([fix](auth)fix fe can not restart when replay create row policy log #37342)
Others
- Fixed the issue of compute nodes participating in internal table creation. ([Fix](InternalSchema) Compute nodes should not be used for Internal schema three replica (#36130) #37961)
- Fixed the read lag issue when enable_strong_read_consistency is set to true. ([fix](readconsistency) avoid table not exist error (#37593) #37641)