-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[WIP](dynamic-table) support dynamic schema table #16335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
994d87e to
9c1b0eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 40. Check the log or trigger a new build to see more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
260b12d to
a32c1d5
Compare
|
TeamCity pipeline, clickbench performance test result: |
|
DOCS and regression test will be added in next PR |
a32c1d5 to
e2b9500
Compare
xiaokang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fe code reviewed
| distribution, tblProperties, extProperties, tableComment, index, false); | ||
| :} | ||
| | KW_CREATE opt_external:isExternal KW_TABLE opt_if_not_exists:ifNotExists table_name:name | ||
| LPAREN column_definition_list:columns COMMA index_definition_list:indexes COMMA DOTDOTDOT RPAREN opt_engine:engineName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move ... just after column_definition_list to keep consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it's a little bit difficult to get the bool value from column_definition_list, it's simple to just add a DOTDOTDOT token after column_definition_list
| } | ||
| if (defaultValue.isSet && defaultValue != DefaultValue.NULL_DEFAULT_VALUE) { | ||
| throw new AnalysisException("Array type column default value only support null"); | ||
| if (defaultValue.isSet && defaultValue != DefaultValue.NULL_DEFAULT_VALUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cambyzju can you check the logic for default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's ok for array type. null for NULL(Array), [] for NOT NULL(Array).
|
|
||
| private float avgSerializedSize; // in bytes; includes serialization overhead | ||
|
|
||
| private int tableId = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's purpose to assing tableId in TupleDescriptor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema rpc need tableId to identify a specific table
| try { | ||
| if (!env.isMaster()) { | ||
| status.setStatusCode(TStatusCode.ILLEGAL_STATE); | ||
| status.addToErrorMsgs("retry rpc request to master."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not forward to master fe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let the backend retry could be ok
| } | ||
| // ignore this column | ||
| if (hasSameNameColumn) { | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to process write if the type is not the same as the existed column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original will be casted to the existed
| } | ||
| } | ||
|
|
||
| // add a implict container column "__dynamic__" for dynamic columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DORIS_DYNAMIC_COL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| } | ||
| } | ||
|
|
||
| if (table.isDynamicSchema()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment to explain the different to Load.initColumns()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for s3 load etc..
| } | ||
| } | ||
|
|
||
| if (destTable.isDynamicSchema()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment to explain difference with Load.initColumns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| // prepare columnDefs | ||
| for (TColumnDef tColumnDef : addColumns) { | ||
| if (request.isTypeConflictFree()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name 'isTypeConflictFree' is not so intuitive. 'allowTypeConflict' may be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
xiaokang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fe common code reviewed
| int64_t newest_write_timestamp() const { return _rowset_meta_pb.newest_write_timestamp(); } | ||
|
|
||
| void set_tablet_schema(const TabletSchemaSPtr& tablet_schema) { | ||
| DCHECK(_schema == nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete DCHECK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_tablet_schema could override original _schema
|
|
||
| // send an empty add columns rpc, the rpc response will fill with base schema info | ||
| // maybe we could seperate this rpc from add columns rpc | ||
| Status send_fetch_full_base_schema_view_rpc(FullBaseSchemaView* schema_view) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is misleading, since it's not just send rpc but also get result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fetch could describe get result
| auto target_block = input_block->copy_block(_column_offset); | ||
| vectorized::Block target_block = *input_block; | ||
| // maybe rollup tablet, dynamic table's tablet need full columns | ||
| if (!_tablet_schema->is_dynamic_schema()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment for why not copy_block for dynamic schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| std::move(column_ptr), slot_desc->get_data_type_ptr(), slot_desc->col_name())); | ||
| } | ||
|
|
||
| // handle dynamic generated columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does new scanners still use BaseScanner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no new load scan will use vfile_scanner
| _rpc_timeout_ms = state->query_options().query_timeout * 1000; | ||
| _timeout_watch.start(); | ||
|
|
||
| _cur_mutable_block.reset(new vectorized::MutableBlock({_tuple_desc})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why delete it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need handle dynamic block
be/src/vec/core/block.cpp
Outdated
| } | ||
|
|
||
| void MutableBlock::add_rows(const Block* block, size_t row_begin, size_t length) { | ||
| if (_type == BlockType::DYNAMIC) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be more efficient to add UNLIKELY since most tables are without dynamic schema.
|
|
||
| #include <iostream> | ||
|
|
||
| #include "vec/common/string_buffer.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just make compiler happy
| DataTypePtr get_return_type_impl(const DataTypes& arguments) const override { | ||
| return get_least_supertype({arguments[1], arguments[2]}); | ||
| DataTypePtr type = nullptr; | ||
| get_least_supertype(DataTypes {arguments[1], arguments[2]}, &type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not keep return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type == null means encounter conflict
| return column.template index_impl<UInt16>(*data_uint16, limit); | ||
| } else if (auto* data_uint32 = detail::get_indexes_data<UInt32>(indexes)) { | ||
| return column.template index_impl<UInt32>(*data_uint32, limit); | ||
| } else if (auto* data_uint64 = detail::get_indexes_data<UInt64>(indexes)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the method suitable for Int8/16/32, Int/UInt128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no index column could only be usigned integers
| * Examples: least common supertype for UInt8, Int8 - Int16. | ||
| * Examples: there is no least common supertype for Array(UInt8), Int8. | ||
| */ | ||
| DataTypePtr get_least_supertype(const DataTypes& types); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should keep original version for backward compatability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need for backward compatability?
e2b9500 to
0023cc1
Compare
gensrc/proto/types.proto
Outdated
| FIXEDLENGTHOBJECT = 30; | ||
| JSONB = 31; | ||
| DECIMAL128I = 32; | ||
| VARIANT = 33; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent
| distribution, tblProperties, extProperties, tableComment, index, false); | ||
| :} | ||
| | KW_CREATE opt_external:isExternal KW_TABLE opt_if_not_exists:ifNotExists table_name:name | ||
| LPAREN column_definition_list:columns COMMA index_definition_list:indexes COMMA DOTDOTDOT RPAREN opt_engine:engineName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it's a little bit difficult to get the bool value from column_definition_list, it's simple to just add a DOTDOTDOT token after column_definition_list
| std::move(column_ptr), slot_desc->get_data_type_ptr(), slot_desc->col_name())); | ||
| } | ||
|
|
||
| // handle dynamic generated columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no new load scan will use vfile_scanner
| auto target_block = input_block->copy_block(_column_offset); | ||
| vectorized::Block target_block = *input_block; | ||
| // maybe rollup tablet, dynamic table's tablet need full columns | ||
| if (!_tablet_schema->is_dynamic_schema()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| int64_t newest_write_timestamp() const { return _rowset_meta_pb.newest_write_timestamp(); } | ||
|
|
||
| void set_tablet_schema(const TabletSchemaSPtr& tablet_schema) { | ||
| DCHECK(_schema == nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_tablet_schema could override original _schema
| try { | ||
| if (!env.isMaster()) { | ||
| status.setStatusCode(TStatusCode.ILLEGAL_STATE); | ||
| status.addToErrorMsgs("retry rpc request to master."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let the backend retry could be ok
| } | ||
| // ignore this column | ||
| if (hasSameNameColumn) { | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original will be casted to the existed
| } | ||
|
|
||
| if (!queryMode && !columnDefs.isEmpty()) { | ||
| //3.create AddColumnsClause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| // prepare columnDefs | ||
| for (TColumnDef tColumnDef : addColumns) { | ||
| if (request.isTypeConflictFree()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
0023cc1 to
601f1a8
Compare
xiaokang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by anyone and no changes requested. |
a8cbdb3 to
894478f
Compare
dbbedac to
0efbff0
Compare
…t code for ArrayType
1. Load plan, inject variant type in the tuple descriptor end slot 2. VJsonScanner, parse json docs to dynamic column and extract columns 3. BaseScanner _materialize_dest_block 4. VBrokerScannode, adapt variant Block, since block schema may change during scan 5. VTabletSink, adapt variant Block, since block schema may change during sink
* [fix](test) fix dynamic_table regression case * [improvement](inverted index) when excute compound predicate logic, check predicates which not apply by index
corret tablet should be used
…he#1331) * (improvement)[dynamic-table] support load in new_load_scan_node * [bugfix](thirdparty) patch simdjson to avoid conflict with odbc macro BOOL (apache#15223) fix conflit name BOOL in odbc sqltypes.h and simdjson element.h. Change BOOL to BOOLEAN in simdjson. Co-authored-by: Kang <kxiao.tiger@gmail.com>
…ache#1361) * [improvement](dynamic-table) refine some logic 1. support filter malformed json lines 2. refactor block align logic
0efbff0 to
37d7134
Compare
| 1: required Status.TStatus status | ||
| 2: required i64 table_id | ||
| 3: required list<Descriptors.TColumn> allColumns | ||
| 4: required i32 schema_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use optional
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with 2 TODO:
- use
optionalfor all proto and thrift field - Add document for new configs
Issue Number: close apache#16351 Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
* [WIP](dynamic-table) support dynamic schema table (apache#16335) Issue Number: close apache#16351 Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically. * [improve](dynamic-table) change `addColumns` RPC interface fields from `required` to `optional` and and config doc (apache#16632) * [doc](dynamic-table) Add docs for dynamic-table (apache#16669) * [improve](dynamic table) refine SegmentWriter columns writer generate (apache#16816) * [improve](dynamic table) refine SegmentWriter columns writer generate ``` Dynamic Block consists of two parts, dynamic part of columns and static part of columns static dynamic | ----- | ------- | the static ones are original _tablet_schame columns the dynamic ones are auto generated and extended from file scan. ``` **We should only consisder to use Block info to generte columns when it's a dynamic table load procudure.** And seperate the static ones and dynamic ones * test * [typo](docs)fix dynamic Table version label (apache#16895) * [Feature](Dynamic schema table) step1 support schema change expression (apache#17494) 1. introduce a new type `VARIANT` to encapsulate dynamic generated columns for hidding the detail of types and names of newly generated columns 2. introduce a new expression `SchemaChangeExpr` for doing schema change for extensibility * Fix compile * [Bug](dynamic-table) Fix column alignment logic and support filtering null values when slot is not null Before this PR when encountering null values with some columns which is specified as `NOT NULL`, null values will not be filtered,thi behavior does not match with the original load behavior. Second column alignment logic has bug : ``` template <typename ColumnInserterFn> void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt, ColumnInserterFn inserter) { CHECK(dst.is_finalized() && src.is_finalized()); // Use rows() here instead of size(), since size() will check_consistency // but we could not check_consistency since num_rows will be upgraded even // if src and dst is empty, we just increase the num_rows of dst and fill // num_rows of default values when meet new data size_t num_rows = dst.rows(); ``` --------- Co-authored-by: jiafeng.zhang <zhangjf1@gmail.com>
Proposed changes
Issue Number: close #16351
Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
Problem summary
Describe your changes.
Checklist(Required)
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...