@eldenmoon (Member) commented Feb 1, 2023

Proposed changes

Issue Number: close #16351

A dynamic schema table is a special type of table whose schema changes as data is loaded. This feature is implemented mainly for semi-structured data such as JSON: because JSON is self-describing, we can extract schema information from the original documents and infer the final type information. Such a table reduces manual schema change operations, makes it easy to import semi-structured data, and extends its schema automatically.

@github-actions github-actions bot added area/load Issues or PRs related to all kinds of load area/planner Issues or PRs related to the query planner area/sql/function Issues or PRs related to the SQL functions area/vectorization labels Feb 1, 2023
Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 40. Check the log or trigger a new build to see more.

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@hello-stephen
Contributor

hello-stephen commented Feb 1, 2023

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 35.61 seconds
stream load tsv: 477 seconds loaded 74807831229 Bytes, about 149 MB/s
stream load json: 40 seconds loaded 2358488459 Bytes, about 56 MB/s
stream load orc: 69 seconds loaded 1101869774 Bytes, about 15 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230210125225_clickbench_pr_94072.html

@eldenmoon
Member Author

Docs and regression tests will be added in the next PR.

@eldenmoon eldenmoon marked this pull request as ready for review February 2, 2023 04:17
Contributor

@xiaokang xiaokang left a comment

FE code reviewed

distribution, tblProperties, extProperties, tableComment, index, false);
:}
| KW_CREATE opt_external:isExternal KW_TABLE opt_if_not_exists:ifNotExists table_name:name
LPAREN column_definition_list:columns COMMA index_definition_list:indexes COMMA DOTDOTDOT RPAREN opt_engine:engineName
Contributor

Can we move `...` to just after column_definition_list, to keep it consistent?

Member Author

But it's a little difficult to get the bool value from column_definition_list; it's simpler to just add a DOTDOTDOT token after column_definition_list.

}
if (defaultValue.isSet && defaultValue != DefaultValue.NULL_DEFAULT_VALUE) {
throw new AnalysisException("Array type column default value only support null");
if (defaultValue.isSet && defaultValue != DefaultValue.NULL_DEFAULT_VALUE
Contributor

@cambyzju can you check the logic for the default value?

Contributor

It's OK for array type: null for NULL(Array), [] for NOT NULL(Array).


private float avgSerializedSize; // in bytes; includes serialization overhead

private int tableId = -1;
Contributor

What's the purpose of assigning tableId in TupleDescriptor?

Member Author

The schema RPC needs tableId to identify a specific table.

try {
if (!env.isMaster()) {
status.setStatusCode(TStatusCode.ILLEGAL_STATE);
status.addToErrorMsgs("retry rpc request to master.");
Contributor

Why not forward to the master FE?

Member Author

Letting the backend retry should be OK.
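
The exchange above settles on having the backend retry when a non-master FE answers with ILLEGAL_STATE. A minimal C++ sketch of such a retry loop might look like the following. `RpcStatus`, `call_with_retry`, and the attempt limit are hypothetical names used only for illustration; the backend's actual retry policy is not shown in this thread.

```cpp
#include <functional>

// Hypothetical status codes; only ILLEGAL_STATE mirrors the FE snippet above.
enum class RpcStatus { OK, ILLEGAL_STATE, OTHER_ERROR };

// Calls `rpc` until it stops returning ILLEGAL_STATE or `max_attempts` is reached.
// `rpc` receives the zero-based attempt index, so a caller could pick another FE.
// Returns the status of the last attempt (ILLEGAL_STATE if no attempt was made).
inline RpcStatus call_with_retry(const std::function<RpcStatus(int)>& rpc, int max_attempts) {
    RpcStatus status = RpcStatus::ILLEGAL_STATE;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = rpc(attempt);
        if (status != RpcStatus::ILLEGAL_STATE) {
            break;  // success, or an error that retrying will not fix
        }
    }
    return status;
}
```

Only ILLEGAL_STATE triggers another attempt in this sketch; any other result is returned to the caller immediately.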

}
// ignore this column
if (hasSameNameColumn) {
continue;
Contributor

How is a write processed if the type is not the same as the existing column's?

Member Author

The original will be cast to the existing type.
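
The rule discussed above (skip adding a column whose name already exists, and cast incoming data to the existing type) can be sketched as follows. This is an illustrative C++ model only: `ColumnDef`, `ColumnPlan`, and `plan_columns` are hypothetical names, types are plain strings, and the real FE logic is written in Java and is not reproduced here.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical column description: a name plus a type name such as "BIGINT".
struct ColumnDef {
    std::string name;
    std::string type;
};

// Result of reconciling incoming columns with the table's existing schema.
struct ColumnPlan {
    std::vector<ColumnDef> to_add;   // new names: these extend the schema
    std::vector<ColumnDef> to_cast;  // existing names with another type; `type` is the cast target
};

inline ColumnPlan plan_columns(const std::map<std::string, std::string>& existing,
                               const std::vector<ColumnDef>& incoming) {
    ColumnPlan plan;
    for (const ColumnDef& col : incoming) {
        auto it = existing.find(col.name);
        if (it == existing.end()) {
            plan.to_add.push_back(col);
        } else if (it->second != col.type) {
            // Same name, different type: keep the existing column and cast to its type.
            plan.to_cast.push_back({col.name, it->second});
        }
        // Same name and same type: nothing to do.
    }
    return plan;
}
```

For example, with an existing schema `{id: BIGINT, name: STRING}` and incoming columns `id: INT`, `name: STRING`, `tags: ARRAY<STRING>`, the sketch adds `tags` and schedules `id` for a cast to `BIGINT`.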

}
}

// add a implict container column "__dynamic__" for dynamic columns
Contributor

DORIS_DYNAMIC_COL

Member Author

done

}
}

if (table.isDynamicSchema()) {
Contributor

Add a comment to explain the difference from Load.initColumns().

Member Author

For S3 load, etc.

}
}

if (destTable.isDynamicSchema()) {
Contributor

Add a comment to explain the difference from Load.initColumns.

Member Author

done


// prepare columnDefs
for (TColumnDef tColumnDef : addColumns) {
if (request.isTypeConflictFree()) {
Contributor

The name 'isTypeConflictFree' is not very intuitive; 'allowTypeConflict' may be better.

Member Author

ok

Contributor

@xiaokang xiaokang left a comment

FE common code reviewed

int64_t newest_write_timestamp() const { return _rowset_meta_pb.newest_write_timestamp(); }

void set_tablet_schema(const TabletSchemaSPtr& tablet_schema) {
DCHECK(_schema == nullptr);
Contributor

Why delete the DCHECK?

Member Author

set_tablet_schema may override the original _schema.


// send an empty add columns rpc, the rpc response will fill with base schema info
// maybe we could seperate this rpc from add columns rpc
Status send_fetch_full_base_schema_view_rpc(FullBaseSchemaView* schema_view) {
Contributor

The name is misleading, since it not only sends the RPC but also gets the result.

Member Author

"fetch" already implies getting the result.

auto target_block = input_block->copy_block(_column_offset);
vectorized::Block target_block = *input_block;
// maybe rollup tablet, dynamic table's tablet need full columns
if (!_tablet_schema->is_dynamic_schema()) {
Contributor

Add a comment on why copy_block is not used for dynamic schema.

Member Author

done

std::move(column_ptr), slot_desc->get_data_type_ptr(), slot_desc->col_name()));
}

// handle dynamic generated columns
Contributor

Do the new scanners still use BaseScanner?

Member Author

No, the new load scan will use vfile_scanner.

_rpc_timeout_ms = state->query_options().query_timeout * 1000;
_timeout_watch.start();

_cur_mutable_block.reset(new vectorized::MutableBlock({_tuple_desc}));
Contributor

Why delete it?

Member Author

It needs to handle dynamic blocks.

}

void MutableBlock::add_rows(const Block* block, size_t row_begin, size_t length) {
if (_type == BlockType::DYNAMIC) {
Contributor

It may be more efficient to add UNLIKELY, since most tables do not have a dynamic schema.
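
As a rough illustration of the suggestion above, a branch hint on the rarely taken dynamic path could look like this. `SKETCH_UNLIKELY` and `RowCounter` are hypothetical stand-ins, not the project's own macro or classes; `__builtin_expect` is the GCC/Clang builtin that such macros are commonly built on.

```cpp
#include <cstddef>

// Hypothetical stand-in for a branch-prediction macro.
// GCC and Clang provide __builtin_expect; other compilers fall back to the plain expression.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define SKETCH_UNLIKELY(expr) (expr)
#endif

enum class BlockType { NORMAL, DYNAMIC };

// Toy model of a block that counts rows per code path instead of copying data.
struct RowCounter {
    BlockType type = BlockType::NORMAL;
    std::size_t fast_path_rows = 0;
    std::size_t dynamic_path_rows = 0;

    void add_rows(std::size_t length) {
        // Hint that the dynamic branch is rare; the result is the same either way.
        if (SKETCH_UNLIKELY(type == BlockType::DYNAMIC)) {
            dynamic_path_rows += length;
            return;
        }
        fast_path_rows += length;
    }
};
```

The hint only influences how the compiler lays out the branches; it never changes which branch runs.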


#include <iostream>

#include "vec/common/string_buffer.hpp"
Contributor

Unused include?

Member Author

Just to make the compiler happy.

DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
return get_least_supertype({arguments[1], arguments[2]});
DataTypePtr type = nullptr;
get_least_supertype(DataTypes {arguments[1], arguments[2]}, &type);
Contributor

Why not keep the return value?

Member Author

`type == null` means a conflict was encountered.
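
To illustrate the out-parameter convention described above, here is a small C++17 sketch in which an empty result signals a type conflict. It models only fixed-width integers up to 64 bits and treats any non-integer input as a conflict; `SimpleType` and `least_supertype` are hypothetical and far simpler than Doris's `DataTypePtr` and `get_least_supertype`.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical, simplified type descriptor (not Doris's DataTypePtr).
struct SimpleType {
    enum class Kind { Integer, Array };
    Kind kind;
    bool is_signed;  // meaningful for Integer only
    int bits;        // 8, 16, 32 or 64; meaningful for Integer only
};

// On success, fills *out and returns true.
// On conflict (or empty input), leaves *out empty and returns false.
inline bool least_supertype(const std::vector<SimpleType>& types,
                            std::optional<SimpleType>* out) {
    out->reset();
    if (types.empty()) return false;
    bool any_signed = false;
    int max_signed_bits = 0;
    int max_unsigned_bits = 0;
    for (const SimpleType& t : types) {
        if (t.kind != SimpleType::Kind::Integer) return false;  // e.g. Array vs Int8
        if (t.is_signed) {
            any_signed = true;
            max_signed_bits = std::max(max_signed_bits, t.bits);
        } else {
            max_unsigned_bits = std::max(max_unsigned_bits, t.bits);
        }
    }
    // Mixing signed and unsigned: an unsigned N-bit value needs a signed type of 2N bits.
    int bits = any_signed ? std::max(max_signed_bits, 2 * max_unsigned_bits)
                          : max_unsigned_bits;
    if (bits > 64) return false;  // this sketch has no 128-bit type
    *out = SimpleType{SimpleType::Kind::Integer, any_signed, bits};
    return true;
}
```

With this sketch, UInt8 and Int8 resolve to a signed 16-bit integer, matching the Int16 example in the header comment quoted further down, while Array(UInt8) and Int8 leave the result empty.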

return column.template index_impl<UInt16>(*data_uint16, limit);
} else if (auto* data_uint32 = detail::get_indexes_data<UInt32>(indexes)) {
return column.template index_impl<UInt32>(*data_uint32, limit);
} else if (auto* data_uint64 = detail::get_indexes_data<UInt64>(indexes)) {
Contributor

Is the method suitable for Int8/16/32 and Int/UInt128?

Member Author

No, index columns can only be unsigned integers.

* Examples: least common supertype for UInt8, Int8 - Int16.
* Examples: there is no least common supertype for Array(UInt8), Int8.
*/
DataTypePtr get_least_supertype(const DataTypes& types);
Contributor

The original version should be kept for backward compatibility.

Member Author

I think there is no need for backward compatibility?

FIXEDLENGTHOBJECT = 30;
JSONB = 31;
DECIMAL128I = 32;
VARIANT = 33;
Member Author

indent

}

if (!queryMode && !columnDefs.isEmpty()) {
//3.create AddColumnsClause
Member Author

done


xiaokang previously approved these changes Feb 7, 2023
Contributor

@xiaokang xiaokang left a comment

LGTM

@github-actions
Contributor

github-actions bot commented Feb 7, 2023

PR approved by anyone and no changes requested.

@eldenmoon eldenmoon force-pushed the dynamic-table branch 2 times, most recently from a8cbdb3 to 894478f Compare February 8, 2023 05:12
xiaokang previously approved these changes Feb 8, 2023
@eldenmoon eldenmoon force-pushed the dynamic-table branch 4 times, most recently from dbbedac to 0efbff0 Compare February 10, 2023 08:40
eldenmoon and others added 24 commits February 10, 2023 17:59
1. Load plan, inject variant type in the tuple descriptor end slot
2. VJsonScanner, parse json docs to dynamic column and extract columns
3. BaseScanner _materialize_dest_block
4. VBrokerScannode, adapt variant Block, since block schema may change during scan
5. VTabletSink, adapt variant Block, since block schema may change during sink
* [fix](test) fix dynamic_table regression case

* [improvement](inverted index) when excute compound predicate logic, check predicates which not apply by index
…he#1331)

* (improvement)[dynamic-table] support load in new_load_scan_node

* [bugfix](thirdparty) patch simdjson to avoid conflict with odbc macro BOOL (apache#15223)

fix conflit name BOOL in odbc sqltypes.h and simdjson element.h. Change BOOL to BOOLEAN in simdjson.

Co-authored-by: Kang <kxiao.tiger@gmail.com>
…ache#1361)

* [improvement](dynamic-table) refine some logic
1. support filter malformed json lines
2. refactor block align logic
1: required Status.TStatus status
2: required i64 table_id
3: required list<Descriptors.TColumn> allColumns
4: required i32 schema_version
Contributor

Use optional

Contributor

@morningman morningman left a comment

LGTM, with 2 TODOs:

  1. Use optional for all proto and thrift fields
  2. Add documentation for new configs

@morningman morningman merged commit 37d1519 into apache:master Feb 11, 2023
YangShaw pushed a commit to YangShaw/doris that referenced this pull request Feb 17, 2023
luwei16 pushed a commit to luwei16/Doris that referenced this pull request Apr 7, 2023
* [WIP](dynamic-table) support dynamic schema table (apache#16335)

* [improve](dynamic-table) change `addColumns` RPC interface fields from `required` to `optional` and add config doc (apache#16632)

* [doc](dynamic-table) Add docs for dynamic-table (apache#16669)

* [improve](dynamic table) refine SegmentWriter columns writer generate (apache#16816)

* [improve](dynamic table) refine SegmentWriter columns writer generate

```
Dynamic Block consists of two parts, dynamic part of columns and static part of columns
static   dynamic
| ----- | ------- |
the static ones are the original _tablet_schema columns
the dynamic ones are auto generated and extended from file scan.
```
**We should only consider using Block info to generate columns when it's a dynamic table load procedure.**
And separate the static ones from the dynamic ones.

* test

* [typo](docs)fix dynamic Table version label (apache#16895)

* [Feature](Dynamic schema table) step1 support schema change expression (apache#17494)

1. introduce a new type `VARIANT` to encapsulate dynamically generated columns, hiding the details of the types and names of newly generated columns
2. introduce a new expression `SchemaChangeExpr` that performs schema changes, for extensibility

* Fix compile

* [Bug](dynamic-table) Fix column alignment logic and support filtering null values when slot is not null

Before this PR, when null values were encountered in columns specified as `NOT NULL`, they were not filtered; this behavior did not match the original load behavior.
Second, the column alignment logic has a bug:

```
template <typename ColumnInserterFn>
void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt,
                                    ColumnInserterFn inserter) {
    CHECK(dst.is_finalized() && src.is_finalized());
    // Use rows() here instead of size(), since size() will check_consistency
    // but we could not check_consistency since num_rows will be upgraded even
    // if src and dst is empty, we just increase the num_rows of dst and fill
    // num_rows of default values when meet new data
    size_t num_rows = dst.rows();
```

---------

Co-authored-by: jiafeng.zhang <zhangjf1@gmail.com>

Labels

area/load Issues or PRs related to all kinds of load area/planner Issues or PRs related to the query planner area/sql/function Issues or PRs related to the SQL functions area/vectorization reviewed

Development

Successfully merging this pull request may close these issues.

[Feature] support dynami schema table

6 participants