@eldenmoon
Member

@eldenmoon eldenmoon commented Aug 11, 2022

Proposed changes

Currently we use rapidjson to parse JSON documents. It is fast, but not as fast as simdjson. simdjson also has a parsing front end called simdjson::ondemand, which parses JSON lazily as fields are accessed and can extract a field's token directly from the original document. Using this feature, we can reduce the cost of string copies. For example, we currently convert everything to a string literal in _write_data_to_column with sprintf, and the flame graph shows a hotspot in this function; simdjson::to_json_string returns the token (a string piece) as a std::string_view, which is exactly what we need. Second, in _set_column_value we can iterate through the JSON document with for (auto field: object_val) {xxx}, which is much faster than looking up each field by its name, as in objectValue.FindMember("k1"). Third, the at_pointer interface that simdjson provides can fetch a JSON field directly from the original document.

Issue Number: close #11663

Problem summary

Use simdjson to parse JSON in vjsonscanner.

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Contributor

@yiguolei yiguolei left a comment

LGTM

@yiguolei yiguolei merged commit 01383c3 into apache:master Aug 16, 2022
eldenmoon added a commit to eldenmoon/incubator-doris that referenced this pull request Feb 18, 2023
be config:
enable_simdjson_reader=true

related PR apache#11665
qidaye pushed a commit that referenced this pull request Feb 21, 2023
be config:
enable_simdjson_reader=true

related PR #11665
yagagagaga pushed a commit to yagagagaga/doris that referenced this pull request Mar 9, 2023
luwei16 pushed a commit to luwei16/Doris that referenced this pull request Apr 7, 2023
* (Enhancement)[load-json] support simdjson in new json reader (apache#16903)

be config:
enable_simdjson_reader=true

related PR apache#11665

* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (apache#17124)

* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint.

`_simdjson_set_column_value` can become a hot spot when parsing JSON in simdjson mode.
Introduce `_prev_positions` to cache the lookup results from the previous row (keyed by index in the JSON object), since the order of JSON field names
should be nearly the same from one line to the next.

* fix case
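
The caching idea described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Doris implementation: the class `ColumnLocator`, its `find` method, and the `hint_hits` counter are hypothetical names, and only `_prev_positions` is taken from the commit message. The sketch assumes each JSON key arrives together with its position inside the current object, and that a hash lookup by name is the slow path being avoided.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: maps a JSON key to a column index, using the
// result from the previous row as a hint before falling back to a hash lookup.
class ColumnLocator {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit ColumnLocator(std::vector<std::string> column_names)
            : _column_names(std::move(column_names)) {
        for (std::size_t i = 0; i < _column_names.size(); ++i) {
            _name_to_index.emplace(_column_names[i], i);
        }
    }

    // key_index is the position of the key inside the current JSON object.
    std::size_t find(std::string_view key, std::size_t key_index) {
        if (key_index < _prev_positions.size()) {
            const std::size_t hint = _prev_positions[key_index];
            if (hint != npos) {
                assert(hint < _column_names.size());
                // Fast path: same key at the same position as in the previous row.
                if (_column_names[hint] == key) {
                    ++_hint_hits;
                    return hint;
                }
            }
        } else {
            _prev_positions.resize(key_index + 1, npos);
        }
        // Slow path: look the key up by name and remember the result as the new hint.
        const auto it = _name_to_index.find(std::string(key));
        const std::size_t found = (it == _name_to_index.end()) ? npos : it->second;
        _prev_positions[key_index] = found;
        return found;
    }

    std::size_t hint_hits() const { return _hint_hits; }

private:
    std::vector<std::string> _column_names;
    std::unordered_map<std::string, std::size_t> _name_to_index;
    std::vector<std::size_t> _prev_positions; // keyed by index in the JSON object
    std::size_t _hint_hits = 0;
};
```

When consecutive rows list their keys in the same order, every lookup after the first row is a single string comparison against the hinted column. A row with a different key order simply falls back to the hash lookup and refreshes the hint for that position.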
swjtu-zhanglei pushed a commit to swjtu-zhanglei/incubator-doris that referenced this pull request Jul 25, 2023
… (apache#1445)

* (Enhancement)[load-json] support simdjson in new json reader (apache#16903)

be config:
enable_simdjson_reader=true

related PR apache#11665

* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (apache#17124)

* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint.

`_simdjson_set_column_value` can become a hot spot when parsing JSON in simdjson mode.
Introduce `_prev_positions` to cache the lookup results from the previous row (keyed by index in the JSON object), since the order of JSON field names
should be nearly the same from one line to the next.

* fix case
Development

Successfully merging this pull request may close these issues.

[Enhancement] support simdjson to parse json document when load

2 participants