Skip to content

Support Json unquote function#8407

Merged
ti-chi-bot[bot] merged 29 commits into
pingcap:masterfrom
yibin87:json_unquote
Nov 28, 2023
Merged

Support Json unquote function#8407
ti-chi-bot[bot] merged 29 commits into
pingcap:masterfrom
yibin87:json_unquote

Conversation

@yibin87
Copy link
Copy Markdown
Contributor

@yibin87 yibin87 commented Nov 22, 2023

What problem does this PR solve?

Issue Number: close #8334

Problem Summary:

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 22, 2023
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 22, 2023

/run-all-tests

4 similar comments
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@purelind
Copy link
Copy Markdown
Contributor

/run-all-tests

@purelind
Copy link
Copy Markdown
Contributor

/rebuild

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/run-integration-test

@yibin87 yibin87 changed the title [WIP] Support Json unquote function Support Json unquote function Nov 23, 2023
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2023
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2023
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 23, 2023

run-integration-test

Comment thread dbms/src/TiDB/Decode/JsonScanner.cpp
Comment on lines +38 to +67
auto & factory = FunctionFactory::instance();
ColumnsWithTypeAndName columns({input_column});
ColumnNumbers argument_column_numbers;
for (size_t i = 0; i < columns.size(); ++i)
argument_column_numbers.push_back(i);

ColumnsWithTypeAndName arguments;
for (const auto argument_column_number : argument_column_numbers)
arguments.push_back(columns.at(argument_column_number));

const String func_name = "cast_json_as_string";
auto builder = factory.tryGet(func_name, context);
if (!builder)
throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
auto func = builder->build(arguments, nullptr);
auto * function_build_ptr = builder.get();
if (auto * default_function_builder = dynamic_cast<DefaultFunctionBuilder *>(function_build_ptr);
default_function_builder)
{
auto * function_impl = default_function_builder->getFunctionImpl().get();
if (auto * function_cast_json_as_string = dynamic_cast<FunctionsCastJsonAsString *>(function_impl);
function_cast_json_as_string)
{
function_cast_json_as_string->setOutputTiDBFieldType(field_type);
}
else
{
throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems useless because DAGExpressionAnalyerHelper will be called when raw_function_test is false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't get your point here, just introduce this method for test to set tidb field type here

Comment thread dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp Outdated
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/rebuild

@yibin87 yibin87 requested a review from SeaRise November 24, 2023 02:35
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

3 similar comments
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

Comment thread dbms/src/Functions/FunctionsJson.h Outdated
Comment thread dbms/src/Functions/FunctionsJson.h Outdated
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like if byte_length > element_write_buffer.count(), it will append random bytes, is it the expected behavior?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Looks like if there is a method to get current pos in write_buffer, we don't need to write tmp result into element_write_buffer and copy it to write_buffer after the byte length check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte_length is expected to be equal or fewer than orig_length, thus shouldn't be byte_length > element_write_buffer.count() case.
And it is not common to set char length here, thus use tmp result to make code more readable.

Copy link
Copy Markdown
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But theoretical speaking, we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw Exception in charLengthToByteLengthFromUTF8 if ret > length?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte_length = std::min(byte_length, orig_length); is executed after charLengthToByteLengthFromUTF8, thus byte_length <= orig_length. Not sure if this answer your question.

Copy link
Copy Markdown
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But charLengthToByteLengthFromUTF8 can not guarantee this if it is not a valid utf8 string, so I suggest to throw an exception in charLengthToByteLengthFromUTF8 if ret > length

FormatImpl<FromDataType>::execute(vec_from[i], element_write_buffer, &type, nullptr);
size_t byte_length = element_write_buffer.count();
if (tp.flen() > 0)
if (tp.flen() >= 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bug fix here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a existing bug.

@yibin87 yibin87 requested a review from windtalker November 27, 2023 03:34
Comment thread dbms/src/Functions/FunctionsJson.h Outdated
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 27, 2023

/hold

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

Signed-off-by: yibin <huyibin@pingcap.com>
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

@yibin87 yibin87 requested a review from SeaRise November 28, 2023 06:07
Comment thread dbms/src/Functions/tests/gtest_json_array.cpp
Comment thread dbms/src/Functions/FunctionsJson.h Outdated
json_binary.toStringInBuffer(element_write_buffer);
}

size_t orig_length = element_write_buffer.count();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L475-L483 should be inside the above else branch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it can reduce useless code for null case. I'll move it.

Comment thread dbms/src/Functions/FunctionsJson.h Outdated
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Copy Markdown
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But theoretical speaking, we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw Exception in charLengthToByteLengthFromUTF8 if ret > length?

Comment thread dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp Outdated
@SeaRise SeaRise self-requested a review November 28, 2023 07:19
Signed-off-by: yibin <huyibin@pingcap.com>
@yibin87 yibin87 requested a review from windtalker November 28, 2023 07:32
Comment thread dbms/src/Functions/FunctionsJson.h Outdated
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not necessary since charLengthToByteLengthFromUTF8 should ensure that the return value is less than orig_length?

Comment on lines +468 to +481
JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(element_write_buffer);
size_t orig_length = element_write_buffer.count();
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Copy Markdown
Contributor

@SeaRise SeaRise Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

Suggested change
JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(element_write_buffer);
size_t orig_length = element_write_buffer.count();
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
auto start_pos = write_buffer.offset();
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(write_buffer);
auto end_pos = write_buffer.offset();
auto orig_length = end_pos - start_pos;
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(write_buffer.data() + start_offset),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < orig_length)
{
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.setOffset(start_pos + byte_length);
}

?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid one more memcpy.

Copy link
Copy Markdown
Contributor Author

@yibin87 yibin87 Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah,you're right. I just think this code path is not common used(because cast json as fixed length char is valid but strange), and even if it is used the performance won't drop significantly, thus choose to use the temporary buffer here to make code more easier.

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 28, 2023
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 28, 2023

/unhold

@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2023
@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

Copy link
Copy Markdown
Contributor

@windtalker windtalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 28, 2023
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Nov 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SeaRise, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Nov 28, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-11-28 08:22:31.526480833 +0000 UTC m=+910980.191707013: ☑️ agreed by SeaRise.
  • 2023-11-28 08:39:03.4456276 +0000 UTC m=+911972.110853795: ☑️ agreed by windtalker.

@yibin87
Copy link
Copy Markdown
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Nov 28, 2023

@yibin87: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

trigger some heavy tests which will not run always when PR updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot Bot merged commit 4479df8 into pingcap:master Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support json_unquote function

4 participants