Support Json unquote function by yibin87 · Pull Request #8407 · pingcap/tiflash

yibin87 · 2023-11-22T06:22:07Z

What problem does this PR solve?

Issue Number: close #8334

Problem Summary:

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

yibin87 · 2023-11-22T06:23:14Z

/run-all-tests

yibin87 · 2023-11-23T03:58:33Z

/run-all-tests

yibin87 · 2023-11-23T07:46:18Z

/run-all-tests

yibin87 · 2023-11-23T08:18:10Z

/run-all-tests

purelind · 2023-11-23T08:19:42Z

/run-all-tests

purelind · 2023-11-23T08:22:18Z

/rebuild

yibin87 · 2023-11-23T08:22:49Z

/run-integration-test

yibin87 · 2023-11-23T08:49:38Z

/run-all-tests

yibin87 · 2023-11-23T08:50:11Z

/hold

yibin87 · 2023-11-23T09:11:17Z

run-integration-test

SeaRise · 2023-11-23T09:22:55Z

+        auto & factory = FunctionFactory::instance();
+        ColumnsWithTypeAndName columns({input_column});
+        ColumnNumbers argument_column_numbers;
+        for (size_t i = 0; i < columns.size(); ++i)
+            argument_column_numbers.push_back(i);
+
+        ColumnsWithTypeAndName arguments;
+        for (const auto argument_column_number : argument_column_numbers)
+            arguments.push_back(columns.at(argument_column_number));
+
+        const String func_name = "cast_json_as_string";
+        auto builder = factory.tryGet(func_name, context);
+        if (!builder)
+            throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
+        auto func = builder->build(arguments, nullptr);
+        auto * function_build_ptr = builder.get();
+        if (auto * default_function_builder = dynamic_cast<DefaultFunctionBuilder *>(function_build_ptr);
+            default_function_builder)
+        {
+            auto * function_impl = default_function_builder->getFunctionImpl().get();
+            if (auto * function_cast_json_as_string = dynamic_cast<FunctionsCastJsonAsString *>(function_impl);
+                function_cast_json_as_string)
+            {
+                function_cast_json_as_string->setOutputTiDBFieldType(field_type);
+            }
+            else
+            {
+                throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
+            }
+        }


This seems unnecessary, because DAGExpressionAnalyzerHelper will be called when raw_function_test is false.

I don't quite get your point. This method is introduced only so that tests can set the TiDB field type here.

yibin87 · 2023-11-24T01:24:10Z

/run-all-tests

yibin87 · 2023-11-24T01:35:26Z

/rebuild

yibin87 · 2023-11-24T03:14:48Z

/run-all-tests

yibin87 · 2023-11-24T05:16:04Z

/run-all-tests

yibin87 · 2023-11-24T05:52:12Z

/run-all-tests

yibin87 · 2023-11-24T08:09:12Z

/run-all-tests

windtalker · 2023-11-27T03:06:00Z

+                    byte_length = std::min(byte_length, orig_length);
+                    if (byte_length < element_write_buffer.count())
+                        context.getDAGContext()->handleTruncateError("Data Too Long");
+                    write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


It looks like if byte_length > element_write_buffer.count(), this will append random bytes. Is that the expected behavior?

Also, if there were a method to get the current position in write_buffer, we wouldn't need to write the temporary result into element_write_buffer and copy it to write_buffer after the byte length check?

byte_length is expected to be less than or equal to orig_length, so the byte_length > element_write_buffer.count() case shouldn't occur.
Setting a char length here is also uncommon, so the temporary result is used to keep the code more readable.

But theoretically speaking, don't we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw an exception in charLengthToByteLengthFromUTF8 if ret > length?

byte_length = std::min(byte_length, orig_length); is executed after charLengthToByteLengthFromUTF8, so byte_length <= orig_length. Not sure if this answers your question.

But charLengthToByteLengthFromUTF8 cannot guarantee this if the input is not a valid UTF-8 string, so I suggest throwing an exception in charLengthToByteLengthFromUTF8 if ret > length.
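The suggestion in this thread can be illustrated with a minimal sketch. It is not TiFlash's charLengthToByteLengthFromUTF8; the function name charLengthToByteLengthChecked, its signature, and the use of std::runtime_error are assumptions made for illustration. It only inspects lead bytes and sequence length (continuation bytes are not validated), and throws instead of returning a byte count larger than the input length.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not the TiFlash implementation).
// Returns the number of bytes covered by the first `char_length` UTF-8
// characters of data[0, length). Throws if a lead byte is invalid or if a
// character would extend past `length`, so the result never exceeds `length`.
size_t charLengthToByteLengthChecked(const char * data, size_t length, size_t char_length)
{
    assert(data != nullptr || length == 0);
    size_t pos = 0;
    for (size_t chars = 0; chars < char_length && pos < length; ++chars)
    {
        const auto lead = static_cast<unsigned char>(data[pos]);
        size_t width = 0;
        if (lead < 0x80)
            width = 1; // ASCII
        else if ((lead & 0xE0) == 0xC0)
            width = 2;
        else if ((lead & 0xF0) == 0xE0)
            width = 3;
        else if ((lead & 0xF8) == 0xF0)
            width = 4;
        else
            throw std::runtime_error("invalid UTF-8 lead byte");
        // pos < length holds here, so the subtraction cannot underflow.
        if (width > length - pos)
            throw std::runtime_error("truncated UTF-8 sequence");
        pos += width;
    }
    return pos;
}
```

With a check like this inside the helper, the caller's std::min(byte_length, orig_length) clamp would be redundant, which is the point raised later in the review.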

windtalker · 2023-11-27T03:09:45Z

                FormatImpl<FromDataType>::execute(vec_from[i], element_write_buffer, &type, nullptr);
                size_t byte_length = element_write_buffer.count();
-                if (tp.flen() > 0)
+                if (tp.flen() >= 0)


Is it a bug fix here?

Yes, it is an existing bug.

yibin87 · 2023-11-27T06:29:22Z

/hold

yibin87 · 2023-11-28T05:11:01Z

/run-all-tests

Signed-off-by: yibin <huyibin@pingcap.com>

yibin87 · 2023-11-28T05:16:54Z

/run-all-tests

windtalker · 2023-11-28T07:09:03Z

+                        json_binary.toStringInBuffer(element_write_buffer);
+                    }
+
+                    size_t orig_length = element_write_buffer.count();


L475-L483 should be inside the above else branch?

Yeah, that avoids useless work in the null case. I'll move it.

windtalker · 2023-11-28T07:13:25Z

+                    byte_length = std::min(byte_length, orig_length);
+                    if (byte_length < element_write_buffer.count())
+                        context.getDAGContext()->handleTruncateError("Data Too Long");
+                    write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


But theoretically speaking, don't we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw an exception in charLengthToByteLengthFromUTF8 if ret > length?

Signed-off-by: yibin <huyibin@pingcap.com>

windtalker · 2023-11-28T07:44:22Z

+                            reinterpret_cast<char *>(container_per_element.data()),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);


It looks like this is not necessary, since charLengthToByteLengthFromUTF8 should ensure that the return value does not exceed orig_length?

SeaRise · 2023-11-28T07:50:17Z

+                        JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
+                        JsonBinary json_binary(
+                            data_from[current_offset],
+                            StringRef(&data_from[current_offset + 1], json_length - 1));
+                        json_binary.toStringInBuffer(element_write_buffer);
+                        size_t orig_length = element_write_buffer.count();
+                        auto byte_length = charLengthToByteLengthFromUTF8(
+                            reinterpret_cast<char *>(container_per_element.data()),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);
+                        if (byte_length < element_write_buffer.count())
+                            context.getDAGContext()->handleTruncateError("Data Too Long");
+                        write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


How about this?

Suggested change

-                        JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
-                        JsonBinary json_binary(
-                            data_from[current_offset],
-                            StringRef(&data_from[current_offset + 1], json_length - 1));
-                        json_binary.toStringInBuffer(element_write_buffer);
-                        size_t orig_length = element_write_buffer.count();
-                        auto byte_length = charLengthToByteLengthFromUTF8(
-                            reinterpret_cast<char *>(container_per_element.data()),
-                            orig_length,
-                            tidb_tp->flen());
-                        byte_length = std::min(byte_length, orig_length);
-                        if (byte_length < element_write_buffer.count())
-                            context.getDAGContext()->handleTruncateError("Data Too Long");
-                        write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
+                        auto start_pos = write_buffer.offset();
+                        JsonBinary json_binary(
+                            data_from[current_offset],
+                            StringRef(&data_from[current_offset + 1], json_length - 1));
+                        json_binary.toStringInBuffer(write_buffer);
+                        auto end_pos = write_buffer.offset();
+                        auto orig_length = end_pos - start_pos;
+                        auto byte_length = charLengthToByteLengthFromUTF8(
+                            reinterpret_cast<char *>(write_buffer.data() + start_pos),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);
+                        if (byte_length < orig_length)
+                        {
+                            context.getDAGContext()->handleTruncateError("Data Too Long");
+                            write_buffer.setOffset(start_pos + byte_length);
+                        }

This avoids one more memcpy.

Yeah, you're right. I just think this code path is not commonly used (casting JSON as a fixed-length char is valid but strange), and even if it is used the performance won't drop significantly, so I chose to use the temporary buffer here to keep the code simpler.
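The trade-off discussed here (write into the destination and roll back, rather than stage in a temporary buffer and copy) can be sketched outside TiFlash. In this sketch a std::string stands in for write_buffer, and appendTruncated and max_bytes are hypothetical names chosen for illustration; the real code would obtain the byte limit from charLengthToByteLengthFromUTF8 and serialize through json_binary.toStringInBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical illustration of "write in place, then roll back".
// Appends `value` to `out`, keeping at most `max_bytes` bytes of it.
// Returns true if the appended value had to be truncated.
bool appendTruncated(std::string & out, const std::string & value, size_t max_bytes)
{
    const size_t start_pos = out.size(); // where this element begins
    out += value; // write directly into the destination, no staging buffer
    const size_t orig_length = out.size() - start_pos;
    assert(orig_length == value.size());
    if (max_bytes < orig_length)
    {
        out.resize(start_pos + max_bytes); // drop the surplus bytes
        return true;
    }
    return false;
}
```

The staged version pays one extra copy per element but keeps the truncation logic separate from the destination buffer, which is the readability argument made in the reply above.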

Signed-off-by: yibin <huyibin@pingcap.com>

yibin87 · 2023-11-28T08:30:17Z

/unhold

yibin87 · 2023-11-28T08:30:25Z

/run-all-tests

windtalker

LGTM

ti-chi-bot · 2023-11-28T08:39:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SeaRise, windtalker


Needs approval from an approver in each of these files:

~~OWNERS~~ [SeaRise,windtalker]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2023-11-28T08:39:04Z

[LGTM Timeline notifier]

Timeline:

2023-11-28 08:22:31.526480833 +0000 UTC m=+910980.191707013: ☑️ agreed by SeaRise.
2023-11-28 08:39:03.4456276 +0000 UTC m=+911972.110853795: ☑️ agreed by windtalker.

yibin87 · 2023-11-28T08:44:56Z

/run-all-tests

ti-chi-bot · 2023-11-28T08:49:30Z

@yibin87: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

This triggers some heavy tests that do not always run when the PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.


ti-chi-bot Bot added the release-note-none, do-not-merge/work-in-progress, and size/XXL labels Nov 22, 2023

yibin87 changed the title ~~[WIP] Support Json unquote function~~ Support Json unquote function Nov 23, 2023

ti-chi-bot Bot removed the do-not-merge/work-in-progress label Nov 23, 2023

yibin87 requested review from SeaRise and windtalker November 23, 2023 08:47

ti-chi-bot Bot added the do-not-merge/hold label Nov 23, 2023

SeaRise reviewed Nov 23, 2023

yibin87 requested a review from SeaRise November 24, 2023 02:35

windtalker reviewed Nov 27, 2023

yibin87 requested a review from windtalker November 27, 2023 03:34

SeaRise reviewed Nov 27, 2023

Comment thread dbms/src/Functions/FunctionsJson.h Outdated

Fix format issue

9b3e744

Signed-off-by: yibin <huyibin@pingcap.com>

yibin87 requested a review from SeaRise November 28, 2023 06:07

windtalker reviewed Nov 28, 2023

SeaRise reviewed Nov 28, 2023

Comment thread dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp Outdated

SeaRise self-requested a review November 28, 2023 07:19

Address comments

28a6233

Signed-off-by: yibin <huyibin@pingcap.com>

yibin87 requested a review from windtalker November 28, 2023 07:32

windtalker reviewed Nov 28, 2023

SeaRise reviewed Nov 28, 2023

Address comments to throw exception when invalid utf8 code encountered

b7752df

Signed-off-by: yibin <huyibin@pingcap.com>

yibin87 requested review from SeaRise and windtalker November 28, 2023 08:21

SeaRise approved these changes Nov 28, 2023

ti-chi-bot Bot added the needs-1-more-lgtm and approved labels Nov 28, 2023

ti-chi-bot Bot removed the do-not-merge/hold label Nov 28, 2023

windtalker approved these changes Nov 28, 2023

ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm label Nov 28, 2023

Merge branch 'master' into json_unquote

d8cbc35

Merge branch 'master' into json_unquote

572fea9

ti-chi-bot Bot merged commit 4479df8 into pingcap:master Nov 28, 2023
