Json-Parse: Parse with `RapidJson::kParseStopWhenDoneFlag` in Unix by swaroop-sridhar · Pull Request #35073 · dotnet/runtime

swaroop-sridhar · 2020-04-16T21:35:37Z

RapidJson's kParseStopWhenDoneFlag indicates that parsing should stop once the root element is parsed, and should not throw an error for any further content.

This flag is useful when the input stream doesn't always have a null-terminator --
ex: an input bytestream (rather than a stringstream), files embeded within the single file bundle.
The limitation of using this flag is that any actual random text after the root element is silently ignored.

Currently, RapidJson is invoked with kParseStopWhenDoneFlag on Windows, but not on Unix.
This caused kParseErrorDocumentRootNotSingular failure on Unix when parsing json files from single-file bundles.

This change fixes this problem by passing kParseStopWhenDoneFlag on all platforms.
This also makes the host behavior consistent across platforms.

ghost · 2020-04-16T21:35:42Z

Tagging subscribers to this area: @vitek-karas, @swaroop-sridhar
Notify danmosemsft if you want to be subscribed.

ghost · 2020-04-16T21:35:50Z

Tagging subscribers to this area: @swaroop-sridhar
Notify danmosemsft if you want to be subscribed.

lpereira · 2020-04-17T02:34:22Z

+
+            m_bundle_data[m_bundle_location->size] = 0;
+#endif
            bool result = parse_json(m_bundle_data, m_bundle_location->size, path);


Wouldn't it be better to alter parse_json() to use the Parse() overloads that take a length?

I find this dangerous - what if there's real data right after the JSON in the single file. In that case we would be writing over it. If there's a way to tell the parser the length (like Leandro suggests) that would be a much cleaner solution.

@leandro: On Linux, the Json parser uses insitu parsing, which requires a null-terminated string. There is no variant of ParseInsitu that takes a size.
https://github.com/dotnet/runtime/blob/master/src/installer/corehost/cli/json/rapidjson/document.h#L2296-L2314

@vitek-karas The json files are mapped copy-on-write, so there's no risk in overwriting the content beyond the json file.

It is interesting to decide between the tradeoffs:

Map the bundle copy-on-write and call ParseInsitu(). This will pretty-much copy all the pages.

Map the bundle as read-only, and call Parse(). This will copy out strings to new buffers during parsing.

So, I used the copy-on-write mapping approach because:

The Rapid Json documentation says that parsing incurs a large overhead because of copying individual strings to new places during parsing. Copying whole pages could be more efficient on modern hardware (thanks to write-buffers). Of course, we'll need to measure the difference to be sure.

It keeps the code somewhat simpler -- ex: no need to ParseInsitu() for normal apps, Parse() for single-file apps.

OK - it will work... but I think it's worth a detailed comment explanation. It's not only that we map it as copy-on-write. It's also that we map it separately for each JSON we read (or rather we only read one JSON per mapping). Copy-on-write itself is not enough.
I wonder if there's some kind of assert or something we can do.

Honestly if we choose this solution I would actually prefer to modify the bundler to force-inject at least one byte of padding after the config files. Size wise it's trivial and it would make this always correct and robust (and we could make that padding a specific value and assert it in the hostpolicy).

Just pass the kParseInSitu flag to the Parse() method that takes a length. ParseInsitu() is just a convenience method.

In fact, I think that not passing the length is why the Windows version has the kParseStopWhenDoneFlag... hindsight is 20/20. Maybe json_parse() should be something like this?

constexpr auto flags = rapidjson::ParseFlag::kParseCommentsFlag #ifndef _WIN32 // Can't use in-situ parsing on Windows, as JSON data is encoded in // UTF-8 and the host expects wide strings. m_document will store // data in UTF-16 (with pal::char_t as the character type), but it // has to know that data is encoded in UTF-8 to convert during parsing. | rapidjson::ParseFlag::kParseInSituFlag #endif // _WIN32 ; m_document.Parse<flags, rapidjson::UTF8<>>(data, length);

@lpereira, the Parse() variants that take a size arguments use memorystream and are not compatible with in-situ parsing. Here's the check that precludes using Parse() with kParseInsituFlag.

However, from your comments, I noticed that kParseStopWhenDoneFlag flag, which serves this purpose on Windows. So, I've updated the PR to use kParseStopWhenDoneFlag on all architectures.

Like @vitek-karas suggested above, we could just as easily add the null terminator from the bundle generator. However, this approach of using kParseStopWhenDoneFlag is simpler (less coupling between bundler and hosts) and makes the parser behavior more consistent across architectures.

I've updated the PR notes above wrt to the new fix. Thanks.

…hitectures RapidJson's `kParseStopWhenDoneFlag` indicates that parsing should stop once the root element is parsed, and should not throw an error for any further content. This flag is useful when the input stream doesn't always have a null-terminator -- ex: an input `bytestream` (rather than a `stringstream`), files embeded within the single file bundle. The limitation of using this flag is that any actual random text after the root element is silently ignored. Currently, RapidJson is invoked with `kParseStopWhenDoneFlag` on Windows, but not on Unix. This caused `kParseErrorDocumentRootNotSingular` failure on Unix when parsing json files from single-file bundles. This change fixes this problem by passing `kParseStopWhenDoneFlag` on all platforms. This also makes the host behavior consistent across platforms.

ghost · 2020-04-18T01:41:15Z

Tagging subscribers to this area: @vitek-karas, @swaroop-sridhar
Notify danmosemsft if you want to be subscribed.

vitek-karas

This is nice - much cleaner way to fix this.

Nit: Just so that we have good test coverage, can you please add a test (non-single-file is fine) which intentionally has some garbage after the main object in .runtimconfig.json. Just so that we have a test covering that part of the functionality.

swaroop-sridhar · 2020-04-20T19:40:19Z

@lpereira I can't merge this change since you have the "changes required" hold.
Can you please take a look at it now? Thanks.

swaroop-sridhar · 2020-04-20T19:43:03Z

Nit: Just so that we have good test coverage, can you please add a test (non-single-file is fine) which intentionally has some garbage after the main object in .runtimconfig.json. Just so that we have a test covering that part of the functionality.

Sure, I'll add a test case.
However, the kParseStopWhenDoneFlag is actually relaxing a restriction. We don't need to explicitly support the case where there is junk content after the root element in a json file. The useful case of memory-maps is already tested by PublishSingleFile apps.

lpereira · 2020-04-20T20:04:06Z

Sure, looks good. I don't remember why I didn't enable this flag for all platforms, but it's better to be more consistent. Since CI seems happy, I'm happy with this change.

…hitectures (dotnet#35073) RapidJson's `kParseStopWhenDoneFlag` indicates that parsing should stop once the root element is parsed, and should not throw an error for any further content. This flag is useful when the input stream doesn't always have a null-terminator -- ex: an input `bytestream` (rather than a `stringstream`), files embeded within the single file bundle. The limitation of using this flag is that any actual random text after the root element is silently ignored. Currently, RapidJson is invoked with `kParseStopWhenDoneFlag` on Windows, but not on Unix. This caused `kParseErrorDocumentRootNotSingular` failure on Unix when parsing json files from single-file bundles. This change fixes this problem by passing `kParseStopWhenDoneFlag` on all platforms. This also makes the host behavior consistent across platforms.

Dotnet-GitSync-Bot added the area-Host label Apr 16, 2020

swaroop-sridhar added the area-Single-File label Apr 16, 2020

swaroop-sridhar removed the area-Host label Apr 16, 2020

swaroop-sridhar requested a review from lpereira April 16, 2020 21:36

lpereira suggested changes Apr 17, 2020

View reviewed changes

swaroop-sridhar force-pushed the parse branch from 77ce444 to 430fa58 Compare April 18, 2020 01:31

swaroop-sridhar changed the title ~~Single-File: Add Null-Terminator in-memory json stream.~~ Json-Parse: Parse with RapidJson::kParseStopWhenDoneFlag in Unix Apr 18, 2020

swaroop-sridhar added the area-Host label Apr 18, 2020

swaroop-sridhar removed the area-Single-File label Apr 18, 2020

vitek-karas approved these changes Apr 20, 2020

View reviewed changes

lpereira approved these changes Apr 20, 2020

View reviewed changes

swaroop-sridhar merged commit 342ed3c into dotnet:master Apr 23, 2020

swaroop-sridhar mentioned this pull request Apr 30, 2020

[release/5.0-preview4] Single-File: Fix a json parsing issue #35656

Closed

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

Conversation

swaroop-sridhar commented Apr 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Apr 16, 2020

Uh oh!

ghost commented Apr 16, 2020

Uh oh!

lpereira Apr 17, 2020

Choose a reason for hiding this comment

Uh oh!

vitek-karas Apr 17, 2020

Choose a reason for hiding this comment

Uh oh!

swaroop-sridhar Apr 17, 2020

Choose a reason for hiding this comment

Uh oh!

vitek-karas Apr 17, 2020

Choose a reason for hiding this comment

Uh oh!

lpereira Apr 17, 2020

Choose a reason for hiding this comment

Uh oh!

lpereira Apr 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swaroop-sridhar Apr 18, 2020

Choose a reason for hiding this comment

Uh oh!

swaroop-sridhar Apr 18, 2020

Choose a reason for hiding this comment

Uh oh!

ghost commented Apr 18, 2020

Uh oh!

vitek-karas left a comment

Choose a reason for hiding this comment

Uh oh!

swaroop-sridhar commented Apr 20, 2020

Uh oh!

swaroop-sridhar commented Apr 20, 2020

Uh oh!

lpereira commented Apr 20, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

swaroop-sridhar commented Apr 16, 2020 •

edited

Loading

lpereira Apr 17, 2020 •

edited

Loading