Add parser callback with the ability to filter results. #41
aburgh wants to merge 12 commits into nlohmann:master from
Conversation
…o that streams are read incrementally.
…ssed, including the ability to reject individual elements.
I like the idea and I love the way you improved the parser, but my first experiments show that the runtime is twice as high. Consider the following code:

```cpp
#include <json.hpp>
#include <fstream>

int main(int argc, char** argv)
{
    std::ifstream input_file(argv[1]);
    nlohmann::json j;
    j << input_file;
}
```

For the old version, I can read https://github.com/miloyip/nativejson-benchmark/blob/master/data/canada.json in 140 ms. The new parser takes 260 ms. I used clang 3.6 with …
I agree the default callback is probably adding some time and I think it's worth investigating, but I doubt that it would double the time (see below). I declared the default callback as static, but I don't recall why; now I don't think it adds anything, and I wonder if it may prevent inlining. I suspect the biggest performance hit comes from using the …

If the performance penalty can't be eliminated, the parse with callback could be added as a totally separate function. Or it could even be left out of your main code and included as a user-contributed patch for those who could use it. It's critical to me for parsing a 1+ GB file, but my case is probably uncommon.
Hi Aaron, thanks for answering! First off, please ignore the messages from AppVeyor. I am currently trying to check whether MSVC 2015's C++11 support is as good as some people claim... Second, I'll check the pull request as soon as I can find the time. This weekend, I tried to build a version of …

I understand your use case, but, if possible, I would like to get an idea of the input data. Getting a 1+ GB JSON file with a real-life task would be a nice benchmark, especially as it is not about how many milliseconds it takes to create an object.

All the best
I ran a debug build in Apple's Instruments to profile your test program with the canada.json data. It spent 1.4% of the time in the default callback, so I think we can ignore it as the problem. I also reminded myself why it's static: it's a user-supplied function and thus not a member function, so it has to be static when defined in its current location, and I put it inside the class declaration so that it could easily specify a … I tried changing the …

The program spent a lot of time in push_back and in destructing map and vector containers. Looking at the patch diff, I suspect this change is significant:

```diff
- result.push_back(parse_internal());
+ auto value = parse_internal(keep);
+ if (keep and not value.is_discarded())
+ {
+     result.push_back(value);
+ }
```

The sample data contains a lot of arrays, and that …
Hi Aaron, I'll check the code in a minute. I also checked the rest of the parser: a lot of time is wasted when arrays/objects are parsed, because they begin with empty capacity and are resized gradually. I think there is great room for improvement. Another thing is the string handling: the escape function does a terrible job. I'll keep an eye on that.
I tried again with a larger file (http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). I get 2410 ms for the version without callback (clang 3.6, …).
I found another instance where … I didn't compare it to the non-callback version yet, but that one move makes another big performance improvement.

P.S. Nice find for a sample file!
With the additional change described in my previous comment, here are some test results with Xcode 6.3.1 and flags …

Three runs of the jeopardy file: …

Three runs of the canada.json file: …
Do you want to see more improvement before merging? If not, do you want me to update the pull request?
That sounds awesome! Let me give it a try, and then I'll merge. Thanks so much! 👍🏻
Hi @aburgh, I pulled your code and made some minor adjustments. Thanks a lot, and thanks for your patience!
Hi Niels, it appears you didn't include the performance tweaks we found. Would you like me to submit another pull request with them?
Oops... I had problems merging the code. Sorry for that. Yes, another pull request against the current version would be great!
This was closed with #69.
This request builds on the "incremental" pull request. I separated the two in case you find this change objectionable. The changes implement a callback to a user-provided function (which can be a closure) to notify the user of key parser events: entering object and array elements, closing object and array elements, parsing an object key, and parsing a value. This enables processing elements as they are parsed, for example to provide progress feedback. More importantly, the user function returns a bool to indicate whether to keep the value. This can be used to filter the accumulated elements to reduce memory consumption. A default callback is provided, so existing code should compile and work as before.
Below is an example use case. It parses a JSON file that consists of an array (which is inside a simple object) of a large number of objects. The example just pretty-prints the result, discarding all dictionaries at a depth of 2, but it could do more interesting processing. Without the callback, a 4.1 MB test file uses 12.5 MB of memory. With the callback, it peaks at around 680 KB, most of which is process overhead.