stdlib: json, expose some internals for jsonStringify users #20962

Closed

Arwalk wants to merge 1 commit into ziglang:master from Arwalk:open_up_some_json_internals

Conversation

@Arwalk (Contributor) commented Aug 6, 2024

The json library allows custom representations if an object implements the jsonStringify interface.

Some of the internals of the json library are exposed for users of this functionality, such as beginObject, beginArray, etc. But nothing is available when the goal is to push a raw value (such as a string) directly into the stream.

This commit makes valueStart and valueDone public, so that people don't have to reimplement (and track the development of) parts of the json lib themselves.

I ran into exactly this case in https://github.com/Arwalk/zig-protobuf when implementing JSON parsing/encoding: bytes fields are supposed to be base64 encoded, meaning I had to reimplement valueStart and valueDone just to be able to do my encoding.
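For illustration, here is roughly what such a custom serializer could look like once valueStart and valueDone are public. This is a hedged sketch: the Bytes wrapper type is hypothetical, but valueStart/valueDone and the stream field are the internals this PR proposes to expose.

```zig
const std = @import("std");

// Hypothetical wrapper type for a protobuf `bytes` field.
const Bytes = struct {
    data: []const u8,

    pub fn jsonStringify(self: Bytes, jws: anytype) !void {
        // valueStart/valueDone handle the comma and indentation
        // bookkeeping around a value; this PR proposes making them pub.
        try jws.valueStart();
        try jws.stream.writeByte('"');
        var buf: [4]u8 = undefined;
        var it = std.mem.window(u8, self.data, 3, 3);
        while (it.next()) |chunk| {
            // Each 3-byte chunk encodes to at most 4 base64 chars,
            // so no large intermediate buffer is needed.
            try jws.stream.writeAll(std.base64.standard.Encoder.encode(&buf, chunk));
        }
        try jws.stream.writeByte('"');
        jws.valueDone();
    }
};
```

This streams the base64 text chunk by chunk, which is exactly the "no doubled memory footprint" property discussed below.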

@Arwalk Arwalk requested a review from thejoshwolfe as a code owner August 6, 2024 20:00
@Arwalk Arwalk force-pushed the open_up_some_json_internals branch from dff2674 to d0e8171 on August 6, 2024 20:11
@thejoshwolfe (Contributor)

Are you able to use .print() for your use case?

@Arwalk (Contributor, Author) commented Aug 7, 2024

Sorry, I wasn't explicit enough about the goal: encoding directly into the JSON stream without unnecessary allocations.

print would have worked, but it still would have required me to encode to base64 first and then use print to insert the result into the stream.

Since the whole point of base64 is to encode binary data that can be quite large, this would effectively have doubled the memory footprint here.

@thejoshwolfe (Contributor)

Ah, I see. You want to stream large amounts of data into the write stream as a single json value. So then exposing the begin and end methods seems to make sense.

Exposing these functions is more effort than just putting pub on them; we gotta add documentation and tests, and we gotta consider the implications for the API overall (which might mean adding references from other documentation and such). I'll see what I can do over the next few days.

@Arwalk (Contributor, Author) commented Aug 7, 2024

If you make a rough plan, I'll be happy to participate a bit. Thanks for everything.

@Arwalk (Contributor, Author) commented Aug 7, 2024

I thought about it for a little while, and there are not a lot of cases where you need this: only when you're encoding a very long string.

Instead of opening up the internals like this, we could try a few other things (which are not mutually exclusive).

Providing a new writeStringFromReader interface.

This interface expects some kind of iterator object that returns ?[]const u8 when its next method is called, and quietly streams the entire string to the json output. Something along these lines:

pub fn writeStringFromReader(self: *Self, reader: anytype) !void {
    try self.valueStart();
    try self.stream.writeByte('"');

    while (reader.next()) |data| {
        try self.stream.writeAll(data);
    }

    try self.stream.writeByte('"');
    self.valueDone();
}

For my case of base64 encoding, this would make my code as an end user look something like this:

const Base64Provider = struct {
    iterator: std.mem.WindowIterator(u8),

    var temp = [_]u8{0} ** 5;

    pub fn next(self: *Base64Provider) ?[]const u8 {
        if(self.iterator.next()) |chunk| {
            return base64.standard.Encoder.encode(&temp, chunk);
        }
        return null;
    }
};

fn print_bytes(value: anytype, jws: anytype) !void {
    var provider = Base64Provider{
        .iterator = std.mem.window(u8, value.getSlice(), 3, 3)
    };
    try jws.writeStringFromReader(&provider);
}

Providing a new writeStringFromStreamHandler interface.

The interface expects an object with a handle method that takes a stream as parameter.

This approach is related to my proposition for a stream encoder in the base64 stdlib (see #20961 ).

This would be along those lines:

pub fn writeStringFromStreamHandler(self: *Self, something: anytype) !void {
    try self.valueStart();
    try self.stream.writeByte('"');

    try something.handle(self.stream);

    try self.stream.writeByte('"');
    self.valueDone();
}

On the user side, assuming the stream-encoder proposition is accepted, this would look like this for base64 encoding:

const Base64StreamHandler = struct {
    source: []const u8,

    pub fn handle(self: Base64StreamHandler, stream: anytype) !void {
        try base64.standard.Encoder.encodeWriter(stream, self.source);
    }
};

fn print_bytes(value: anytype, jws: anytype) !void {
    try jws.writeStringFromStreamHandler(Base64StreamHandler{.source = value.getSlice()});
}

There are obviously things to work out here, and these could be better approaches than just exposing internals as I originally proposed.

Edit: I think the second approach is much more practical overall.

@thejoshwolfe (Contributor)

If you make a rough plan, I'll be happy to participate a bit. Thanks for everything.

Thanks for the offer! However, I'm not sure that collaboration makes sense for this. I don't want to turn away willing volunteers, but I hope the following explains why I'm not optimistic about the usefulness of collaboration here.

The rough plan is a bit hard to explain, but it's something like this: expose either the existing functions, as in this PR, or add dedicated public entry points that also track a debug-mode-only boolean for whether the stream is in raw streaming write mode. Then either create a method that accepts a byte slice and writes directly to the output stream, or just explain to the caller that they can write directly to the internal field while in the special mode. Then consider whether the method version would be better than the field version, in order to add an assertion on that debug-only boolean, and how many other methods should assert on it to make sure we enter and exit raw write mode correctly before using any other methods. Then document everything and write tests for it all, all the while reconsidering the whole approach and thinking of other potential solutions.

And one more thing to work out is who's responsible for writing the beginning and ending quotes of the string. I'm leaning toward the caller needing to include them in what they write, because that also allows this mechanism to be used to write arbitrarily long numbers (which JSON allows). Then there's the consideration that this mechanism could be used to write arbitrary preformatted JSON, which seems like a good feature rather than a red flag, I think. Whatever the outcome of this paragraph is, it should also be documented.
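The plan above might look roughly like the following sketch. All names here (beginWriteRaw, endWriteRaw, raw_streaming_mode) are hypothetical illustrations of the idea, not a committed API:

```zig
// Sketch only: a debug-only flag guards against mixing raw writes
// with the normal write-stream methods.
raw_streaming_mode: bool = false, // only meaningful when runtime safety is on

pub fn beginWriteRaw(self: *Self) !void {
    try self.valueStart();
    if (std.debug.runtime_safety) self.raw_streaming_mode = true;
    // The caller now writes directly to self.stream, including any
    // surrounding quotes, which also permits arbitrarily long numbers
    // or preformatted JSON.
}

pub fn endWriteRaw(self: *Self) void {
    if (std.debug.runtime_safety) {
        std.debug.assert(self.raw_streaming_mode);
        self.raw_streaming_mode = false;
    }
    self.valueDone();
}

// Every other public method would then start with something like:
// if (std.debug.runtime_safety) std.debug.assert(!self.raw_streaming_mode);
```

In ReleaseFast the flag and assertions compile away, so the cost is confined to safety-checked builds.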

So articulating the process suggests that this is a lot of... design work? It's not just going through a checklist of tasks, which makes it less suitable for collaboration.

I've been working on finally cleaning up my json-diagnostics branch before tackling this PR, so it's taking me more than a few days to get to it, but it is still in my queue.

@thejoshwolfe (Contributor)

we could try a few other things

I'd like to keep the call graph pointing from application code into stdlib code as much as possible, rather than the library calling an application function. I know this already isn't the case for the methods that customize serialization and deserialization, but I'd still prefer to keep that to a minimum (there are already problems with the existing custom functions, such as #16891). One advantage of keeping the application in control of initiating each chunk write, rather than a loop in the stdlib code, is that it's trivial to delay or cancel for whatever reason.

I don't think we need to define any objects or interfaces to make this feature work, and I'd prefer not to. I know that everyone has a different opinion about what seems "simpler" and "easier to maintain", and I certainly have mine: adding a handful of assertions to every public method seems simpler and easier to maintain to me than creating a loop that calls an application function. I can try to explain why, but I'm not going to try to convince anyone to change their opinion. I do believe that my approach results in smaller generated code (in ReleaseFast) and a less tangled call graph (which I don't know how to define rigorously, but I expect there's some way).

@Arwalk (Contributor, Author) commented Aug 14, 2024

I see your points; it's certainly not an easy subject. #16891 is interesting, as we hit this exact case in zig-protobuf too: the specification says that an implementation should offer some options in its encoding methods (such as writing enums with their int value instead of their name) and its decoding methods (such as ignoring unknown keys/fields). Right now we're not sure how to handle this, probably by encapsulating our initial structure inside another structure that carries the original data plus context, and implementing jsonParse/jsonStringify there.
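For illustration, that wrapper idea might look like the following sketch. JsonView and its option field are hypothetical names, and the type-info tag spelling (.Enum vs .@"enum") varies across Zig versions:

```zig
const std = @import("std");

// Hypothetical wrapper pairing a message with encoding options.
fn JsonView(comptime T: type) type {
    return struct {
        value: T,
        enums_as_ints: bool = false,

        pub fn jsonStringify(self: @This(), jws: anytype) !void {
            try jws.beginObject();
            inline for (std.meta.fields(T)) |field| {
                try jws.objectField(field.name);
                const v = @field(self.value, field.name);
                if (@typeInfo(field.type) == .Enum) {
                    // Option: emit the enum's int value instead of its name.
                    if (self.enums_as_ints)
                        try jws.write(@intFromEnum(v))
                    else
                        try jws.write(v);
                } else {
                    try jws.write(v);
                }
            }
            try jws.endObject();
        }
    };
}
```

The caller would then stringify JsonView-wrapped values instead of the raw message, keeping the options alongside the data without touching the message type itself.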

I hear you on the collaboration aspect, and I agree this is certainly a case where a single person on the subject will go faster, but my offer still stands, at least to give you feedback on your work once you have a branch ready. Don't hesitate.

Should we close this PR? And of course, thank you for your time.

@thejoshwolfe (Contributor)

thank you for your time.

No problem! Of course some actual productive time would be better than just sketching ideas in a discussion, but hopefully that will come soon.

@thejoshwolfe (Contributor)

See #21155
