Skip to content

Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON#13089

Merged
jon-wei merged 3 commits intoapache:masterfrom
jon-wei:json_split
Sep 27, 2022
Merged

Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON#13089
jon-wei merged 3 commits intoapache:masterfrom
jon-wei:json_split

Conversation

@jon-wei
Copy link
Copy Markdown
Contributor

@jon-wei jon-wei commented Sep 14, 2022

When JsonInputFormat is used for streaming ingestion, if the input string contains multiple JSON events, but there is a parse exception occurs, all events in the record will be discarded, even if some events were valid (see #10383)

This PR adds the following:

  • A assumedNewlineDelimited option for JsonInputFormat that can be used when the input is known to be newline-delimited JSON. This will force the input format to create a JsonLineReader, which can parse lines independently, so that a parse exception on one line will not prevent other valid lines from being ingested.
  • A useJsonNodeReader option for JsonInputFormat. If true, instead of creating a JsonReader, the input format will create a new JsonNodeReader for parsing multi-line JSON. This new parser splits an input string into JsonNode objects and parses them one by one into InputRow. This allows valid events found prior to a parse exception to be ingested. Potentially valid events after invalid JSON syntax is encountered will still be ignored. This also prevents valid JSON events with an unparseable timestamp from causing other events in the same input string to be discarded.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jon-wei jon-wei changed the title Add JsonInputFormat option to assume newline delimited JSON, improve handling for non-NDJSON Add JsonInputFormat option to assume newline delimited JSON, improve parse exception handling for multiline JSON Sep 14, 2022
@FrankChen021
Copy link
Copy Markdown
Member

I think the new option assumedNewlineDelimited on JsonFormat is a good design.
I'm wondering if we can use this option to replace the internal lineSplitter property?

@vogievetsky
Copy link
Copy Markdown
Contributor

Does this need to be documented and/or surfaced in the console?

@jon-wei
Copy link
Copy Markdown
Contributor Author

jon-wei commented Sep 15, 2022

@FrankChen021 Hm, I'm inclined to keep the internal lineSplittable property for now, since we don't support multi-line JSON for batch ingestion, I feel like we'd still need some kind of thing to force batch ingestion to use the JsonLineReader (like if the user were to specify assumeNewlineDelimited = false in a batch job)

@vogievetsky I added some docs for the new properties

* 1. a JSON string of an object in a line or multiple lines(such as pretty-printed JSON text)
* 2. multiple JSON object strings concated by white space character(s)
* <p>
* If an input string contains invalid JSON syntax, any valid JSON objects found prior to encountering the invalid
Copy link
Copy Markdown
Contributor

@YongGang YongGang Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder whether we can distinguish the failure is due to JSON syntax error or data field in wrong format (e.g. wrong Timestamp format).
If it's due to field in wrong format then we can continue to parse the rest.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timestamp in wrong format errors are special, in that Druid requires a valid __time field. For other columns, such parse errors would be detected at a later stage, and wouldn't run into the same issue of preventing other events from being ingested

The JSON syntax errors here are errors where malformed JSON syntax prevents the event from being deserialized at all into a map.

@zachjsh zachjsh self-requested a review September 22, 2022 15:49
}
this.lineSplittable = lineSplittable;
this.assumeNewlineDelimited = assumeNewlineDelimited != null && assumeNewlineDelimited;
this.useJsonNodeReader = useJsonNodeReader != null && useJsonNodeReader;
Copy link
Copy Markdown
Contributor

@zachjsh zachjsh Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description for this property above seems to indicate that it should be used for non-newline delimited JSON. Would assumeNewLineDelimted true interfere with this / should this be forced to false if assumeNewLineDelimted is true?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like maybe it would be an error to indicate useJsonNodeReader if assumeNewLineDelimted is set, though ignoring it seems ok too 🤷

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that makes sense, I'll have it be an error

return this.lineSplittable ?
new JsonLineReader(inputRowSchema, source, getFlattenSpec(), objectMapper, keepNullColumns) :
new JsonReader(inputRowSchema, source, getFlattenSpec(), objectMapper, keepNullColumns);
if (this.lineSplittable || this.assumeNewlineDelimited) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would lineSplitable be false, but the the data is actually newLineDelimited? Does this situation arise when you have a batch of new line delimited json events being ingested at one time, for example?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lineSplittable is used as an indicator from the task implementation as to whether it expects the input to be newline delimited (and thus use either JsonLineReader or JsonReader). Prior to this patch, batch tasks would always use newline delimited (JsonLineReader) and streaming tasks would always use the multiline parser (JsonReader).

With this patch, batch tasks would still always use newline delimited, while for streaming tasks the behavior is configurable (controlled by the new asssumeNewlineDelimited property)

}
this.lineSplittable = lineSplittable;
this.assumeNewlineDelimited = assumeNewlineDelimited != null && assumeNewlineDelimited;
this.useJsonNodeReader = useJsonNodeReader != null && useJsonNodeReader;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like maybe it would be an error to indicate useJsonNodeReader if assumeNewLineDelimted is set, though ignoring it seems ok too 🤷

Copy link
Copy Markdown
Contributor

@zachjsh zachjsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants