Support to parse numbers in text-based input formats #17082
Merged
abhishekrb19 merged 25 commits into apache:master on Sep 19, 2024
…arsers. This helps samplers detect numeric types for text-based formats like csv and tsv. By default, these text-based formats parse numbers as strings. This change adds a config flag to optionally parse numbers as numbers: Long for integers and Double for floating-point numbers. A value that cannot be parsed as a number falls back to a string. The web-console has some code in the load data flow that parses the sample of data returned by the Druid sampler to further inspect types, so it can convert them to specific numeric types where applicable. After this change, the web-console sampler and other applications can rely on Druid to do it.
    @Nullable
    private static Object tryParseStringAsNumber(@Nullable final String input)
    {
      if (!NumberUtils.isNumber(input)) {
Member
I wonder if it is worth looping over the string an extra time before we do the parse attempts, or if we should just start by trying to parse it as a long. I guess having this function call saves the double tryParse, which uses a regex pattern.
Contributor
Author
Yeah, I considered something like that. However, it adds the regex overhead you mention for string inputs, so I kept the current approach, which is optimized for non-numeric strings.
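The trade-off discussed in this thread can be illustrated with a minimal, standard-library-only sketch. It is not the PR's implementation: the class name `NumberParsingSketch` and the helper `looksNumeric` are hypothetical, and `looksNumeric` is a simplified stand-in for `NumberUtils.isNumber` that accepts only plain decimal forms (no exponents or hex). It shows the idea of a cheap single-pass check that rejects non-numeric strings before any parse attempt, followed by a long attempt, then a double attempt, then a fallback to the original string.

```java
final class NumberParsingSketch {

    // Hypothetical stand-in for NumberUtils.isNumber: one pass over the string,
    // accepting an optional sign, digits, and at most one decimal point.
    private static boolean looksNumeric(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int i = (s.charAt(0) == '-' || s.charAt(0) == '+') ? 1 : 0;
        boolean sawDigit = false;
        boolean sawDot = false;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                sawDigit = true;
            } else if (c == '.' && !sawDot) {
                sawDot = true;
            } else {
                return false;
            }
        }
        return sawDigit;
    }

    // Returns a Long for integers, a Double for other numeric input,
    // and the original value (including null) when it is not numeric.
    static Object tryParseStringAsNumber(String input) {
        if (!looksNumeric(input)) {
            return input; // fast path for non-numeric strings
        }
        try {
            return Long.parseLong(input);
        } catch (NumberFormatException e) {
            // Has a fractional part or does not fit in a long; try double next.
        }
        try {
            return Double.parseDouble(input);
        } catch (NumberFormatException e) {
            return input;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "abc", "-7"}) {
            Object parsed = tryParseStringAsNumber(s);
            System.out.println(s + " -> " + parsed.getClass().getSimpleName());
        }
    }
}
```

With this shape, a string such as `abc` costs one short scan and no exception or regex work, which is the case the author says the current approach is optimized for.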
Conflict in sql/src/test/java/org/apache/druid/sql/calcite/IngestTableFunctionTest.java
clintropolis approved these changes on Sep 19, 2024
Text-based input formats like `csv` and `tsv` currently parse inputs only as strings, following the `RFC4180Parser` spec. To work around this, the web-console and other tools need to further inspect the sample data returned by the Druid sampler API to parse them as numbers. See here for the relevant web-console code.

Changes:
- Add `tryParseNumbers` for the `csv` and `tsv` input formats.

Key classes to review:
- `ParserUtils`
- `CsvInputFormat`
- `CsvParser`
- `DelimitedInputFormat`
- `DelimitedValueReader`

Release note:
Introduce a new optional config, `tryParseNumbers`, for the `csv` and `tsv` input formats. If enabled, any numbers present in the input will be parsed as follows: `long` data type for integer types and `double` for floating-point numbers. By default, this configuration is set to false, so numeric strings will be treated as strings.
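
As an illustration of the release note, an input format spec that enables the flag might look like the fragment below. This is a sketch under assumptions: it assumes `tryParseNumbers` sits at the top level of the `csv` input format object alongside existing fields such as `findColumnsFromHeader`. Check the Druid documentation for the exact placement.

```json
{
  "type": "csv",
  "findColumnsFromHeader": true,
  "tryParseNumbers": true
}
```

With the flag omitted or set to false, a value such as `42` in a CSV row would still be read as the string `"42"`.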