Web console: use arrayIngestMode: array#15588
Conversation
There was a problem hiding this comment.
I think it is good to get the web-console to always be using "arrayIngestMode": "array" 🤘. I do wonder though if we should perhaps have a stronger callout whenever we detect arrays to try to make sure if users want arrays or if they are expecting classic mvds - and push to using ARRAY_TO_MV transforms to get mvds to make it explicit (as least for MSQ). I'm not completely sure how to do this, maybe just a thingy on side-bar that links to https://druid.apache.org/docs/latest/querying/multi-value-dimensions/ and https://druid.apache.org/docs/latest/querying/arrays/
I also have some thoughts about the current detection logic; for example mixed types that have arrays and scalars end up as STRING in the current version of this PR, but if there are other arrays with a single type, you end up with a strange mix of arrays and mvds. I think we probably don't want to encourage having a mix of arrays and mvds in the same dataset.
For example:

in this data there are a bunch of arrays of various types, but arrayVariant ends up as 'string'. The sampler response logical type shows as ARRAY, which follows the druid 'least restrictive type' logic that auto columns follow when mixed types:

It is pretty nice overall though, and in some ways, more powerful than what happens on druid ingest side since it doesn't rely on java types for example, so can even find types in csv, so don't view this as a suggestion to drop it completely.
I'm not entirely sure what is best to do here, my gut says that users should probably use all arrays or no arrays, but idk if that is just my bias against mvds showing.
All this said, I do think this is pushing in the right direction 🚀
| name: 'castToType', | ||
| type: 'string', | ||
| defined: typeIsKnown(KNOWN_TYPES, 'auto'), | ||
| suggestions: [undefined, 'ARRAY<STRING>', 'ARRAY<LONG>', 'ARRAY<DOUBLE>'], |
There was a problem hiding this comment.
this works with any native druid type built-in type STRING, LONG, DOUBLE, FLOAT, ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE> (ARRAY<FLOAT> works too but is just mapped to ARRAY<DOUBLE>, which is also true for FLOAT which gets mapped to DOUBLE when using 'auto').
It isn't completely crazy to use 'auto' with 'castToType' for all column types really though i'm not entirely sure we want to make that jump yet.. but for example if you want indexes for numerical columns you should use this instead of the classic dimension schemas, the only exception is multi-value strings which do not exist in 'auto' (and must be ingested as ARRAY instead).
There was a problem hiding this comment.
When you say that ARRAY<FLOAT> works too but is just mapped to ARRAY<DOUBLE> are you advocating that we add it to the suggestions? The suggestion are not meant to be everything that works, only everything that is somehow useful. Remember that a standard dimension spec type can be flooooot (it gets mapped to string) but we don't want to suggest that.
Why would we not use 'auto' with 'castToType' everywhere 🤔 ? I really like it because it let's us get away from dimension types (which have a confusing relationship with column types).
There was a problem hiding this comment.
When you say that ARRAY works too but is just mapped to ARRAY are you advocating that we add it to the suggestions?
Nah, I don't think it is necessary since they will be handled the same as ARRAY<DOUBLE> when used by 'auto', was just mentioning all of the valid strings (there are actually a few extras for backwards compatibility https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/column/Types.java#L37)
Remember that a standard dimension spec type can be flooooot (it gets mapped to string) but we don't want to suggest that.
Yea, that is true, though this is a bit different because castToType only supports real types else if it cant parse them it falls back to full 'auto' behavior (though I suppose this also could have been an error to not repeat the silent different behavior of the dimension schema types themselves).
Why would we not use 'auto' with 'castToType' everywhere 🤔 ? I really like it because it let's us get away from dimension types (which have a confusing relationship with column types).
I think eventually we do want to use them everywhere, I just wasn't too sure if now is the right time quite yet. There are a few reasons you might not want to use it today. First is of course if you want MVDs, since the STRING column produced by 'auto' will never make an MVD. The second would be I think if you want numeric columns but do not want the indexes that all 'auto' numeric columns have today, since this can make the segments larger, though other than disk space there probably isn't much of a penalty for this if not actually filtering the numbers, leaving the main cost just longer ingest time to build the indexes. I also plan to eventually allow 'auto' and nested columns to be customized to allow things like leaving out the indexes similar to what we can do today with string columns (and eventually, specifying alternative indexes once I implement them). Given that the second thing isn't much of an issue, I don't really think it would be a problem to switch native to always using 'auto' with 'castToType' for all cases except for MVDs.
| if (timeExpression && timeExpression.containsColumnName(columnName)) return; | ||
| return C(columnName).applyIf(isArrays[i], ex => F('MV_TO_ARRAY', ex).as(columnName) as any); | ||
| return C(columnName); | ||
| // return C(columnName).applyIf(isArrays[i], ex => F('MV_TO_ARRAY', ex).as(columnName) as any); |
There was a problem hiding this comment.
nit: any reason to leave this commented section?
clintropolis
left a comment
There was a problem hiding this comment.
this is pretty sweet 🤘
should we add a similar warning (but maybe not a toggle) to MSQ ingestion to indicate that arrays were detected? We could potentially advise to use the ARRAY_TO_MV function if you want mvds, and link to mvd docs https://druid.apache.org/docs/latest/querying/multi-value-dimensions/
| const KNOWN_TYPES = ['string', 'long', 'float', 'double', 'json']; | ||
| // This is a web console internal made up column type that represents a multi value dimension | ||
| const MADE_UP_MV_COLUMN_TYPE = 'mv-string'; | ||
| function madeMadeUpMvDimensionSpec(name: string): DimensionSpec { |
There was a problem hiding this comment.
nit: should this be makeMadeUpMvDimensionSpec?
|
my bad, totally missed the toggle for MSQ loader, lgtm 👍 |
* Adapt to new array mode * Feedback fixes * fixing type detection and highlighting * goodies * add docs * feedback fixes * finish array work * update snapshots * typo fix * color fixes * small fix * make MVDs default for now * better sqlStringifyArrays default * fix spec converter * fix tests
* Adapt to new array mode * Feedback fixes * fixing type detection and highlighting * goodies * add docs * feedback fixes * finish array work * update snapshots * typo fix * color fixes * small fix * make MVDs default for now * better sqlStringifyArrays default * fix spec converter * fix tests Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
This PR adds support for array types for all the ingestion wizards.
This PR also adds null mode setting detection