document SQL compatible null handling mode#8894
Conversation
|
This should not be merged until #8876 is merged |
…atible-null-handling
| |`druid.indexing.doubleStorage`|Set to "float" to use 32-bit double representation for double columns.|double| | ||
|
|
||
| ### SQL compatible null handling | ||
| Also prior to version 0.13.0, Druid string columns treated `''` and `null` values as interchangeable, and numeric columns were unable to represent `null` values, coercing `null` to `0`. Druid 0.13.0 introduced an at the time undocumented mode which enabled SQL compatible null handling, allowing string columns to distinguish empty strings from nulls, and numeric columns to contain null rows. |
There was a problem hiding this comment.
Nix the "also", people usually aren't reading these straight through, so it's best for each section to stand alone.
There was a problem hiding this comment.
Nix "at the time undocumented"; no need to mention that in the current notes.
|
|
||
| |Property|Description|Default| | ||
| |---|---|---| | ||
| |`druid.generic.useDefaultValueForNull`|When set to `true`, `null` values will be stored as `''` for string columns and `0` for numeric columns. Set to `false` to store and query segments in SQL compatible mode.|`true`| |
There was a problem hiding this comment.
"data" not "segments" would be clearer IMO.
| bitmaps. | ||
|
|
||
| ## SQL Compatible Null Handling | ||
| By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`. |
There was a problem hiding this comment.
I'd go with "a SQL" not "an SQL".
https://oracle-base.com/blog/2015/01/02/a-sql-or-an-sql/ presents some arguments for both sides, but the "an SQL" side is clearly misguided.
| ## SQL Compatible Null Handling | ||
| By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`. | ||
|
|
||
| String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional `bitmap` whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds _per row_ scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case. |
There was a problem hiding this comment.
No reason to put bitmap in code style.
There was a problem hiding this comment.
oops, no idea why I did that, fixed
| ## SQL Compatible Null Handling | ||
| By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`. | ||
|
|
||
| String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional `bitmap` whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds _per row_ scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case. |
There was a problem hiding this comment.
I think we need to dial down this block a bit:
In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds per row scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case.
The reasons being: (1) we will eventually be wanting to make this mode default; (2) 10–20ns assumes certain things about amount of nulls per column, we might do further optimizations, etc. So I would skip the specific number and some of the warnings.
How about:
In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time as well, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain nulls.
There was a problem hiding this comment.
Changed to suggestion
| NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid | ||
| datasource, then it will be treated as zero for rows from those segments. | ||
|
|
||
| ### SQL compatible null handling |
There was a problem hiding this comment.
The header is weird here. It includes a bunch of stuff that isn't null handling related. I think it would be better to merge this stuff into the prior paragraph about null handling. I think we should also point out that the SQL optimizer works best when SQL compliant null handling is enabled, so if you're doing a lot of SQL with NULLs, it's best to enable the mode.
There was a problem hiding this comment.
combined sections and mentioned optimizer
* document SQL compatible null handling mode * adjustments * fix docs * review changes

Description
This PR does what the title says, and adds documentation for the SQL compatible null handling mode added from #4349, and advising on the potential performance implications using the data collected in #8822.
This PR has: