
document SQL compatible null handling mode #8894

Merged
gianm merged 5 commits into apache:master from clintropolis:document-sql-compatible-null-handling
Nov 20, 2019

@clintropolis
Member

Description

This PR does what the title says: it adds documentation for the SQL compatible null handling mode added in #4349, and advises on the potential performance implications using the data collected in #8822.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.

@clintropolis
Member Author

This should not be merged until #8876 is merged

@clintropolis clintropolis removed the WIP label Nov 19, 2019
Comment thread docs/configuration/index.md Outdated
|`druid.indexing.doubleStorage`|Set to "float" to use 32-bit double representation for double columns.|double|

### SQL compatible null handling
Also prior to version 0.13.0, Druid string columns treated `''` and `null` values as interchangeable, and numeric columns were unable to represent `null` values, coercing `null` to `0`. Druid 0.13.0 introduced an at the time undocumented mode which enabled SQL compatible null handling, allowing string columns to distinguish empty strings from nulls, and numeric columns to contain null rows.
Contributor

Nix the "also", people usually aren't reading these straight through, so it's best for each section to stand alone.

Contributor

Nix "at the time undocumented"; no need to mention that in the current notes.

Member Author

done

Comment thread docs/configuration/index.md Outdated

|Property|Description|Default|
|---|---|---|
|`druid.generic.useDefaultValueForNull`|When set to `true`, `null` values will be stored as `''` for string columns and `0` for numeric columns. Set to `false` to store and query segments in SQL compatible mode.|`true`|
Contributor

"data" not "segments" would be clearer IMO.

Member Author

changed
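
For readers skimming this thread, the behavior described in the `druid.generic.useDefaultValueForNull` table row can be sketched roughly as below. This is only an illustrative Python model of the documented semantics, not Druid code: the function name `stored_value`, the `column_type` strings, and the `use_default_value_for_null` parameter are all hypothetical names chosen for the example.

```python
from typing import Union

Value = Union[str, int, float, None]


def stored_value(
    value: Value,
    column_type: str,
    use_default_value_for_null: bool = True,
) -> Value:
    """Hypothetical model of what is stored for a single input value.

    Mirrors the table row above: with the property set to true (the
    default), nulls become '' for string columns and 0 for numeric
    columns; with it set to false, nulls are kept as nulls.
    """
    if value is not None:
        # Non-null values are stored unchanged in either mode.
        return value
    if not use_default_value_for_null:
        # SQL compatible mode: the null is preserved.
        return None
    # Default mode: coerce null to the type's default value.
    return "" if column_type == "string" else 0
```

In this model, `stored_value(None, "long")` yields `0`, while `stored_value(None, "long", use_default_value_for_null=False)` yields `None`, so only the second mode can tell a missing number apart from a real zero.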

Comment thread docs/design/segments.md Outdated
bitmaps.

## SQL Compatible Null Handling
By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`.
Contributor

I'd go with "a SQL" not "an SQL".

https://oracle-base.com/blog/2015/01/02/a-sql-or-an-sql/ presents some arguments for both sides, but the "an SQL" side is clearly misguided.

Member Author

but fine I will change all occurrences that I added 😛

Comment thread docs/design/segments.md Outdated
## SQL Compatible Null Handling
By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`.

String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional `bitmap` whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds _per row_ scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case.
Contributor

No reason to put bitmap in code style.

Member Author

oops, no idea why I did that, fixed

Comment thread docs/design/segments.md Outdated
## SQL Compatible Null Handling
By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides an SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`.

String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional `bitmap` whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds _per row_ scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case.
Contributor

I think we need to dial down this block a bit:

> In addition to slightly increased segment sizes, this also means that SQL compatible null handling comes at a query time cost for numeric columns too, which must now check whether or not the row is null valued during selection and aggregation. This overhead has been calculated to be approximately 10-20 nanoseconds per row scanned in each query, so it is worth considering if the expressivity is worth the performance hit for your individual use case.

The reasons being: (1) we will eventually be wanting to make this mode default; (2) 10–20ns assumes certain things about amount of nulls per column, we might do further optimizations, etc. So I would skip the specific number and some of the warnings.

How about:

> In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time as well, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain nulls.

Member Author

Changed to suggestion
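
To make the storage description in this thread concrete, here is a toy Python sketch of the two structures it mentions: an extra dictionary entry for `null` in string columns, and a null bitmap alongside numeric columns. It is a simplified model under stated assumptions, not Druid's actual segment format: `build_string_dictionary`, `NumericColumn`, and `sum_non_null` are hypothetical names, and the bitmap is modeled as a plain Python set of row indexes.

```python
from typing import List, Optional, Set


def build_string_dictionary(rows: List[Optional[str]]) -> List[Optional[str]]:
    """Sorted distinct values, with one extra leading entry reserved for
    null when any row is null, so that '' and null stay distinct."""
    distinct = sorted({v for v in rows if v is not None})
    has_null = any(v is None for v in rows)
    return ([None] if has_null else []) + distinct


class NumericColumn:
    """Toy numeric column: dense values plus a 'bitmap' of null rows."""

    def __init__(self, rows: List[Optional[int]]) -> None:
        # Null rows still occupy a slot in the value array (filled with 0).
        self.values: List[int] = [0 if v is None else v for v in rows]
        # Set bits (here: set members) mark the rows whose value is null.
        self.null_rows: Set[int] = {i for i, v in enumerate(rows) if v is None}

    def sum_non_null(self) -> Optional[int]:
        if not self.null_rows:
            # No nulls in this column: no per-row bitmap check is needed.
            return sum(self.values)
        total: Optional[int] = None
        for i, v in enumerate(self.values):
            if i in self.null_rows:
                # The per-row null check is the query-time cost discussed above.
                continue
            total = v if total is None else total + v
        return total
```

The early return in `sum_non_null` reflects the point made in the suggested wording: in this model the extra check is only paid for columns that actually contain nulls.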

Comment thread docs/querying/sql.md Outdated
NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid
datasource, then it will be treated as zero for rows from those segments.

### SQL compatible null handling
Contributor

The header is weird here. It includes a bunch of stuff that isn't null handling related. I think it would be better to merge this stuff into the prior paragraph about null handling. I think we should also point out that the SQL optimizer works best when SQL compliant null handling is enabled, so if you're doing a lot of SQL with NULLs, it's best to enable the mode.

Member Author

combined sections and mentioned optimizer

@gianm gianm merged commit d67c3c7 into apache:master Nov 20, 2019
@clintropolis clintropolis deleted the document-sql-compatible-null-handling branch November 20, 2019 19:16
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
* document SQL compatible null handling mode

* adjustments

* fix docs

* review changes
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019