Skip to content

Numeric array support for columnar frames#15917

Merged
LakshSingla merged 12 commits intoapache:masterfrom
LakshSingla:columnar-frames-array
Feb 21, 2024
Merged

Numeric array support for columnar frames#15917
LakshSingla merged 12 commits intoapache:masterfrom
LakshSingla:columnar-frames-array

Conversation

@LakshSingla
Copy link
Copy Markdown
Contributor

Description

This PR adds support for numeric arrays for columnar frame types. Columnar frame types are used in window functions processing and materializing subquery results, therefore it fixes up a major hole in the current capabilities of those.

Layout of the column

For storing a column of n rows, it is laid out in the following fashion. Assume that the total number of the elements (i.e. individual values in the array) in all the rows combined is k:

Section Length of the section Value
0 Type info, representing the column type
1 n * Integer.BYTES n integers, where i-th integer represents cumulative row length
2 k * Byte.BYTES k bytes, where i-th byte represent whether the k-th value (from the start is null)
3 k * ELEMENT_SIZE k values, each representing the element, or null equivalent value (e.g 0 for double)

Release note

Columnar frames used in subquery materialization and window functions now support numeric arrays.


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@LakshSingla LakshSingla changed the title Columnar frames array Numeric array support for columnar frames Feb 16, 2024

for (int i = 0; i < rowLength; ++i) {
final Number element = numericArray.get(i);
final long memoryOffset = rowDataCursor.start() + ((long) elementSizeBytes() * i);

Check warning

Code scanning / CodeQL

Dereferenced variable may be null

Variable [rowDataCursor](1) may be null at this access because of [this](2) assignment.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a false warning, since rowDataCursor = null iff rowLength = 0, and in which case, we won't be accessing the variable.

final Number element = numericArray.get(i);
final long memoryOffset = rowDataCursor.start() + ((long) elementSizeBytes() * i);
if (element == null) {
rowNullityDataCursor.memory()

Check warning

Code scanning / CodeQL

Dereferenced variable may be null

Variable [rowNullityDataCursor](1) may be null at this access because of [this](2) assignment.
.putByte(rowNullityDataCursor.start() + (long) Byte.BYTES * i, NULL_ELEMENT_MARKER);
putNull(rowDataCursor.memory(), memoryOffset);
} else {
rowNullityDataCursor.memory()

Check warning

Code scanning / CodeQL

Dereferenced variable may be null

Variable [rowNullityDataCursor](1) may be null at this access because of [this](2) assignment.
(long) Byte.BYTES * FrameColumnReaderUtils.getAdjustedCumulativeRowLength(
memory,
getStartOfCumulativeLengthSection(),
numRows - 1

Check failure

Code scanning / CodeQL

User-controlled data in arithmetic expression

This arithmetic expression depends on a [user-provided value](1), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](2), potentially causing an underflow.
Copy link
Copy Markdown
Contributor

@adarshsanjeev adarshsanjeev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall!


if (multiValue) {
totalNumValues = adjustCumulativeRowLength(getCumulativeRowLength(memory, numRows - 1));
totalNumValues = FrameColumnReaderUtils.adjustCumulativeRowLength(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be changed to getAdjustedCumulativeRowLength instead?

rowLength = cumulativeRowLength;
} else {
rowLength = cumulativeRowLength - adjustCumulativeRowLength(getCumulativeRowLength(memory, physicalRow - 1));
rowLength = cumulativeRowLength - FrameColumnReaderUtils.adjustCumulativeRowLength(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be getAdjustedRowLength since the function exists?

import org.apache.druid.segment.column.ColumnType;

/**
* Reaaers for columns written by {@link org.apache.druid.frame.write.columnar.LongArrayFrameColumnWriter}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spelling

*
* Note on cumulative lengths stored in section 1: Cumulative lengths are stored so that its fast to offset into the
* elements of the array. We also use negative cumulative length to denote that the array itself is null (as opposed to
* individual elements being null, which we store in section 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed doc! This made it a lot easier to follow

totalNumValues = FrameColumnReaderUtils.getAdjustedCumulativeRowLength(
memory,
getStartOfCumulativeLengthSection(),
numRows - 1

Check failure

Code scanning / CodeQL

User-controlled data in arithmetic expression

This arithmetic expression depends on a [user-provided value](1), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](2), potentially causing an underflow.
@LakshSingla
Copy link
Copy Markdown
Contributor Author

Thanks for the review @adarshsanjeev. I have confirmed that the CodeQL warnings are not correct (with explanation)

@LakshSingla LakshSingla merged commit a1b2c73 into apache:master Feb 21, 2024
@LakshSingla LakshSingla deleted the columnar-frames-array branch February 21, 2024 06:02
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
gianm added a commit that referenced this pull request Sep 6, 2024
This patch adds "TypeCastSelectors", which is used when writing frames to
perform two coercions:

- When a numeric type is desired and the underlying type is non-numeric or
  unknown, the underlying selector is wrapped, "getObject" is called and the
  result is coerced using "ExprEval.ofType". This differs from the prior
  behavior where the primitive methods like "getLong", "getDouble", etc, would
  be called directly. This fixes an issue where a column would be read as
  all-zeroes when its SQL type is numeric and its physical type is string, which
  can happen when evolving a column's type from string to number.

-  When an array type is desired, the underlying selector is wrapped,
   "getObject" is called, and the result is coerced to Object[]. This coercion
   replaces some earlier logic from #15917.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants