
Deserialize complex dimensions in group by queries to their respective types when reading from spilled files and cached results #16620

Merged
LakshSingla merged 18 commits into apache:master from LakshSingla:spill-deserialize on Jul 15, 2024

Conversation

Contributor

@LakshSingla LakshSingla commented Jun 17, 2024

Description

Like #16511, but for keys that have been spilled or cached during the grouping process. This was not caught in the original PR since there were no tests for it. This PR adds the necessary deserialization, as well as tests that fail unless this patch and #16511 are both applied.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines +1392 to +1395
if (includeTimestamp && dimsReadSoFar == 0) {
// Read the timestamp
objects[dimsReadSoFar] = codec.readValue(jp, Long.class);
} else {
Member

it might be nice to pull this out of the while loop?


          if (includeTimestamp && jp.currentToken() != JsonToken.END_ARRAY) {
            objects[dimsReadSoFar++] = codec.readValue(jp, Long.class);
            jp.nextToken();
          }

          while (jp.currentToken() != JsonToken.END_ARRAY) {
            if (dimsReadSoFar >= dimsToRead) {
              throw DruidException.defensive("More dimensions encountered than expected [%d]", dimsToRead);
            }

            DruidException.conditionalDefensive(
                dimsReadSoFar < dimsToRead,
                "Insufficient serde helpers present"
            );
            // Read the dimension
            if (serdeHelpers[dimsReadSoFar - timestampAdjustment].getComplexClazz() == null) {
              objects[dimsReadSoFar] = codec.readValue(jp, Object.class);
              if (objects[dimsReadSoFar] instanceof Integer) {
                objects[dimsReadSoFar] = ((Integer) objects[dimsReadSoFar]).longValue();
              } else if (objects[dimsReadSoFar] instanceof Double) {
                objects[dimsReadSoFar] = ((Double) objects[dimsReadSoFar]).floatValue();
              }
            } else {
              objects[dimsReadSoFar] =
                  codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getComplexClazz());
            }
            ++dimsReadSoFar;
            jp.nextToken();
          }

Contributor Author

I tried this, but the complexity ends up the same, given that we'd have to duplicate the defensive checks, the loop counter increment, and the JSON stream advance. I have left it as is for now.

Member

i guess my thought was that we would be removing one conditional check from the loop for every column of every row, which could add up, but i haven't really measured it one way or another. This is a pretty hot loop though, so it's worth thinking about making it as efficient as possible.

Contributor Author

I was looking at it from a code-complexity perspective. Your approach is better for efficiency; I'd guess that most of the time branch prediction would make the two roughly equivalent, but I'll make the change to make it explicit.
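The hoisting idea under discussion can be modeled with a small runnable sketch. This is plain Java with an array standing in for the JSON token stream, not the actual Druid loop; the point is only that reading the optional timestamp once, before the loop, removes the per-element branch.

```java
import java.util.Arrays;

public class HoistSketch {
  // tokens stands in for the JSON array stream; includeTimestamp mirrors the
  // flag in the PR. The timestamp is read exactly once, before the loop, so
  // the per-dimension iterations carry no timestamp branch.
  static Object[] readRow(Object[] tokens, boolean includeTimestamp) {
    Object[] out = new Object[tokens.length];
    int i = 0;
    if (includeTimestamp && tokens.length > 0) {
      out[i] = ((Number) tokens[i]).longValue(); // timestamp as long
      i++;
    }
    for (; i < tokens.length; i++) {
      out[i] = tokens[i]; // dimensions: no includeTimestamp check here
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(readRow(new Object[]{1718582400000L, "a", "b"}, true)));
  }
}
```

Whether the saved branch matters in practice would depend on measurement, as the thread notes; branch predictors usually handle a loop-invariant condition well.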

Comment on lines +1403 to +1407
if (objects[dimsReadSoFar] instanceof Integer) {
objects[dimsReadSoFar] = ((Integer) objects[dimsReadSoFar]).longValue();
} else if (objects[dimsReadSoFar] instanceof Double) {
objects[dimsReadSoFar] = ((Double) objects[dimsReadSoFar]).floatValue();
}
Member

could serde helpers just tell this to use Float.class and Long.class instead of generically doing Object.class and then checking it? (e.g. make key serde helpers return non-null classes instead of only complex types being non-null)

Contributor Author commented Jun 24, 2024

I have modified this. I had a doubt about the original flow: in the case of Double dimensions, we would deserialize them as double and then cast them to float, which seems incorrect. The modified PR seems correct, but it differs from the original behavior.
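The reviewer's suggestion could be sketched roughly as follows. The names here (KeySerdeHelper, getClazz, read) are illustrative stand-ins, not Druid's actual interfaces: if every key serde helper reports a concrete target class, the generic Object.class read followed by instanceof coercion is no longer needed.

```java
public class SerdeHelperSketch {
  interface KeySerdeHelper {
    // Never null in this variant: Long.class, Float.class, String.class, or a complex class.
    Class<?> getClazz();
  }

  // Simulates reading one dimension value with a known target class.
  static Object read(Object jsonValue, KeySerdeHelper helper) {
    if (helper.getClazz() == Long.class) {
      return ((Number) jsonValue).longValue();
    } else if (helper.getClazz() == Float.class) {
      return ((Number) jsonValue).floatValue();
    }
    return jsonValue; // strings and complex values pass through unchanged
  }

  public static void main(String[] args) {
    System.out.println(read(123, () -> Long.class));   // arrives as a Long, no instanceof check
    System.out.println(read(1.5d, () -> Float.class)); // arrives as a Float
  }
}
```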

* work correctly when deserialized as generic java objects without type information.
*/
@Nullable
Class<?> getComplexClazz();
Member

i suppose this can just be called getComplexClass (the reason for getClazz is to avoid conflicting with getClass), unless we take the other suggestion and make it non-nullable and use it for all of the JSON serde, in which case it should be getClazz or getJsonClass or something

@LakshSingla changed the title from "Deserialize complex dimensions in group by queries to their respective types when reading from spilled files" to "Deserialize complex dimensions in group by queries to their respective types when reading from spilled files and cached results" on Jun 26, 2024
public abstract TypeReference<ResultType> getResultTypeReference();

/**
*
Member

nit: please add javadocs or remove this (but preferably add them, including what a null return means, like the other method documents).

Also, should this be deprecated?

@Override
public CacheStrategy getCacheStrategy(DataSourceMetadataQuery query, @Nullable ObjectMapper mapper)
{
return null;
Member

nit: if this is returning null is the override really needed?

Contributor Author

added for clarity, will remove it.

Member

ah, maybe it's safer to leave it since the default is to not cache, I guess? I don't feel super strongly either way

}
classes[i] = nullableTypeStrategy.getClazz();
} else {
classes[i] = Object.class;
Member

any reason not to use the real class here?

}

final ObjectMapper newObjectMapper = spillMapper.copy();
newObjectMapper.registerModule(new SpillModule());
Member

should this be a static somewhere so we don't have to make a new one all the time?

Contributor Author

The new deserialization method is tied to a particular query; that is why we need to create one for each query.

Member

oh right, i can't read 😜 , sorry

@Override
public Class<?> getClazz()
{
return Object.class;
Member

same thing here: Object[].class?

@Override
public Class<?> getClazz()
{
return Object.class;
Member

String.class?

Contributor Author

Similar explanation as for why I left it as Object.class: to preserve existing behavior rather than being clearer and using String.class. I doubt we'd run into regressions with the latter, but I was being overly cautious.
wdyt?

Comment on lines +1670 to +1673
if (columnType.is(ValueType.COMPLEX)) {
clazz = columnType.getNullableStrategy().getClazz();
} else {
clazz = Object.class;
Member

why not always use the column type?
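The "always use the column type" suggestion might look like this. ValueType here is a stand-in enum for illustration, not Druid's actual type; the mapping shows the shape of the idea, with Object.class kept only as the complex/fallback case.

```java
public class TypeClassSketch {
  // Stand-in for Druid's ValueType; illustrative only.
  enum ValueType { LONG, FLOAT, DOUBLE, STRING, COMPLEX }

  // Derive the JSON target class from the column type, instead of
  // returning Object.class for everything that is not complex.
  static Class<?> clazzFor(ValueType type) {
    switch (type) {
      case LONG:   return Long.class;
      case FLOAT:  return Float.class;
      case DOUBLE: return Double.class;
      case STRING: return String.class;
      default:     return Object.class; // complex types supply their own class via their strategy
    }
  }

  public static void main(String[] args) {
    System.out.println(clazzFor(ValueType.DOUBLE)); // class java.lang.Double
  }
}
```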

{
NullHandling.initializeForTests();
//noinspection ResultOfObjectAllocationIgnored
new AggregatorsModule();
Member

hmm, we should move the complex serde registrations of that module into a static method that it calls, instead of doing this, so it's more obvious what is going on here

Contributor Author

cool, I'll do that. It's weird that we need to call this in the first place; I would have expected it to be registered already. I'll make it more explicit and add a comment.

Member

nit: forget about this?

Contributor Author

oops, yeah. Will fix it up

DimensionHandlerUtils.convertObjectToType(results.next(), dimensionSpec.getOutputType())
);

if (dimensionSpec.getOutputType().is(ValueType.COMPLEX)) {
Member

i kind of wonder if we should special-handle JSON (COMPLEX&lt;json&gt;) here, since it is kind of a special complex type (that really should probably be a built-in standard type instead of complex), because convertValue looks pretty expensive.

Contributor Author

> because convertValue looks pretty expensive.

That sounds reasonable. Maybe as an alternative, what if we return Object.class instead of StructuredValue.class for jsonStrategy.getClazz()? I assume convertValue would short-circuit in that case, converting an object to an object. If so, we'd get away without special handling (perhaps).

);

// Read the dimension
serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz();

Check failure (Code scanning / CodeQL): Array index out of bounds. This array access might be out of bounds, as the index might be equal to the array length.
// Read the dimension
serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz();
objects[dimsReadSoFar] =
codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz());

Check failure (Code scanning / CodeQL): Array index out of bounds. This array access might be out of bounds, as the index might be equal to the array length.
public <T> CacheStrategy<ResultType, T, QueryType> getCacheStrategy(QueryType query, @Nullable ObjectMapper mapper)
{
-  return null;
+  return getCacheStrategy(query);

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation. Invoking [QueryToolChest.getCacheStrategy](1) should be avoided because it has been deprecated.
Contributor Author

Not relevant, the deprecated method is used to maintain backward compatibility.

Comment on lines +733 to +736
DruidException.conditionalDefensive(
mapper != null,
"Cannot deserialize complex dimension if object mapper is not provided"
);
Member

nit: i wonder if this should be outside of the loop, or at least detected before we get here, maybe while doing createDimensionClasses or something

Contributor Author

Makes sense, I'll move the checks there.
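Moving the check to construction time could look roughly like this. createDimensionClasses is modeled on the method name mentioned in the thread, but the shape here is an illustrative sketch, not the actual Druid code: the mapper is validated once, while building per-dimension classes, instead of on every spilled row in the hot loop.

```java
import java.util.Objects;

public class PrecheckSketch {
  // Validate the mapper once, up front; the per-row loop then assumes it is present.
  static Class<?>[] createDimensionClasses(Object mapper, boolean hasComplexDims) {
    if (hasComplexDims) {
      // Fail fast here, outside the hot per-row deserialization loop.
      Objects.requireNonNull(mapper, "Cannot deserialize complex dimensions without an object mapper");
    }
    return hasComplexDims ? new Class<?>[]{Object.class} : new Class<?>[0];
  }

  public static void main(String[] args) {
    System.out.println(createDimensionClasses(new Object(), true).length); // 1
  }
}
```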


@Override
public Class<?> getClazz()
{
return Object.class;
Member

i guess i was wondering if maybe it could help get rid of some of the coerce-list-to-array stuff, but totally fine to explore this later



while (jp.currentToken() != JsonToken.END_ARRAY) {

DruidException.conditionalDefensive(
Contributor

Can we remove these checks, given that they run per row? Is it worth having this overhead in the hot path?

@LakshSingla LakshSingla merged commit 209f8a9 into apache:master Jul 15, 2024
@LakshSingla LakshSingla deleted the spill-deserialize branch July 15, 2024 09:30
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
…e types when reading from spilled files and cached results (apache#16620)

Like apache#16511, but for keys that have been spilled or cached during the grouping process
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

5 participants