Deserialize complex dimensions in group by queries to their respective types when reading from spilled files and cached results #16620
Conversation
```java
if (includeTimestamp && dimsReadSoFar == 0) {
  // Read the timestamp
  objects[dimsReadSoFar] = codec.readValue(jp, Long.class);
} else {
```
it might be nice to pull this out of the while loop?
```java
if (includeTimestamp && jp.currentToken() != JsonToken.END_ARRAY) {
  objects[dimsReadSoFar++] = codec.readValue(jp, Long.class);
  jp.nextToken();
}
while (jp.currentToken() != JsonToken.END_ARRAY) {
  if (dimsReadSoFar >= dimsToRead) {
    throw DruidException.defensive("More dimensions encountered than expected [%d]", dimsToRead);
  }
  DruidException.conditionalDefensive(
      dimsReadSoFar < dimsToRead,
      "Insufficient serde helpers present"
  );
  // Read the dimension
  if (serdeHelpers[dimsReadSoFar - timestampAdjustment].getComplexClazz() == null) {
    objects[dimsReadSoFar] = codec.readValue(jp, Object.class);
    if (objects[dimsReadSoFar] instanceof Integer) {
      objects[dimsReadSoFar] = ((Integer) objects[dimsReadSoFar]).longValue();
    } else if (objects[dimsReadSoFar] instanceof Double) {
      objects[dimsReadSoFar] = ((Double) objects[dimsReadSoFar]).floatValue();
    }
  } else {
    objects[dimsReadSoFar] =
        codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getComplexClazz());
  }
  ++dimsReadSoFar;
  jp.nextToken();
}
```
I tried this, but the complexity is the same, given that we'd have to duplicate the defensive checks, the loop counter increment, and the JSON stream advance. I have left it as is for now.
i guess my thought was that we would be removing one conditional check from the loop for every column of every row, which could add up, but i haven't really measured one way or another. This is a pretty hot loop though, so worth thinking about making it as efficient as possible
I was looking at it from a code-complexity perspective. Your approach is better for efficiency; I'd guess that most of the time branch prediction would make them roughly equivalent, but I'll make the changes to make it explicit.
```java
if (objects[dimsReadSoFar] instanceof Integer) {
  objects[dimsReadSoFar] = ((Integer) objects[dimsReadSoFar]).longValue();
} else if (objects[dimsReadSoFar] instanceof Double) {
  objects[dimsReadSoFar] = ((Double) objects[dimsReadSoFar]).floatValue();
}
```
could serde helpers just tell this to use Float.class and Long.class instead of generically doing Object.class and then checking it? (e.g. make key serde helpers return non-null classes instead of only complex types being non-null)
I have modified this. I had a doubt about the original flow: in the case of Double dimensions, we would deserialize them as a double and then cast them to a float, which seems incorrect. The modified PR seems correct, but it differs from the original behavior.
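The precision concern above can be reproduced with plain Java: a generic deserializer hands back a Double for any fractional number, and narrowing it to float discards roughly half the mantissa. A minimal sketch (the class and values are illustrative, not from the PR):

```java
// Minimal sketch of the reviewer's doubt: deserializing a double-typed
// dimension generically (as Object) yields a Double, and the original flow
// then narrowed it to float, losing precision.
public class DoubleToFloatNarrowing
{
  static float narrowLikeOriginalFlow(Object generic)
  {
    // Mirrors: objects[i] = ((Double) objects[i]).floatValue();
    return ((Double) generic).floatValue();
  }

  public static void main(String[] args)
  {
    Object generic = 0.1234567890123456789d; // what a generic JSON read returns
    float narrowed = narrowLikeOriginalFlow(generic);
    double preserved = (Double) generic;     // what the modified flow keeps
    System.out.println(narrowed == preserved); // false: precision was lost
  }
}
```

Reading the value with the declared class up front, as the modified PR does, avoids this lossy round trip.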
```java
 * work correctly when deserialized as generic java objects without type information.
 */
@Nullable
Class<?> getComplexClazz();
```
i suppose this can just be called getComplexClass (the reason for getClazz is to avoid conflicting with getClass), unless we do the other suggestion and make it non-null and use it for all the JSON serde, in which case it should be getClazz or getJsonClass or something
```java
public abstract TypeReference<ResultType> getResultTypeReference();

/**
 *
```
nit: please add javadocs or remove (preferably add them, and if it can return null, document what null means, like the other one does).
Also, should this be deprecated?
```java
@Override
public CacheStrategy getCacheStrategy(DataSourceMetadataQuery query, @Nullable ObjectMapper mapper)
{
  return null;
```
nit: if this is returning null is the override really needed?
added for clarity, will remove it.
ah, maybe it's safer to leave it in, since the default is to not cache, I guess? I don't feel super strongly either way
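For context, the tradeoff is between relying on an interface default and spelling it out. A minimal sketch of the "null means do not cache" convention being discussed (all names are hypothetical stand-ins, not Druid's actual interfaces):

```java
// Hypothetical stand-ins for the toolchest/cache-strategy shapes discussed
// above; null is the sentinel for "results of this query type are not cached".
public class CacheStrategyDefault
{
  interface ToolChest<Q>
  {
    // Interface default: no caching.
    default Object getCacheStrategy(Q query)
    {
      return null;
    }
  }

  // Keeping an explicit override changes nothing functionally, but records
  // that the lack of caching is deliberate rather than an oversight.
  static class MetadataToolChest implements ToolChest<String>
  {
    @Override
    public Object getCacheStrategy(String query)
    {
      return null;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(new MetadataToolChest().getCacheStrategy("q") == null); // true
  }
}
```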
```java
  }
  classes[i] = nullableTypeStrategy.getClazz();
} else {
  classes[i] = Object.class;
```
any reason not to use the real class here?
```java
}

final ObjectMapper newObjectMapper = spillMapper.copy();
newObjectMapper.registerModule(new SpillModule());
```
should this be a static somewhere so we don't have to make a new one all the time?
The new deserialization method is linked to a particular query. That is why we need to create one for each query.
oh right, i can't read 😜 , sorry
```java
@Override
public Class<?> getClazz()
{
  return Object.class;
```
Similar explanation as to why I left it as Object.class: to preserve existing behavior rather than being clearer and mentioning String.class.
I doubt we'd run into regressions with the latter, but I was being overly cautious.
wdyt?
```java
if (columnType.is(ValueType.COMPLEX)) {
  clazz = columnType.getNullableStrategy().getClazz();
} else {
  clazz = Object.class;
```
why not always use the column type?
```java
{
  NullHandling.initializeForTests();
  //noinspection ResultOfObjectAllocationIgnored
  new AggregatorsModule();
```
hmm, we should move the complex serde registrations of that module to a static method it calls instead of doing this, so it's more obvious what is going on here
cool, I'll do that. It's weird that we need to call this in the first place; I would have expected the stuff to be registered already. I'll make it more explicit and add a comment.
oops, yeah. Will fix it up
```java
  DimensionHandlerUtils.convertObjectToType(results.next(), dimensionSpec.getOutputType())
);

if (dimensionSpec.getOutputType().is(ValueType.COMPLEX)) {
```
i kind of wonder if we should special-case json (COMPLEX&lt;json&gt;) here, since it is kind of a special complex type (that really should probably be a built-in standard type instead of complex...), because convertValue looks pretty expensive.
> because convertValue looks pretty expensive.

That sounds reasonable.
Maybe as an alternative, what if we return Object.class instead of StructuredValue.class for jsonStrategy.getClazz()? I assume convertValue should short-circuit in that case, converting an object to an object. If so, we might get away without any special handling (perhaps).
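The short-circuit idea can be sketched in plain Java: when the declared class is Object.class (or the value already matches), a conversion step can hand the value back untouched instead of paying for a full conversion. This is a hypothetical helper, not Jackson's actual convertValue:

```java
import java.util.Map;

// Hypothetical helper illustrating the short-circuit: if the declared class
// is Object.class, or the value is already an instance of it, skip the
// expensive conversion path entirely.
public class ConvertShortCircuit
{
  static Object convertIfNeeded(Object value, Class<?> clazz)
  {
    if (clazz == Object.class || clazz.isInstance(value)) {
      return value; // same instance back, zero conversion cost
    }
    // A real implementation would fall back to mapper.convertValue(value, clazz).
    throw new UnsupportedOperationException("conversion path not sketched");
  }

  public static void main(String[] args)
  {
    Object structured = Map.of("k", "v"); // stand-in for a COMPLEX<json> value
    System.out.println(convertIfNeeded(structured, Object.class) == structured); // true
  }
}
```

Whether Jackson's convertValue actually short-circuits for an Object.class target would need to be verified before relying on it.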
```java
);

// Read the dimension
serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz();
```
Check failure (Code scanning / CodeQL): Array index out of bounds
```java
// Read the dimension
serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz();
objects[dimsReadSoFar] =
    codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz());
```
Check failure (Code scanning / CodeQL): Array index out of bounds
```diff
  public <T> CacheStrategy<ResultType, T, QueryType> getCacheStrategy(QueryType query, @Nullable ObjectMapper mapper)
  {
-   return null;
+   return getCacheStrategy(query);
```
Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
Not relevant, the deprecated method is used to maintain backward compatibility.
This reverts commit c2724b9.
```java
DruidException.conditionalDefensive(
    mapper != null,
    "Cannot deserialize complex dimension from if object mapper is not provided"
);
```
nit: i wonder if this should be outside of the loop, or at least detected before we get here, like maybe while doing createDimensionClasses or something
Makes sense, I'll move the checks there.
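Moving the checks into the class-construction step means they run once per query rather than once per row. A sketch under that assumption (the method and parameter names are hypothetical, not the PR's actual signatures):

```java
// Sketch of hoisting the defensive checks out of the per-row loop: validate
// everything once while building the per-dimension classes.
public class UpfrontValidation
{
  static Class<?>[] createDimensionClasses(int dimCount, Class<?>[] helperClasses, boolean mapperPresent)
  {
    // Checked once per query, not once per row/column in the hot loop.
    if (!mapperPresent) {
      throw new IllegalStateException("Cannot deserialize complex dimensions: no object mapper provided");
    }
    if (helperClasses.length < dimCount) {
      throw new IllegalStateException("Insufficient serde helpers present");
    }
    Class<?>[] classes = new Class<?>[dimCount];
    for (int i = 0; i < dimCount; i++) {
      // Fall back to Object.class when a helper declares no specific class.
      classes[i] = helperClasses[i] != null ? helperClasses[i] : Object.class;
    }
    return classes;
  }

  public static void main(String[] args)
  {
    Class<?>[] classes = createDimensionClasses(2, new Class<?>[]{Long.class, null}, true);
    System.out.println(classes[1]); // class java.lang.Object
  }
}
```

The read loop can then index into the precomputed array without re-validating per row.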
```java
@Override
public Class<?> getClazz()
{
  return Object.class;
```
i guess i was wondering if maybe this could help get rid of some of the coerce-list-to-array stuff, but totally fine to explore this later
```java
while (jp.currentToken() != JsonToken.END_ARRAY) {
```

```java
DruidException.conditionalDefensive(
```
Can we remove these checks, given that they run per row? Is it worth having this overhead in the hot path?
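To make the tradeoff concrete, here is a sketch of the loop with the timestamp branch hoisted out and no per-iteration defensive checks. The token stream is simulated with an Iterator; this is illustrative, not the PR's actual code:

```java
import java.util.Iterator;
import java.util.List;

// Sketch: reads [timestamp?, dim1, dim2, ...] from a simulated token stream.
// The timestamp conditional runs once before the loop, so the hot loop body
// carries no per-iteration timestamp or bounds branch.
public class HotLoopSketch
{
  static Object[] readRow(List<Object> tokens, boolean includeTimestamp)
  {
    Iterator<Object> jp = tokens.iterator();
    Object[] objects = new Object[tokens.size()];
    int dimsReadSoFar = 0;

    // Hoisted: read the timestamp once, before the loop.
    if (includeTimestamp && jp.hasNext()) {
      objects[dimsReadSoFar++] = jp.next();
    }
    // Hot loop: nothing but the read and the advance.
    while (jp.hasNext()) {
      objects[dimsReadSoFar++] = jp.next();
    }
    return objects;
  }

  public static void main(String[] args)
  {
    Object[] row = readRow(List.of(1716700800000L, "dimA", 42L), true);
    System.out.println(row.length); // 3
  }
}
```

With upfront validation (e.g. while building the dimension classes), the defensive checks would not need to reappear inside this loop.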
…e types when reading from spilled files and cached results (apache#16620)

Like apache#16511, but for keys that have been spilled or cached during the grouping process
Description
Like #16511, but for keys that have been spilled or cached during the grouping process. This was not caught in the original PR since there were no tests for it. This PR adds the necessary deserialization, as well as tests that fail if this patch and #16511 aren't in.
This PR has: