Consolidate the conversion between Granularity and VirtualColumn, and improve the mapping of granularity usage in projections. #18403

Merged: cecemei merged 22 commits into apache:master from cecemei:gran on Sep 12, 2025

@cecemei cecemei commented Aug 14, 2025

Description

Consolidate the conversion between Granularity and VirtualColumn, and improve the mapping of granularity usage in projections.

Before this PR, projections mishandled granularities with time zones. For example:

  • A TimeseriesQuery with Pacific time hourly granularity couldn't use a UTC hourly projection.
  • A GroupByQuery with India time hourly granularity was incorrectly matched to a UTC hourly projection (India time is offset from UTC by 5h30m, so its hour boundaries don't line up with UTC's).

After this PR, granularities match as follows:

  • A TimeseriesQuery with Pacific time hourly granularity can use a UTC hourly projection.
  • A GroupByQuery with India time hourly granularity can't use a UTC hourly projection.
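
The before/after behavior above comes down to whether bucket boundaries in the query's zone coincide with UTC bucket boundaries. A minimal java.time sketch of that idea follows. It is not Druid code: the class and method names are hypothetical, and it only checks a single instant rather than every offset a zone has ever used.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration, not part of Druid.
class ZoneFloorSketch
{
  // Floor an instant to the start of its local hour or day in the given zone.
  static Instant floor(Instant t, ZoneId zone, ChronoUnit unit)
  {
    return t.atZone(zone).truncatedTo(unit).toInstant();
  }

  // True if the bucket containing t starts at the same instant in `zone` and in UTC.
  static boolean alignsWithUtc(Instant t, ZoneId zone, ChronoUnit unit)
  {
    return floor(t, zone, unit).equals(floor(t, ZoneOffset.UTC, unit));
  }

  public static void main(String[] args)
  {
    Instant t = Instant.parse("2025-08-14T23:08:00Z");
    ZoneId pacific = ZoneId.of("America/Los_Angeles");
    ZoneId india = ZoneId.of("Asia/Kolkata");

    System.out.println(alignsWithUtc(t, pacific, ChronoUnit.HOURS)); // true: whole-hour offset
    System.out.println(alignsWithUtc(t, india, ChronoUnit.HOURS));   // false: 5h30m offset
    System.out.println(alignsWithUtc(t, pacific, ChronoUnit.DAYS));  // false: local midnight is 07:00Z
  }
}
```

The last line matches the Javadoc example in this PR: an hourly period in America/Los_Angeles lines up with UTC hours, but a daily period does not line up with UTC days.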

This is done through:

  • VirtualColumn -> Granularity conversion handles every TimestampFloorExpr, regardless of time zone.
  • Granularity -> VirtualColumn conversion can always turn a PeriodGranularity into a TimestampFloorExpr, regardless of time zone.
  • PeriodGranularity has a new canBeMappedTo function. Put simply, a finer granularity can be mapped to a coarser one. The check also accounts for time zone, origin, and compatibility between month/year, week, day, and hour periods. For example:
    • In the same zone, hour can be mapped to day, and 2 hours can be mapped to 4 hours but not to 3 hours.
    • Across different zones, hour can sometimes still be mapped to day: this works for Pacific time but not for India time.
    • Week cannot be mapped to month. To keep things simple, week/day/hour periods are not allowed when month/year is involved.
  • A query almost always has a __gran VirtualColumn, unless its granularity is Granularities.ALL.
  • A projection checks whether its __gran canBeMappedTo the query's __gran, which is more restrictive than isFinerThan.
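
The same-zone part of the rule above can be sketched as a divisibility check. This is a simplified illustration, not the PR's implementation: it assumes identical time zone and origin and fixed-length periods only, and the class and method names are hypothetical. The real canBeMappedTo also deals with zones, origins, and month/year periods, which this ignores.

```java
import java.time.Duration;

// Hypothetical simplification of the same-zone, fixed-length case.
class MappingSketch
{
  // A finer period maps to a coarser one when the coarser length is a whole multiple of the finer length.
  static boolean canBeMappedTo(Duration source, Duration target)
  {
    if (source.isZero() || source.isNegative() || target.isZero() || target.isNegative()) {
      throw new IllegalArgumentException("periods must be positive");
    }
    return target.toMillis() % source.toMillis() == 0;
  }

  public static void main(String[] args)
  {
    System.out.println(canBeMappedTo(Duration.ofHours(1), Duration.ofDays(1)));  // true
    System.out.println(canBeMappedTo(Duration.ofHours(2), Duration.ofHours(4))); // true
    System.out.println(canBeMappedTo(Duration.ofHours(2), Duration.ofHours(3))); // false
  }
}
```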

Additional restrictions on projection granularity:

  • It must be in the UTC time zone with a null origin, consistent with the __time column. Without this requirement, timeColumnName in the projection can't simply be mapped to __time.

Key changed/added classes in this PR
  • Projections
  • Granularities
  • PeriodGranularity
  • PeriodGranularityTest
  • CursorFactoryProjectionTest

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei marked this pull request as ready for review August 14, 2025 23:08
@cecemei cecemei changed the title from "some changes on gran and expr and projection" to "Consolidate the conversion between Granularity and VirtualColumn, and improve the mapping of granularity usage in projections." Aug 14, 2025
@cecemei cecemei requested a review from clintropolis August 14, 2025 23:09
* <li>Period('PT1H') in America/Los_Angeles can be mapped to Period('PT1H') in UTC</li>
* <li>Period('P1D') in America/Los_Angeles cannot be mapped to Period('P1D') in UTC</li>
*/
public boolean canBeMappedTo(PeriodGranularity target)
Member

admittedly i'm still digesting how exactly this method works, but it seems kind of expensive to run this match for every projection we consider of every segment when the timezones don't match (at least getUtcMappablePeriodSecondsOrThrow seems expensive). Perhaps we should make a cache of these conversions so we can re-use the work we've done, since it's likely going to be a lot of the same checks over and over?

Member

also these methods feel like they could use some additional comments to make it clearer what is going on and why this works

Contributor Author

added some additional comments. also added PERIOD_GRAN_CACHE in Projections.java.

@cecemei cecemei requested a review from clintropolis September 9, 2025 00:10
Comment on lines +249 to +256
// we don't allow:
// 1. virtual column on __time if there's no grouping on __time
// 2. non-UTC or non-epoch origin period granularity
throw InvalidInput.exception(
"cannot use granularity[%s] on column[%s] as projection time column",
maybeGranularity,
dimension.getName()
);
Member

I don't think this needs to be an exception, we just need to not consider it as a time column, the same way as a coarser granularity would not be __time. These columns can still be substituted for query virtual columns, so there is value in being able to pre-compute them, they just cannot serve as a replacement for __time (or time floor on __time) in queries.

Contributor Author

ah i was thinking about cases where there's a time-like virtual column (non-utc) and another time-like virtual column (utc, can be converted to gran), e.g. PT1H in Pacific and P1D in UTC. if we just ignore the first one, we would use floor(__time, P1D) as the time column, and the virtual column PT1H in Pacific would be computed from floor(__time, P1D), which would be incorrect.

it seems simple and safe to just assume the first time-like grouping column must be gran, and the following ones must be coarser, e.g. PT1H in UTC and P1D in Pacific is an acceptable config. Another case is PT2H, PT3H, PT1H, which seems a bit more complex (or calculated) if we don't have the assumption that the FIRST time-like column must be gran.

Contributor Author

@cecemei cecemei Sep 10, 2025

actually realized non-UTC vc would just be treated as regular grouping column, for PT2H and PT3H, it'd just choose the finer one PT2H.

Comment on lines +227 to +228
// determine the granularity and time column name for the projection, based on the first time-like grouping column.
// if there are multiple time-like grouping columns, they must be "coarser" than the first time-like grouping column.
Member

this is too restrictive, the __time column does not need to be first, we just need to replace any candidate __time column with the finer granularity column, since what we are looking for is the finest granularity column we can use as a substitute for __time in any queries against the base table

Contributor Author

finer is not completely safe, we're looking for the most compatible gran, e.g. between PT2H and PT3H, we can find PT1H as compatible (but this is not currently supported, i'm just using this as an example).

and deciding on the most compatible is tricky since we don't want non-UTC time to sneak in, e.x. PT30M in Pacific time would sort of invalidate our PT1H gran in UTC, handling this just makes things more complicated, thus choose to handle in a strict way here

Contributor Author

actually the compatible is PT6H. i realized projection need to find coarse ones as compatible

Contributor Author

updated this logic to handle the gran, it should never throw exception now, skips non-utc and non-mappable vc

Member

this comment is stale now since it no longer requires time to be first and additional time groupings to be coarser

Contributor Author

updated comment
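
The PT2H/PT3H/PT6H exchange in this thread can be illustrated with a least-common-multiple calculation. This is only a sketch of the arithmetic under the assumption of fixed-length periods in the same zone and origin. The class and method names are hypothetical, and the thread does not show how the merged code picks the projection's granularity.

```java
import java.time.Duration;

// Hypothetical illustration of the arithmetic discussed in the thread.
class CommonGranularitySketch
{
  static long gcd(long a, long b)
  {
    return b == 0 ? a : gcd(b, a % b);
  }

  // The finest fixed-length period that both inputs divide evenly (their least common multiple).
  static Duration finestCommonTarget(Duration a, Duration b)
  {
    long x = a.toMillis();
    long y = b.toMillis();
    return Duration.ofMillis(x / gcd(x, y) * y);
  }

  public static void main(String[] args)
  {
    System.out.println(finestCommonTarget(Duration.ofHours(2), Duration.ofHours(3))); // PT6H
    System.out.println(finestCommonTarget(Duration.ofHours(1), Duration.ofHours(3))); // PT3H
  }
}
```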

.addReferencedPhysicalColumn(ColumnHolder.TIME_COLUMN_NAME);
} else if (virtualGranularity.equals(Granularities.NONE)
|| projection.getEffectiveGranularity().equals(Granularities.ALL)) {
return null;
Member

what is this check for none and all separately for? Are there other query virtual column granularities that are ok with an all granularity projection?

Contributor Author

if query granularity is NONE, and projection gran is not (would already have been caught by the first if statement), we just can't map, e.g. a format(__time, 'YYYY-MM-DD HH:mm:ss') query can't use an hourly projection. This exits early.

but i also reformatted this part to be more readable, so return null should be at bottom now

Contributor Author

i should be more specific: previously this used a simple isFinerThan comparison, which is not safe, because PT2H is finer than PT3H but the former can't be mapped to the latter. now we're switching to canBeMappedTo, but it only applies to period gran, hence this special handling for Granularities.NONE and Granularities.ALL.
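
A concrete counterexample for the isFinerThan point above: re-flooring data that was already bucketed at PT2H into PT3H buckets can put a row in the wrong bucket. The sketch below uses plain epoch-millisecond arithmetic with an epoch origin. The class and method names are hypothetical and this is not Druid code.

```java
// Hypothetical illustration, not part of Druid.
class ReFloorSketch
{
  static final long HOUR_MS = 3_600_000L;

  // Floor a timestamp to the start of its bucket, with buckets anchored at the epoch.
  static long floor(long tMillis, long periodMillis)
  {
    return Math.floorDiv(tMillis, periodMillis) * periodMillis;
  }

  public static void main(String[] args)
  {
    long t = 3 * HOUR_MS + 30 * 60_000L; // 03:30 after the epoch

    long direct = floor(t, 3 * HOUR_MS);                     // correct 3-hour bucket
    long via2h = floor(floor(t, 2 * HOUR_MS), 3 * HOUR_MS);  // pre-bucketed at PT2H first
    long via1h = floor(floor(t, HOUR_MS), 3 * HOUR_MS);      // pre-bucketed at PT1H first

    System.out.println(direct / HOUR_MS); // 3
    System.out.println(via2h / HOUR_MS);  // 0, wrong bucket: PT2H is finer than PT3H but not mappable
    System.out.println(via1h / HOUR_MS);  // 3, same as direct: PT1H divides PT3H
  }
}
```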


public class Projections
{
private static final Map<byte[], Boolean> PERIOD_GRAN_CACHE = new HashMap<>();
Member

this needs to be a concurrent map or use a lock or something

Contributor Author

updated to use ConcurrentHashMap, no lock should be fine since we're only using computeIfAbsent
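
A minimal sketch of the lock-free caching pattern described here, using ConcurrentHashMap.computeIfAbsent. Everything in it is an assumption for illustration: the class name, the counter, and the key strings are hypothetical. The sketch wraps each byte[] key in a ByteBuffer because Java arrays use identity-based equals and hashCode, so two arrays with equal contents would otherwise be treated as different keys. Whether the merged Druid code does the same is not shown in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical illustration, not the code from Projections.java.
class MappingCacheSketch
{
  private static final ConcurrentHashMap<ByteBuffer, Boolean> CACHE = new ConcurrentHashMap<>();

  // Counts how often the expensive check actually runs (for demonstration only).
  static final AtomicInteger COMPUTATIONS = new AtomicInteger();

  // The key array must not be modified after it is passed in.
  static boolean cached(byte[] key, BooleanSupplier expensiveCheck)
  {
    // ByteBuffer compares by content, so equal byte sequences share one cache entry.
    return CACHE.computeIfAbsent(ByteBuffer.wrap(key), k -> {
      COMPUTATIONS.incrementAndGet();
      return expensiveCheck.getAsBoolean();
    });
  }

  public static void main(String[] args)
  {
    byte[] first = "projectionGran|queryGran".getBytes(StandardCharsets.UTF_8);
    byte[] second = "projectionGran|queryGran".getBytes(StandardCharsets.UTF_8);

    System.out.println(cached(first, () -> true));  // true, computed
    System.out.println(cached(second, () -> true)); // true, served from the cache
    System.out.println(COMPUTATIONS.get());         // 1
  }
}
```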

Member

@clintropolis clintropolis left a comment

overall lgtm 👍

Comment on lines +402 to +403
PeriodGranularity projectionGran = (PeriodGranularity) projection.getEffectiveGranularity();
byte[] combinedKey = StringUtils.toUtf8(projectionGran + "->" + virtualGran);
Member

it might be better to use getCacheKey() from the granularities instead of toString conversion, this whole bit could be

byte[] combinedKey = new CacheKeyBuilder(0).appendCacheable(projectionGran).appendCacheable(virtualGran).build()

Contributor Author

updated

@cecemei cecemei merged commit f55de8f into apache:master Sep 12, 2025
62 checks passed
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025