@gianm gianm commented Dec 29, 2025

This patch integrates MSQ with virtual storage. It also refactors how MSQ reads inputs to give stages more control over how inputs are read and merged. In particular, stages are now able to fully control merging logic.

The main changes:

  1. Integrate with virtual storage. Removed DataSegmentProvider, replaced it with direct usage of SegmentManager by SegmentsInputSliceReader. The SegmentManager reference ends up wrapped into RegularLoadableSegment, which provides methods acquire() and acquireIfCached() to the query logic.

  2. Give stages control over input merging: rework InputSliceReader to return ReadablePartitions directly, without embedding any merging logic. Break out StandardPartitionReader as a separate class.
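The difference between the two access methods named in item 1 can be sketched roughly as follows. This is a toy illustration, not Druid's actual API: `LoadableSegmentSketch`, `ToySegmentCache`, the `String` payload, and the `Optional` return type are all hypothetical. Only the method names `acquire()` and `acquireIfCached()` come from the description above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a loadable segment handle; not Druid's real interface.
interface LoadableSegmentSketch
{
  // Returns the segment, loading it into the local cache first if needed.
  String acquire();

  // Returns the segment only if it is already cached; never triggers a load.
  Optional<String> acquireIfCached();
}

// Toy cache playing the role that SegmentManager plays in the description (assumption).
final class ToySegmentCache
{
  private final Map<String, String> cached = new ConcurrentHashMap<>();

  // Counts how many times a segment actually had to be loaded.
  final AtomicInteger loads = new AtomicInteger();

  LoadableSegmentSketch segment(final String segmentId)
  {
    return new LoadableSegmentSketch()
    {
      @Override
      public String acquire()
      {
        return cached.computeIfAbsent(segmentId, id -> {
          loads.incrementAndGet();
          return "data-for-" + id;
        });
      }

      @Override
      public Optional<String> acquireIfCached()
      {
        return Optional.ofNullable(cached.get(segmentId));
      }
    };
  }
}

final class AcquireDemo
{
  public static void main(String[] args)
  {
    ToySegmentCache cache = new ToySegmentCache();
    LoadableSegmentSketch segment = cache.segment("seg-1");

    System.out.println(segment.acquireIfCached().isPresent()); // nothing cached yet
    System.out.println(segment.acquire());                     // triggers the one load
    System.out.println(segment.acquireIfCached().isPresent()); // now served from cache
  }
}
```

In this sketch, query logic that can tolerate a cache miss would call `acquireIfCached()`, while logic that must read the segment would call `acquire()` and pay the load cost once.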

Other changes:

  1. Move ReadableInput to the querykit package. It is no longer specific to the MSQ framework.

  2. Remove StandardStageProcessor, refactoring dependent code to not require it.

  3. Remove ExternalColumnSelectorFactory wrapper. Type casting is now handled directly by RowBasedColumnSelectorFactory.

  4. Include full query context in worker context, rather than just a subset.

The purpose of the evictImmediately config is to enable using
SegmentLocalCacheManager for loading segments on MSQ worker tasks, where
segments are not assigned by load/drop rules and where the local cache
generally has no specific maxSize configured. Segments need to be evicted
as soon as they are no longer held so local disks don't fill up.

The main changes:

1) In StorageLocation, update the releaseHold runnable to check for
   evictImmediately. If it is set, unmount the cache entry if all holds
   have been released.

2) In SegmentLocalCacheManager, when evictImmediately is set, "mount"
   sets an onUnmount handler to delete the info file.
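
The hold-release behavior described in the two items above can be sketched as a small reference-counting example. This is a simplified illustration under stated assumptions, not the actual StorageLocation or SegmentLocalCacheManager code: the class `CacheEntrySketch` and its methods are hypothetical, and the `onUnmount` callback merely stands in for deleting the info file.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cache entry; the real StorageLocation logic is more involved.
final class CacheEntrySketch
{
  private final boolean evictImmediately;
  private final Runnable onUnmount;
  private final AtomicInteger holds = new AtomicInteger();
  private boolean mounted = true;

  CacheEntrySketch(final boolean evictImmediately, final Runnable onUnmount)
  {
    this.evictImmediately = evictImmediately;
    this.onUnmount = onUnmount;
  }

  // Takes a hold on the entry and returns the runnable that releases it.
  Runnable hold()
  {
    holds.incrementAndGet();
    final AtomicBoolean released = new AtomicBoolean();
    return () -> {
      // Guard so that running the same release twice only counts once.
      if (released.compareAndSet(false, true)) {
        // When the last hold goes away and evictImmediately is set, unmount.
        if (holds.decrementAndGet() == 0 && evictImmediately) {
          unmount();
        }
      }
    };
  }

  private synchronized void unmount()
  {
    if (mounted) {
      mounted = false;
      onUnmount.run(); // stands in for "delete the info file"
    }
  }

  synchronized boolean isMounted()
  {
    return mounted;
  }
}

final class EvictDemo
{
  public static void main(String[] args)
  {
    CacheEntrySketch entry = new CacheEntrySketch(true, () -> System.out.println("info file deleted"));
    Runnable first = entry.hold();
    Runnable second = entry.hold();
    first.run();
    System.out.println("mounted after first release: " + entry.isMounted());
    second.run();
    System.out.println("mounted after last release: " + entry.isMounted());
  }
}
```

With `evictImmediately` unset, the same release runnable only decrements the hold count, and the entry stays mounted until some other eviction policy removes it.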

This patch integrates MSQ with virtual storage. It also refactors how MSQ
reads inputs to give stages more control over how inputs are read and merged.
In particular, stages are now able to fully control merging logic.

The main changes:

1) Integrate with virtual storage: merge the two DataSegmentProvider impls
   (Dart and Task) into DataSegmentProviderImpl that relies on SegmentManager.

2) Give stages control over input merging: rework InputSliceReader to return
   ReadablePartitions directly, without embedding any merging logic. Break out
   StandardPartitionReader as a separate class.
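
As a rough illustration of item 2, the sketch below separates "reading partitions" from "deciding how to merge them". It is not the actual InputSliceReader or StandardPartitionReader code: `SliceReaderSketch`, `StageMergeSketch`, and the use of integer lists as partitions are hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical reader that hands partitions back untouched.
final class SliceReaderSketch
{
  // Returns the partitions as-is; no merging happens at this layer.
  static List<List<Integer>> read()
  {
    return List.of(List.of(1, 4, 9), List.of(2, 3, 10), List.of(5));
  }
}

// Hypothetical stage-side logic: each stage picks its own merge strategy.
final class StageMergeSketch
{
  // A stage that does not care about order can simply concatenate.
  static List<Integer> concat(final List<List<Integer>> partitions)
  {
    final List<Integer> out = new ArrayList<>();
    for (List<Integer> partition : partitions) {
      out.addAll(partition);
    }
    return out;
  }

  // A stage that needs sorted input can k-way merge already-sorted partitions.
  static List<Integer> sortedMerge(final List<List<Integer>> partitions)
  {
    // Heap entries are {partition index, offset within that partition}.
    final PriorityQueue<int[]> heap = new PriorityQueue<>(
        Comparator.comparingInt((int[] e) -> partitions.get(e[0]).get(e[1]))
    );
    for (int p = 0; p < partitions.size(); p++) {
      if (!partitions.get(p).isEmpty()) {
        heap.add(new int[]{p, 0});
      }
    }
    final List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      final int[] e = heap.poll();
      final List<Integer> partition = partitions.get(e[0]);
      out.add(partition.get(e[1]));
      if (e[1] + 1 < partition.size()) {
        heap.add(new int[]{e[0], e[1] + 1});
      }
    }
    return out;
  }
}

final class MergeDemo
{
  public static void main(String[] args)
  {
    final List<List<Integer>> partitions = SliceReaderSketch.read();
    System.out.println(StageMergeSketch.concat(partitions));
    System.out.println(StageMergeSketch.sortedMerge(partitions));
  }
}
```

The point of the split is that the reader no longer needs to know which strategy applies; a stage that wants something other than these two options can supply its own.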

Other changes:

1) Move ReadableInput to the querykit package. It is no longer specific to the
   MSQ framework.

2) Remove StandardStageProcessor, refactoring dependent code to not require it.

3) Remove ExternalColumnSelectorFactory wrapper. Type casting is now handled
   directly by RowBasedColumnSelectorFactory.

4) Include full query context in worker context, rather than just a subset.

Includes apache#18871.
@github-actions github-actions bot added the labels "Area - Batch Ingestion", "Area - Segment Format and Ser/De", "Area - Ingestion", and "Area - MSQ" (For multi stage queries - https://github.com/apache/druid/issues/12262) on Dec 29, 2025
final int partitionNum = i % 2;

segments.add(
DataSegment.builder()

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation Note test

Invoking DataSegment.builder should be avoided because it has been deprecated.
*/
public static DataSegment createDataSegmentForTest(final SegmentId segmentId)
{
return DataSegment.builder()

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation Note test

Invoking DataSegment.builder should be avoided because it has been deprecated.