Eliminate Periodic Realtime Segment Metadata Queries: Task Now Publish Schema for Seamless Coordinator Updates#15475
Conversation
|
Please note that the test failures are because of an uncovered |
cryptoe
left a comment
There was a problem hiding this comment.
Need to go through UT's.
|
|
||
| if ((!Objects.equals(numRows, previousNumRows)) || (updatedColumns.size() > 0) || (newColumns.size() > 0)) { | ||
| publish = true; | ||
| delta = true; |
There was a problem hiding this comment.
| delta = true; |
| new SegmentSchema( | ||
| segmentId.getDataSource(), | ||
| segmentId.toString(), | ||
| delta, |
There was a problem hiding this comment.
| delta, | |
| true, |
There was a problem hiding this comment.
Delta can be either true or false. Hence cannot hardcode.
| private final CopyOnWriteArrayList<FireHydrant> hydrants = new CopyOnWriteArrayList<>(); | ||
|
|
||
| private final LinkedHashSet<String> dimOrder = new LinkedHashSet<>(); | ||
| // columns excluding current index, includes __time column |
There was a problem hiding this comment.
what is curIndex, the inmemory Fire hydrant ? If yes lets rename it like that / or document it
| numRowsExcludingCurrIndex.addAndGet(segment.asQueryableIndex().getNumRows()); | ||
| QueryableIndex index = segment.asQueryableIndex(); | ||
| mergeIndexDimensions(new QueryableIndexStorageAdapter(index)); | ||
| numRowsExcludingCurrIndex.addAndGet(index.getNumRows()); |
There was a problem hiding this comment.
This is only for restoring tasks rite ?
| /** | ||
| * Merge the column from the index with the existing columns. | ||
| */ | ||
| private void mergeIndexDimensions(StorageAdapter storageAdapter) |
There was a problem hiding this comment.
| private void mergeIndexDimensions(StorageAdapter storageAdapter) | |
| private void overWriteIndexDimensions(StorageAdapter storageAdapter) |
| ReferenceCountingSegment segment = hydrant.getIncrementedSegment(); | ||
| try { | ||
| numRowsExcludingCurrIndex.addAndGet(segment.asQueryableIndex().getNumRows()); | ||
| QueryableIndex index = segment.asQueryableIndex(); |
There was a problem hiding this comment.
Sink is also used for both realtime and batch. How do your changes affect batch ingestion ?
There was a problem hiding this comment.
Additionally, I am maintaining set of columns in each sink. I am assuming this wouldn't be much of an overhead in batch ingestion.
cryptoe
left a comment
There was a problem hiding this comment.
Changes LGTM. Nit comments only.
|
|
||
| // Newest segments first, so they override older ones. | ||
| private static final Comparator<SegmentId> SEGMENT_ORDER = Comparator | ||
| protected static final Comparator<SegmentId> SEGMENT_ORDER = Comparator |
There was a problem hiding this comment.
Whats the change in this class apart from field access change ?
There was a problem hiding this comment.
Visibility is changed for a couple of fields.
|
|
||
| if (Boolean.parseBoolean(properties.getProperty("druid.centralizedDatasourceSchema.enabled")) | ||
| && !properties.getOrDefault("druid.serverview.type", "http").equals("http")) { | ||
| throw new RuntimeException( |
There was a problem hiding this comment.
This should be druidException for cluster administrator no ?
| if (isSegmentMetadataCacheEnabled) { | ||
| if (!properties.getOrDefault(SERVERVIEW_TYPE_PROPERTY, "http").equals("http")) { | ||
| throw new RuntimeException( | ||
| "CentralizedDatasourceSchema feature is incompatible with Zookeeper based segment discovery. " |
There was a problem hiding this comment.
I think we should not mention zookeeper but just the value of properties.getOrDefault(SERVERVIEW_TYPE_PROPERTY). If tomorrow we add new eay of announcing serviewViews that we need not change this piece of code.
| private static final String AS_OVERLORD_PROPERTY = "druid.coordinator.asOverlord.enabled"; | ||
| private static final String CENTRALIZED_SCHEMA_MANAGEMENT_ENABLED = "druid.centralizedDatasourceSchema.enabled"; | ||
| private static final String CENTRALIZED_DATASOURCE_SCHEMA_ENABLED = "druid.centralizedDatasourceSchema.enabled"; | ||
| private static final String SERVERVIEW_TYPE_PROPERTY = "druid.serverview.type"; |
There was a problem hiding this comment.
Please push this config to ServerViewModule and then reference them from there.
|
@findingrish Thank you for the PR. |
…rce Schema Building (#15817) Issue: #14989 The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This is the final change which involves publishing segment schema for finalized segments from task and periodically polling them in the Coordinator.
Description
Issue: #14989
The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.
This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will identify the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.
Design
Task
Sink.getSignatureto return the RowSignature of the Sink by coming the signature of eachFireHydrant.StreamAppenderator.SinkSchemaAnnouncerwill compute sink schema changes and announce them to theDataSegmentAnnouncer.DataSegmentAnnouncerto receive sink schema information and manage schema cleanup when a task is closed.SegmentSchemashas been added to facilitate the passing of schema information for multiple segments.DataSegmentChangeRequesthas been introduced, namedSegmentSchemasChangeRequest.Coordinator
HttpServerInventoryViewto handle schema information.Schema Update Flow:HttpServerInventoryView->CoordinatorServerView->CoordinatorSegmentMetadataCache.CoordinatorSegmentMetadatacache has been updated to incorporate schema changes. Changes have also been made to the refresh logic to eliminate the need for executing segment metadata queries for realtime segments.Testing
Potential side effects
None
Limitations
Currently, this feature doesn't work with zookeeper based segment announcement.
Upgrade considerations
The general upgrade order should be followed. The new code is behind a feature flag, so it is compatible with existing setups. Even if centralized datasource schema building (#14985) is enabled, realtime segments will be refreshed using segment metadata query to Indexer/Task.
This experimental feature aims to eliminate the necessity for periodically executing the SegmentMetadataQuery to the Indexer/Task for retrieving the schema of realtime segments. Presently, it is accessible through two feature flags and should only be enabled for Proof of Concept (PoC) or testing purposes. To activate it, configure the following settings in the common configurations:
druid.centralizedDatasourceSchema.enabledanddruid.centralizedDatasourceSchema.announceRealtimeSegmentSchema. It's important to note that the feature flag is temporarydruid.centralizedDatasourceSchema.announceRealtimeSegmentSchemaand will be removed in a subsequent update.This PR has: