Allow use of centralized datasource schema and segment metadata cache together #17996
kfaraz merged 22 commits into apache:master from
Conversation
```java
updateUsedSegmentPayloadsInCache(datasourceToSummary);
retrieveAllPendingSegments(datasourceToSummary);
updatePendingSegmentsInCache(datasourceToSummary, syncStartTime);
retrieveAllSegmentSchemas(datasourceToSummary);
```
Should add something to the method-level javadoc of "stuff that happens every sync".
Do you mean the javadoc of each of these methods should say whether they are invoked in every sync or only the first sync? Or some other additional info too?
I meant that in the javadoc for this method itself, there's a section titled "The following actions are performed in every sync". It doesn't currently mention the schema syncing.
Ah, right, will do 👍🏻
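To illustrate the javadoc fix discussed above, here is a hypothetical sketch of how the "actions performed in every sync" section of the method-level javadoc could mention schema syncing (the method body and exact wording are placeholders, not the actual Druid code):

```java
class SyncJavadocSketch
{
  /**
   * The following actions are performed in every sync:
   * <ul>
   *   <li>Update used segment payloads in the cache</li>
   *   <li>Retrieve and update pending segments</li>
   *   <li>Retrieve segment schemas and update the schema cache</li>
   * </ul>
   */
  void syncWithMetadataStore()
  {
    // implementation elided in this sketch
  }
}
```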
```java
final String sql = StringUtils.format(
    "SELECT fingerprint, payload FROM %s WHERE version = %s",
    tablesConfig.getSegmentSchemasTable(), CentralizedDatasourceSchemaConfig.SCHEMA_VERSION
);
```
I think this code is fetching the entire set of segment schemas on every call to syncWithMetadataStore. Is this going to be OK, performance-wise? It seems expensive.
The current implementation in SqlSegmentsMetadataManager polls all the schemas in every sync too.
But since the schemas table already has a used_status_last_updated as well as a created_time column,
we can try to do delta syncs in a fashion similar to the segments table.
Thanks for the suggestion, I will update the PR accordingly.
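The delta-sync idea mentioned above can be sketched as a query builder that filters on the update timestamp instead of fetching the whole table. This is a hypothetical illustration: the helper name `buildDeltaSql`, the literal table name, and the exact column filter are assumptions, not the actual PR code.

```java
// Hypothetical sketch of a delta-sync query for the segment schemas table,
// assuming it has a used_status_last_updated column as noted in the thread.
class DeltaSchemaSyncSketch
{
  static String buildDeltaSql(String schemasTable, int schemaVersion, String minUpdateTime)
  {
    // Fetch only schemas updated since the last sync, instead of the full set.
    return String.format(
        "SELECT fingerprint, payload FROM %s"
        + " WHERE version = %d AND used_status_last_updated >= '%s'",
        schemasTable, schemaVersion, minUpdateTime
    );
  }
}
```

In a real implementation the timestamp would come from the previous sync's start time (with slack, as discussed later in this thread) rather than being passed in as a raw string.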
```java
if (syncFinishTime.get() == null) {
  retrieveUsedSegmentSchemasUpdatedAfter(DateTimes.COMPARE_DATE_AS_STRING_MIN, datasourceToSummary);
} else {
  retrieveUsedSegmentSchemasUpdatedAfter(syncStartTime, datasourceToSummary);
}
```
There should be some slack in this, to allow for the fact that clocks may not be perfectly synced across servers, and various factors (such as retries on insert) can cause records with timestamps in the past to appear. An hour should be more than enough.
Makes sense.
There is another bug here anyway: I should have been using the start time of the previous sync rather than the current sync.
> (such as retries on insert)

Minor clarification on this point:
For the most part, in IndexerSQLMetadataStorageCoordinator, I have tried to ensure that each transaction retry uses a fresh timestamp, since retries can go on for a while.
But as you point out, there can still be cases where past records appear.
P.S: I think we should be able to do something similar for fetching used segment IDs too.
Currently, we fetch all of them. But we could potentially fetch only the recently updated ones,
thus improving the delta sync time even further.
We would just need an index on used + used_status_last_updated in druid_segments table
(same as schemas table already does).
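The "previous sync start minus slack" approach discussed above can be sketched as follows. This is a hypothetical illustration, assuming an hour of slack as suggested in the review; the class and method names are invented for the sketch.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of picking the "updated after" cutoff for a delta sync: use the start
// time of the *previous* sync, minus an hour of slack to tolerate clock skew
// across servers and records inserted with past timestamps (e.g. retries).
class SyncCutoffSketch
{
  static final Duration SLACK = Duration.ofHours(1);

  static Instant cutoff(Instant previousSyncStart)
  {
    if (previousSyncStart == null) {
      return Instant.EPOCH; // first sync: fetch everything
    }
    return previousSyncStart.minus(SLACK);
  }
}
```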
```java
this.cacheMode = config.get().getCacheUsageMode();
this.pollDuration = config.get().getPollDuration().toStandardDuration();
this.tablesConfig = tablesConfig.get();
this.useSchemaCache = schemaConfig.get().isEnabled() && nodeRoles.contains(NodeRole.COORDINATOR);
```
Small point, but just wanted to comment. This sort of logic is an anti pattern in Guice usage. In an ideal world, decisions about which features to enable on which servers should live in the Guice modules, rather than the main code. Separating that way makes the main code more composable and testable.
Thanks for calling it out! Felt really hacky to me too.
Let me see how I can clean it up.
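The Guice anti-pattern being discussed can be illustrated without Guice itself: the "which implementation on which server?" decision moves out of the main class and into a wiring-time factory. In real code this choice would live in a Guice module's `configure()` or a `@Provides` method; everything below (interface, class, and role names) is a hypothetical sketch, not Druid's actual API.

```java
import java.util.Set;

// Sketch: the main class receives a SchemaCache and never inspects node roles;
// the role-based choice happens only at wiring time.
class WiringSketch
{
  interface SchemaCache { boolean isEnabled(); }
  static class RealSchemaCache implements SchemaCache { public boolean isEnabled() { return true; } }
  static class NoopSchemaCache implements SchemaCache { public boolean isEnabled() { return false; } }

  // Stand-in for a Guice @Provides method: pick the implementation based on
  // config and node role here, so consumers stay composable and testable.
  static SchemaCache provideSchemaCache(boolean schemaConfigEnabled, Set<String> nodeRoles)
  {
    if (schemaConfigEnabled && nodeRoles.contains("coordinator")) {
      return new RealSchemaCache();
    }
    return new NoopSchemaCache();
  }
}
```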
gianm left a comment
LGTM with the delta sync changes. Up to you if you want to adjust the Guice stuff.
Thanks for the review, @gianm.

@gianm , I have had to fix up the delta sync logic since there were some bugs with the previous approach. The sync for schemas now resembles the logic employed for used segments. It works as follows:
Description

#17935 enables use of HeapMemorySegmentMetadataCache on the Coordinator. But it cannot be used in conjunction with centralized datasource schema (i.e. SegmentSchemaCache). This patch supports usage of both features on the Coordinator together.

Main Changes

- Make SegmentSchemaCache a dependency of HeapMemorySegmentMetadataCache
- Bind SegmentMetadataCache and SegmentSchemaCache in MetadataManagerModule
- Add NoopSegmentSchemaCache to be used on the Overlord
- Sync schemas in HeapMemorySegmentMetadataCache and update SegmentSchemaCache
- Update used_status_last_updated column of a segment record when its schema fingerprint is updated
- Fix a race condition (seen in CompactionTaskRunTest): add a sync buffer duration of 10 seconds to HeapMemorySegmentMetadataCache, so that a segment is not removed from the cache if it was added to the cache just after sync start or within 10 seconds of any other update done to it (created, marked used, schema info added)

Guice changes

- MetadataManagerModule used only by Coordinator and Overlord to bind metadata managers
- SQLMetadataStorageDruidModule to bind only SQL connector related stuff

This PR has:
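The 10-second sync buffer described in the changes above can be sketched as a simple predicate: a cache entry is eligible for removal only if its last update is older than the buffer relative to the sync start. The class, method, and the exact comparison are hypothetical illustrations of the idea, not the PR's actual code.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the sync buffer idea: entries updated within the buffer window
// before sync start are kept, avoiding the race where a segment created or
// updated just after sync start is wrongly evicted.
class SyncBufferSketch
{
  static final Duration SYNC_BUFFER = Duration.ofSeconds(10);

  static boolean canRemoveFromCache(Instant lastUpdated, Instant syncStart)
  {
    return lastUpdated.plus(SYNC_BUFFER).isBefore(syncStart);
  }
}
```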