Add feature to automatically remove datasource metadata based on retention period #11227
maytasm merged 13 commits into apache:master
Conversation
)
);
for (String datasourceMetadataInDb : datasources) {
  if (!excludeDatasources.contains(datasourceMetadataInDb)) {
possible NPE - excludeDatasources is marked as nullable in the function definition
excludeDatasources should be @NotNull. Fixed
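The nullable-vs-`@NotNull` point above can be sketched as follows. This is a minimal, hypothetical illustration (`ExcludeFilter` and `filter` are names invented here, not code from the PR): treating a null exclude set as empty avoids the NPE without forcing callers to pass a set.

```java
// Hypothetical sketch of guarding a nullable exclude set before filtering.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ExcludeFilter
{
  // Treat a null exclude set as empty instead of risking an NPE on contains().
  public static List<String> filter(List<String> datasources, Set<String> excludeDatasources)
  {
    Set<String> exclude = excludeDatasources == null ? Collections.emptySet() : excludeDatasources;
    List<String> result = new ArrayList<>();
    for (String datasource : datasources) {
      if (!exclude.contains(datasource)) {
        result.add(datasource);
      }
    }
    return result;
  }
}
```

The PR instead makes the parameter `@NotNull`, which pushes the null check to the caller; either approach removes the NPE.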
    .mapTo(String.class)
    .list()
);
return connector.getDBI().withHandle(
What happens if an exception is thrown while trying to delete the datasources? withHandle will throw a CallbackFailedException - is this handled somewhere else in the code?
Added a try/catch block.
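The try/catch pattern discussed here can be sketched as below. `CallbackFailedException` is JDBI's real exception type, but the stand-in class and the `deleteQuietly` helper are hypothetical names for illustration; the idea is that one failed cleanup pass should be logged, not allowed to abort the coordinator duty.

```java
// Hypothetical sketch of catching a callback failure from a DB delete.
public class SafeDelete
{
  // Stand-in for org.skife.jdbi.v2.exceptions.CallbackFailedException.
  static class CallbackFailedException extends RuntimeException
  {
    CallbackFailedException(String msg)
    {
      super(msg);
    }
  }

  interface DeleteCallback
  {
    int run() throws CallbackFailedException;
  }

  // Runs the delete; on failure, logs and reports 0 rows deleted instead of propagating.
  public static int deleteQuietly(DeleteCallback callback)
  {
    try {
      return callback.run();
    }
    catch (CallbackFailedException e) {
      System.err.println("Failed to delete datasource metadata: " + e.getMessage());
      return 0;
    }
  }
}
```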
handle -> {
  final PreparedBatch batch = handle.prepareBatch(
      StringUtils.format(
          "DELETE FROM %1$s WHERE dataSource = :dataSource AND created_date < '%2$s'",
Why did you choose to build the delete statements one at a time instead of doing a batch delete?
I think we could encapsulate the excludeDatasources logic in a where clause of this delete statement instead.
Something like DELETE FROM datasources where created_date < "date" and datasource not in ("excludeDataSources")
This is to prevent the IN clause from becoming too large.
Also moved the filtering of the datasources to outside the handle block.
if ((lastKillTime + period) < System.currentTimeMillis()) {
  lastKillTime = System.currentTimeMillis();
  long timestamp = System.currentTimeMillis() - retainDuration;
nit: use consistent timestamp for all calculations
Suggested change:
-  if ((lastKillTime + period) < System.currentTimeMillis()) {
-    lastKillTime = System.currentTimeMillis();
-    long timestamp = System.currentTimeMillis() - retainDuration;
+  long currentTimeMillis = System.currentTimeMillis();
+  if ((lastKillTime + period) < currentTimeMillis) {
+    lastKillTime = currentTimeMillis;
+    long timestamp = currentTimeMillis - retainDuration;
Additional question: I notice this pattern in a few other coordinator duties.
Are there any additional safeguards we need for a very large retainDuration? What happens if the calculated timestamp is negative?
Added a safeguard so that the calculated timestamp can never be negative.
Note that I intentionally didn't specify in the docs that retainDuration has to be less than the current timestamp (although we do check this condition to protect ourselves from unexpected behavior), since this is a very unlikely scenario and I don't want to make the docs unnecessarily verbose.
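The negative-timestamp safeguard discussed in this thread can be sketched as a simple clamp. This is a minimal illustration (`RetentionCutoff` and `cutoff` are hypothetical names, not the PR's actual code): if `retainDuration` exceeds the current time, the cutoff is clamped to 0 rather than going negative.

```java
// Hypothetical sketch of clamping the retention cutoff timestamp at zero.
public class RetentionCutoff
{
  // A retainDuration larger than the current time can never produce a negative cutoff.
  public static long cutoff(long currentTimeMillis, long retainDuration)
  {
    return Math.max(0L, currentTimeMillis - retainDuration);
  }
}
```

With a clamp like this, a misconfigured (huge) retainDuration simply means "delete nothing older than epoch", i.e. the cleanup becomes a no-op instead of issuing a delete with a nonsensical negative date.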
Set<String> allDatasourceWithActiveSupervisor = allSupervisor.values()
    .stream()
    // Terminated supervisors will have their latest supervisorSpec set to NoopSupervisorSpec
    // (NoopSupervisorSpec is used as a tombstone marker)
This logic is very similar to SQLMetadataSupervisorManager#removeTerminatedSupervisorsOlderThan
Should that logic be moved out of the metadata store layer and pulled into the KillSupervisors class instead?
Should this logic be shared so that other callers can easily find the "active" supervisors?
I added some convenience methods in SQLMetadataSupervisorManager to getLatestTerminatedOnly and getLatestActiveOnly supervisors
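The active/terminated split described in this thread can be sketched as below. `NoopSupervisorSpec` as a tombstone marker comes from the PR discussion, but the surrounding types (`SupervisorSpec` interface, `StreamSupervisorSpec`) are simplified stand-ins, not Druid's actual classes.

```java
// Hypothetical sketch of filtering latest supervisor specs down to active datasources.
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class SupervisorFilter
{
  interface SupervisorSpec
  {
    String getDataSource();
  }

  // Tombstone written when a supervisor is terminated.
  static class NoopSupervisorSpec implements SupervisorSpec
  {
    private final String dataSource;
    NoopSupervisorSpec(String dataSource)
    {
      this.dataSource = dataSource;
    }
    @Override public String getDataSource()
    {
      return dataSource;
    }
  }

  // Stand-in for a real streaming supervisor spec.
  static class StreamSupervisorSpec implements SupervisorSpec
  {
    private final String dataSource;
    StreamSupervisorSpec(String dataSource)
    {
      this.dataSource = dataSource;
    }
    @Override public String getDataSource()
    {
      return dataSource;
    }
  }

  // A datasource is active iff its latest spec is not the Noop tombstone.
  public static Set<String> activeDatasources(Map<String, SupervisorSpec> latestSpecs)
  {
    return latestSpecs.values()
                      .stream()
                      .filter(spec -> !(spec instanceof NoopSupervisorSpec))
                      .map(SupervisorSpec::getDataSource)
                      .collect(Collectors.toSet());
  }
}
```

Exposing this as a shared helper (as the added getLatestActiveOnly/getLatestTerminatedOnly methods do) keeps the tombstone convention in one place instead of each caller re-implementing the filter.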
suneet-s left a comment:
LGTM after CI
[ERROR] Errors:
[ERROR] KillDatasourceMetadataTest.unnecessary Mockito stubbings » UnnecessaryStubbing
Description
We currently already have task log auto cleanup (#3677) and audit log auto cleanup (#11084). This PR adds a similar duration-based auto cleanup (time to retain), but for the datasource metadata table, to clean up datasources that are no longer active -- meaning the datasource no longer has an active supervisor running. (Note: datasource metadata only exists for datasources created from supervisors.)
This is useful when a Druid user has a high churn of tasks/datasources in a short amount of time, causing the metadata store size to grow uncontrollably.
This PR has: