Motivation
The process to delete data (mark segments unused) is currently not simple; see the steps in the tutorial at http://druid.io/docs/latest/tutorials/tutorial-delete-data.html.
It would be useful to provide coordinator APIs that can be used to delete (mark unused) or reload (mark used) segments for a datasource by interval. The API to reload data should mark any non-overshadowed segments as used.
Proposed changes
Propose to add two new APIs to DataSourcesResource.java:
POST /druid/coordinator/v1/datasources/{dataSourceName}/markUnused
POST /druid/coordinator/v1/datasources/{dataSourceName}/markUsed
The payload would contain the interval or the segment ids which need to be marked used/unused. Exactly one of interval or segment ids should be provided; if both are provided in the payload, the API would return an error.
Interval specifies the start and end times as ISO 8601 strings.
interval = [start, end] where start and end are both inclusive.
Only segments which are fully contained in the interval will be affected, i.e. segments which only partially overlap the start or end of the interval will not be affected by these APIs.
For example, the interval payload JSON can be:
{
"interval" : "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"
}
Sample payload with segment ids:
{
"segmentIds":
[
"wikipedia_2019-01-03T19:00:00.000Z_2019-01-03T20:00:00.000Z_2019-01-03T19:00:05.964Z",
"wikipedia_2019-01-03T22:00:00.000Z_2019-01-03T23:00:00.000Z_2019-01-03T19:00:05.964Z_2"
]
}
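The payload rules above can be sketched roughly as follows. This is only an illustration in Python, not the coordinator implementation: the helper names (parse_interval, validate_payload, is_contained) are hypothetical, and the parsing assumes ISO 8601 strings without a timezone suffix, as in the interval example.

```python
from datetime import datetime


def parse_interval(text):
    """Split a 'start/end' string into two datetimes (hypothetical helper)."""
    start_text, end_text = text.split("/")
    return datetime.fromisoformat(start_text), datetime.fromisoformat(end_text)


def validate_payload(payload):
    """Return the single key that was provided, or raise ValueError.

    Mirrors the rule that exactly one of 'interval' or 'segmentIds'
    must be present in the payload.
    """
    has_interval = "interval" in payload
    has_ids = "segmentIds" in payload
    if has_interval == has_ids:  # neither or both were provided
        raise ValueError("provide exactly one of 'interval' or 'segmentIds'")
    return "interval" if has_interval else "segmentIds"


def is_contained(segment_interval, request_interval):
    """True only if the segment lies entirely inside the requested interval."""
    seg_start, seg_end = parse_interval(segment_interval)
    req_start, req_end = parse_interval(request_interval)
    return req_start <= seg_start and seg_end <= req_end
```

With this sketch, a segment that starts before the requested interval and ends inside it is not contained, so it would be left untouched.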
The following HTTP response codes would be returned by both APIs:
200 : if data is successfully dropped or reloaded
204 : if no segments are found for the given interval, or the given segment ids are not found in the db
400 : if the payload is incorrect, i.e. either none or both (interval and segment ids) are provided
500 : if there was an exception trying to carry out the action
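From the caller's side, using the proposed endpoints might look roughly like the sketch below. It only builds the request and maps the status codes listed above; nothing is sent. The coordinator address and the helper names (build_request, describe_status) are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed coordinator address; adjust for the actual cluster.
COORDINATOR = "http://localhost:8081"

# Meanings of the response codes as listed in this proposal.
STATUS_MEANING = {
    200: "segments were marked successfully",
    204: "no matching segments were found",
    400: "payload is incorrect",
    500: "exception while carrying out the action",
}


def build_request(datasource, action, payload):
    """Build (but do not send) a POST request for markUsed / markUnused."""
    if action not in ("markUsed", "markUnused"):
        raise ValueError("action must be 'markUsed' or 'markUnused'")
    url = f"{COORDINATOR}/druid/coordinator/v1/datasources/{datasource}/{action}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def describe_status(code):
    """Translate a response code into the meaning given in the proposal."""
    return STATUS_MEANING.get(code, "unexpected status")
```

The request could then be sent with urllib.request.urlopen(request), and the resulting status passed to describe_status.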
Handling overshadowed segments while reloading data
The API to reload data would only set used=true for segments which are not overshadowed by other segments, to avoid the needless reload and drop of segments on historicals. This can be done in SQLMetadataSegmentManager by filtering out overshadowed segments before updating the metadata store.
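The filtering idea can be illustrated with a deliberately simplified sketch. It is written in Python for brevity and is not the SQLMetadataSegmentManager code. It assumes a segment counts as overshadowed when a single other segment with a strictly higher version fully covers its interval; the real timeline logic in Druid is more involved (for example, several segments can jointly overshadow one). The Segment tuple and the function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical, minimal view of a segment; start, end and version are
# ISO 8601 strings in one consistent format, so they compare correctly as text.
Segment = namedtuple("Segment", ["segment_id", "start", "end", "version"])


def is_overshadowed(segment, others):
    """Simplified check: some newer-versioned segment fully covers this one."""
    return any(
        other.version > segment.version
        and other.start <= segment.start
        and segment.end <= other.end
        for other in others
    )


def segments_to_mark_used(candidates, all_segments):
    """Keep only the candidates that are not overshadowed."""
    return [s for s in candidates if not is_overshadowed(s, all_segments)]
```

Only the segments returned by segments_to_mark_used would then be updated to used=true in the metadata store, so historicals never load a segment just to drop it again.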
Rationale
Considered DELETE /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval} and POST /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}, but the former is already being used to kill segments. So decided to use POST with the datasource and action in the path, and the interval or segment ids in the payload.
Operational impact
New APIs, so no operational impact.
Future work
DELETE /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval} kills the segments, but DELETE /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId} disables segments. It would be useful to make the behavior of these two APIs consistent, or to change the URLs to make them more intuitive.