API to drop and reload data by interval #7439

@surekhasaharan

Motivation

The process to delete data (mark segments unused) is currently not simple; see the steps in the tutorial at http://druid.io/docs/latest/tutorials/tutorial-delete-data.html.
It would be useful to provide coordinator APIs that delete (mark unused) or reload (mark used) segments for a datasource by interval. The API to reload data should mark any non-overshadowed segments as used.

Proposed changes

Propose to add two new APIs to DataSourcesResource.java:

POST /druid/coordinator/v1/datasources/{dataSourceName}/markUnused
POST /druid/coordinator/v1/datasources/{dataSourceName}/markUsed

The payload would contain either the interval or the segment ids that need to be marked used/unused. Exactly one of the two should be provided; if both are provided in the payload, the API would return an error.
Interval specifies the start and end times as ISO 8601 strings.
interval = [start, end] where start and end are both inclusive.
Only the segments which are fully contained in the interval will be affected, i.e. segments which only partially overlap the start or end will not be affected by these APIs.

For example, the interval payload JSON can be:

{
     "interval" : "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"
}
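
The containment rule described above could be sketched roughly as follows. This is only an illustration in Python, not the proposed Java implementation: the helper names `parse_interval` and `is_contained` are hypothetical, and the sketch assumes all timestamps are UTC and simply drops a trailing `Z`.

```python
from datetime import datetime


def parse_interval(interval):
    """Split an ISO 8601 'start/end' string into two datetimes.

    Assumption: all timestamps are UTC, so a trailing 'Z' is dropped.
    """
    def parse(ts):
        return datetime.fromisoformat(ts[:-1] if ts.endswith("Z") else ts)

    start, end = interval.split("/")
    return parse(start), parse(end)


def is_contained(segment_interval, request_interval):
    """True only if the segment lies entirely inside the requested interval."""
    seg_start, seg_end = parse_interval(segment_interval)
    req_start, req_end = parse_interval(request_interval)
    return req_start <= seg_start and seg_end <= req_end
```

With the payload above, a segment covering 2012-01-01/2012-01-02 would be affected, while a segment that starts before 2012-01-01 and ends inside the interval would be left alone.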

Sample payload with segment ids:

{
	"segmentIds": 
	[
		"wikipedia_2019-01-03T19:00:00.000Z_2019-01-03T20:00:00.000Z_2019-01-03T19:00:05.964Z",
		"wikipedia_2019-01-03T22:00:00.000Z_2019-01-03T23:00:00.000Z_2019-01-03T19:00:05.964Z_2"
	]
}
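
As a rough idea of how a client might call the proposed endpoints, here is a minimal Python sketch that only builds the request. The function name `build_mark_request` is hypothetical, and the Coordinator address `http://localhost:8081` is an assumption for illustration; the endpoint paths are the ones proposed above.

```python
import json
import urllib.request


def build_mark_request(base_url, datasource, action, payload):
    """Build (but do not send) a POST request for markUsed / markUnused."""
    if action not in ("markUsed", "markUnused"):
        raise ValueError("action must be 'markUsed' or 'markUnused'")
    url = f"{base_url}/druid/coordinator/v1/datasources/{datasource}/{action}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed Coordinator address; adjust for a real cluster.
request = build_mark_request(
    "http://localhost:8081",
    "wikipedia",
    "markUnused",
    {"interval": "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"},
)
# urllib.request.urlopen(request) would send it to a running Coordinator.
```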

The following HTTP response codes would be returned by both APIs:

200 : if data is successfully dropped or reloaded
204 : if no segments are found for the given interval, or the given segment ids are not found in the metadata store
400 : if the payload is incorrect, i.e. either none or both of interval and segment ids are provided
500 : if there was an exception while trying to carry out the action
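
The mapping from payload and outcome to these response codes could look roughly like the Python sketch below. It is an illustration only: `response_status` and the `update_segments` callback (assumed to return the number of segments whose state changed) are hypothetical names, not part of the proposal.

```python
def response_status(payload, update_segments):
    """Return the HTTP status code for a markUsed / markUnused request.

    update_segments is a hypothetical callback that applies the change and
    returns how many segments were updated in the metadata store.
    """
    has_interval = payload.get("interval") is not None
    has_segment_ids = payload.get("segmentIds") is not None
    # Exactly one of the two must be present: none or both is a bad request.
    if has_interval == has_segment_ids:
        return 400
    try:
        updated = update_segments(payload)
    except Exception:
        return 500
    return 200 if updated > 0 else 204
```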

Handling overshadowed segments while reloading data

The API to reload data would only mark used=true for segments which are not overshadowed by other segments to avoid the needless reload and drop of segments on historicals. This can be done in SQLMetadataSegmentManager by filtering segments which are overshadowed before updating in the metadata store.
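
The filtering step might be sketched as below. This is a deliberately simplified Python illustration, not how SQLMetadataSegmentManager would do it: the `Segment` type and both function names are hypothetical, and a segment is treated as overshadowed only when a single segment with a higher version fully covers its interval (partitions and overshadowing by a combination of several segments are ignored). Timestamps and versions are compared as strings, which assumes they all share the same ISO 8601 format.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    segment_id: str
    start: str    # ISO 8601 string, same format for all segments
    end: str
    version: str  # version strings compare lexicographically


def is_overshadowed(segment: Segment, others: List[Segment]) -> bool:
    """Simplified check: a newer-version segment covers the whole interval."""
    return any(
        o.version > segment.version
        and o.start <= segment.start
        and segment.end <= o.end
        for o in others
    )


def segments_to_mark_used(candidates: List[Segment],
                          all_segments: List[Segment]) -> List[Segment]:
    """Keep only candidates that are not overshadowed, preserving order."""
    return [s for s in candidates if not is_overshadowed(s, all_segments)]
```

Under these assumptions, only the surviving segments would have used=true written to the metadata store, so historicals never load a segment just to drop it again.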

Rationale

Considered DELETE /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval} and POST /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}, but the former is already being used to kill segments. So decided to use POST with the datasource and action in the URL, and the interval or segment ids in the payload.

Operational impact

These are new APIs, so there is no operational impact.

Future work

DELETE /druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval} kills the segments, but DELETE /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId} disables segments. It would be useful to make the behavior of these two APIs coherent or change the URL to make it more intuitive.
