We were on Druid 0.10 and recently upgraded to 0.11, but we still see the same issue. We have a use case in which we (a) ingest data via HDFS indexing, then (b) wait until the data is available to be queried, then (c) ping a separate system, which queries the data to do something with it. The problem is in step (b). At first we tried this:
1. Poll the Overlord status endpoint /druid/indexer/v1/task//status until it returns SUCCESS or FAILED. This lets us know when the indexing job is complete.
2. Poll the Coordinator loadstatus endpoint /druid/coordinator/v1/loadstatus until it returns 100% for our datasource.
But then we discovered that, for a pre-existing datasource, there is a period after the indexing job completes during which loadstatus still returns 100%, before it starts to return something less than 100%. So we added another step:
1.5. Poll the Coordinator loadstatus endpoint every second until it returns non-100%.
This polling has to happen every second because a small datasource takes very little time to distribute through the cluster. If we poll every 10 seconds, we might never see a non-100% number, and then we have no way of knowing whether the new data is available. Even at 1 second it's possible to miss this window.
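For reference, here is a rough Python sketch of the polling procedure described above, mainly to make the race visible. The function names, the `dip_timeout` parameter, and the "loaded_unconfirmed" result are made up for illustration and are not part of Druid. The JSON shapes assumed for the two endpoints (`status.status` for the task, a datasource-to-percentage map for loadstatus) are our understanding of the responses, not something verified against every version.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def task_state(overlord, task_id):
    # Assumed response shape: {"task": ..., "status": {"status": "RUNNING", ...}}
    body = fetch_json(f"{overlord}/druid/indexer/v1/task/{task_id}/status")
    return body["status"]["status"]


def load_pct(coordinator, datasource):
    # Assumed response shape: {"<datasource>": <percent loaded>, ...}
    body = fetch_json(f"{coordinator}/druid/coordinator/v1/loadstatus")
    return body.get(datasource, 0.0)


def wait_until_queryable(get_task_state, get_load_pct, dip_timeout=60.0,
                         interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Run the three polling phases; the two getters are zero-argument callables."""
    # Phase 1: wait for the indexing task to reach a terminal state.
    while True:
        state = get_task_state()
        if state in ("SUCCESS", "FAILED"):
            break
        sleep(interval)
    if state == "FAILED":
        return "task_failed"

    # Phase 2: watch for loadstatus to drop below 100%.
    # dip_timeout is a hypothetical safety valve so we do not wait forever
    # if the drop happened between two polls.
    saw_dip = False
    deadline = clock() + dip_timeout
    while clock() < deadline:
        if get_load_pct() < 100.0:
            saw_dip = True
            break
        sleep(interval)

    # Phase 3: wait for loadstatus to return to 100%.
    while get_load_pct() < 100.0:
        sleep(interval)

    # Without an observed drop we cannot tell "loaded" from "not yet noticed".
    return "loaded" if saw_dip else "loaded_unconfirmed"
```

A call might look like `wait_until_queryable(lambda: task_state(overlord_url, task_id), lambda: load_pct(coordinator_url, "my_datasource"))`, where both URLs and the datasource name are placeholders. The "loaded_unconfirmed" branch is exactly the case we cannot resolve today.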
It would be great to have a reliable way of knowing when ingested data is queryable. Thanks!