[Proposal] Support new and different segment types

Currently, Druid has its own persistence format, which was designed to handle structured data in the form of dimensions and metrics.  It would be nice to make it more straightforward for Druid to handle different structures, persistence formats and even functionality, all as extensions to core.  This proposal tries to move us in that direction.

There are three basic areas that require attention in order to do this:
1. Ingestion
2. Hand-off
3. Querying

** Ingestion

The ingestion side of this is already handled by the recent `Appenderator` changes, so I do not believe any noteworthy changes would be required in the immediate future.  That said, not all ingestion mechanisms leverage `Appenderator`s yet, so in order to get the capabilities enabled by this proposal, there will be work needed to implement all mechanisms of ingestion in terms of `Appenderator` objects. 

** Hand-off

Hand-off consists of two things: persisting on the ingestion side and deserializing on the historical (read) side.  `Appenderator`s already own the persisting process on the ingestion side, so once again, I do not believe any noteworthy changes are required for that half of the story.

On the deserialization end of the spectrum, though, there will be changes required to enable these formats to be added as extensions.  Namely, the current `SegmentLoader` interface is implemented by `SegmentLoaderLocalCacheManager` as a two-step algorithm:

```
public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1: download and unzip files
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles)); // step 2
}
```

We need to change the algorithm of step 2 to be something that can be extended, which means a Guice or Jackson touch-point.  I propose that we make it a Jackson touch-point.  Specifically, we should add a file to the zip that is a JSON descriptor for the factory that should be used to deserialize the files.  This would result in the following implementation instead:

```
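public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1: download and unzip files

  // step 2: use the factory described in the zip, falling back to the legacy one
  SegmentFactory factory = legacyFactory;
  File factoryJson = new File(segmentFiles, "factory.json");
  if (factoryJson.exists()) {
    try {
      factory = objectMapper.readValue(factoryJson, SegmentFactory.class);
    }
    catch (IOException e) {
      throw new SegmentLoadingException(e, "Failed to read %s", factoryJson);
    }
  }

  return factory.factorize(segment, segmentFiles);
}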
```

Where the interface for `SegmentFactory` is:

```
public interface SegmentFactory {
  Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException;
}
```

And `legacyFactory` is an implementation of `SegmentFactory` that is just:

```
public Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException {
  // `factory` here is the same QueryableIndex factory used in step 2 of the current loader
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles));
}
```
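The selection logic above can be tried out without any Druid or Jackson dependency. The following is only a rough sketch under stated assumptions: `LoadedSegment`, `LoadedSegmentFactory` and `SketchSegmentLoader` are hypothetical stand-ins for `Segment`, `SegmentFactory` and the loader, a `String` id stands in for `DataSegment`, and a type-name registry plus a crude regex stand in for what Jackson's polymorphic deserialization would do in the real thing. The `"type"` field name in `factory.json` is likewise an assumption, not something this proposal fixes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Druid's Segment; only the identifier is modeled.
interface LoadedSegment {
  String getIdentifier();
}

// Mirrors the proposed SegmentFactory, with a String id standing in for DataSegment.
interface LoadedSegmentFactory {
  LoadedSegment factorize(String segmentId, File segmentFiles) throws IOException;
}

final class SketchSegmentLoader {
  // Crude extraction of an assumed "type" field; real code would let Jackson do this.
  private static final Pattern TYPE_FIELD = Pattern.compile("\"type\"\\s*:\\s*\"([^\"]+)\"");

  private final Map<String, LoadedSegmentFactory> factoriesByType = new HashMap<>();
  private final LoadedSegmentFactory legacyFactory;

  SketchSegmentLoader(LoadedSegmentFactory legacyFactory) {
    this.legacyFactory = legacyFactory;
  }

  // Assumed mechanism: an extension registers its factory under a type name.
  void register(String typeName, LoadedSegmentFactory factory) {
    factoriesByType.put(typeName, factory);
  }

  LoadedSegment getSegment(String segmentId, File segmentFiles) throws IOException {
    LoadedSegmentFactory factory = legacyFactory; // used when no descriptor is present
    File factoryJson = new File(segmentFiles, "factory.json");
    if (factoryJson.exists()) {
      String json = new String(Files.readAllBytes(factoryJson.toPath()), StandardCharsets.UTF_8);
      Matcher matcher = TYPE_FIELD.matcher(json);
      if (!matcher.find()) {
        throw new IOException("factory.json has no \"type\" field");
      }
      factory = factoriesByType.get(matcher.group(1));
      if (factory == null) {
        throw new IOException("No factory registered for type " + matcher.group(1));
      }
    }
    return factory.factorize(segmentId, segmentFiles);
  }
}

final class SegmentLoaderSketchDemo {
  public static void main(String[] args) throws IOException {
    File dir = Files.createTempDirectory("segment").toFile();
    SketchSegmentLoader loader = new SketchSegmentLoader((id, files) -> () -> "legacy:" + id);
    loader.register("custom", (id, files) -> () -> "custom:" + id);

    // No descriptor yet, so the legacy factory is used.
    System.out.println(loader.getSegment("s1", dir).getIdentifier());

    // Once a descriptor is present, the factory it names is used instead.
    Files.write(new File(dir, "factory.json").toPath(),
        "{\"type\": \"custom\"}".getBytes(StandardCharsets.UTF_8));
    System.out.println(loader.getSegment("s1", dir).getIdentifier());
  }
}
```

The point of the sketch is the fallback: segments written before the change carry no descriptor and keep loading through the legacy path, so existing data would not need to be rewritten.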

** Querying

Different segment persistence types can expose and enable new and different functionality in varying ways.  We want to enable queries that can take advantage of the specific benefits of any given persistence type while also providing methods to connect into functionality already implemented.  We can enable this with a relatively simple interface change.  Currently, the `Segment` interface is:

```
public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();
}
```

That is, it has two methods that allow you to "convert" the `Segment` into an object with specific semantics that queries know how to deal with:

```
  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();
```

I propose that we change the interface to be:

```
public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  <T> T as(Class<T> clazz);
}
```

This essentially means that all of the places that currently call `asQueryableIndex()` would be able to call `as(QueryableIndex.class)` instead.  This interface _does_ potentially have external touch-points, in that if anybody has implemented their own Query, they might be calling either `asQueryableIndex()` or `asStorageAdapter()` already.  So, there is a decision to be made on whether we make the change backwards compatible from the beginning and then ultimately remove the two superfluous methods later, or if we make the backwards-incompatible change now.
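As a minimal sketch of how an implementation and a caller might use the proposed `as(Class<T>)` method: the types below (`SketchQueryableIndex`, `SketchStorageAdapter`, `ViewableSegment`, `IndexBackedSegment`, `RowCounter`) are hypothetical stand-ins rather than Druid classes, and returning `null` for an unsupported view is an assumption, since the proposal does not say how that case should be signalled.

```java
// Hypothetical stand-ins for Druid's QueryableIndex and StorageAdapter.
interface SketchQueryableIndex {
  int getNumRows();
}

interface SketchStorageAdapter {
  int getNumRows();
}

// The proposed shape of Segment, minus Closeable and the interval for brevity.
interface ViewableSegment {
  String getIdentifier();

  <T> T as(Class<T> clazz);
}

final class IndexBackedSegment implements ViewableSegment {
  private final String id;
  private final SketchQueryableIndex index;

  IndexBackedSegment(String id, SketchQueryableIndex index) {
    this.id = id;
    this.index = index;
  }

  @Override
  public String getIdentifier() {
    return id;
  }

  @Override
  public <T> T as(Class<T> clazz) {
    if (clazz.equals(SketchQueryableIndex.class)) {
      return clazz.cast(index);
    }
    if (clazz.equals(SketchStorageAdapter.class)) {
      // Adapt the index on demand; Class.cast keeps the conversion type-safe.
      return clazz.cast((SketchStorageAdapter) index::getNumRows);
    }
    return null; // assumption: null means "this segment cannot be viewed that way"
  }
}

final class RowCounter {
  // A caller that previously used asQueryableIndex() would now do this.
  static int countRows(ViewableSegment segment) {
    SketchQueryableIndex index = segment.as(SketchQueryableIndex.class);
    if (index == null) {
      throw new UnsupportedOperationException(
          "Segment " + segment.getIdentifier() + " does not expose a queryable index");
    }
    return index.getNumRows();
  }

  public static void main(String[] args) {
    ViewableSegment segment = new IndexBackedSegment("s1", () -> 42);
    System.out.println(countRows(segment));
  }
}
```

An extension could answer `as()` for an interface of its own in the same way, and a query written for that format would ask for that interface first and fall back to one of the common views when it gets nothing back.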

I think I'm a fan of making the backwards incompatible change now, because that should limit how many people are adversely affected by it.  If we make it later, then everyone who has implemented their own storage format must update their implementation to not have the methods anymore.

Then again, if all we are doing is removing methods from an interface in the future, I _think_ that would be a compile-time incompatibility, but not a runtime incompatibility.  I.e. after we remove the methods, I _think_ it's still possible for an implementation that was compiled assuming those methods exist to continue to work.  But, I'm not sure about that.  If that's the case, then the backwards-incompatible removal of the methods from the interface is actually a relatively low-risk change and only potentially adversely affects people who have implemented their own queries in terms of the removed methods.
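For comparison, the backwards-compatible option weighed above could look roughly like the sketch below. It assumes a Java 8 runtime so that the two old methods can live on as deprecated `default` methods that delegate to `as()`; that is an assumption of this example, not something the proposal requires. `LegacyQueryableIndex`, `LegacyStorageAdapter`, `CompatSegment` and `NewFormatSegment` are hypothetical names used only for illustration.

```java
// Hypothetical stand-ins for Druid's QueryableIndex and StorageAdapter.
interface LegacyQueryableIndex {
}

interface LegacyStorageAdapter {
}

interface CompatSegment {
  String getIdentifier();

  <T> T as(Class<T> clazz);

  // Kept so that existing callers keep compiling; slated for later removal.
  @Deprecated
  default LegacyQueryableIndex asQueryableIndex() {
    return as(LegacyQueryableIndex.class);
  }

  @Deprecated
  default LegacyStorageAdapter asStorageAdapter() {
    return as(LegacyStorageAdapter.class);
  }
}

// A new storage format only has to implement as(); the old methods come for free.
final class NewFormatSegment implements CompatSegment {
  private final LegacyQueryableIndex index = new LegacyQueryableIndex() { };

  @Override
  public String getIdentifier() {
    return "new-format";
  }

  @Override
  public <T> T as(Class<T> clazz) {
    // Assumption: null signals a view this segment does not support.
    return clazz.isInstance(index) ? clazz.cast(index) : null;
  }
}

final class CompatSegmentDemo {
  public static void main(String[] args) {
    CompatSegment segment = new NewFormatSegment();
    // An old-style call and a new-style call reach the same object.
    System.out.println(segment.asQueryableIndex() == segment.as(LegacyQueryableIndex.class));
    // This format offers no storage adapter, so the old method yields null.
    System.out.println(segment.asStorageAdapter() == null);
  }
}
```

With this shape, new storage formats never implement the old methods themselves, so removing the two defaults later would only touch callers that still use them.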

I believe that these changes are sufficient to enable extensions to create their own persistence formats and even expose their own interfaces for querying, if required to access some specific properties of the new persistence format.  This could be leveraged for many different uses, from building connectors to ORC or Parquet-based data to trying something completely new and different while still leveraging Druid's ingestion, segment-management and query-routing functionality.

In general, this will also improve the staying power of the Druid system, as it will enable us to switch persistence formats when and if it is determined that the format implemented in the way-back-when is not keeping up with the current needs and trends of the space.

