[Proposal] Support new and different segment types #2965

@cheddar

Currently, Druid has its own persistence format which was designed to handle structured data in the form of dimensions and metrics. It would be nice to expand Druid to be more straightforward in handling different structures, persistence formats and even functionality all as extensions to core. This proposal tries to move us in that direction.

There are three basic areas that require attention in order to do this:

  1. Ingestion
  2. Hand-off
  3. Querying

** Ingestion

The ingestion side of this is already handled by the recent Appenderator changes, so I do not believe any noteworthy changes would be required in the immediate future. That said, not all ingestion mechanisms leverage Appenderators yet, so in order to get the capabilities enabled by this proposal, there will be work needed to implement all mechanisms of ingestion in terms of Appenderator objects.

** Hand-off

Hand-off consists of two things: persisting on the ingestion side and deserializing on the historical (read) side. Appenderators already own the persisting process on the ingestion side, so once again, I do not believe any noteworthy changes are required for that half of the story.

On the deserialization end of the spectrum, though, there will be changes required to enable these formats to be added as extensions. Namely, the current SegmentLoader interface is implemented by SegmentLoaderLocalCacheManager as a two-step algorithm:

public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1: download and unzip files
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles)); // step 2
}

We need to change step 2 to be something that can be extended, which means a Guice or Jackson touch-point. I propose that we make it a Jackson touch-point. Specifically, we should add a file to the zip that is a JSON descriptor for the factory that should be used to deserialize the files. That essentially results in this implementation instead:

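public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1: download and unzip files

  // step 2: use the factory described in the zip, falling back to the legacy format
  SegmentFactory factory = legacyFactory;
  File factoryJson = new File(segmentFiles, "factory.json");
  if (factoryJson.exists()) {
    try {
      factory = objectMapper.readValue(factoryJson, SegmentFactory.class);
    }
    catch (IOException e) {
      throw new SegmentLoadingException(e, "Failed to read %s", factoryJson);
    }
  }

  return factory.factorize(segment, segmentFiles);
}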

Where the interface for SegmentFactory is

public interface SegmentFactory {
  Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException;
}

And legacyFactory is an implementation of SegmentFactory that is just

public Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException {
  // "factory" is the same index factory that the current step 2 uses
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles));
}
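
To make the fallback behaviour concrete, here is a minimal, dependency-free sketch of the lookup described above. It is not the proposed Druid code: SketchSegment, SketchSegmentFactory, SketchSegmentLoader, the "parquet" type name and the registry map are all hypothetical stand-ins, a String id replaces DataSegment, and a naive regex on a "type" field stands in for the polymorphic deserialization that Jackson would perform on factory.json.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Segment; only the identifier matters here.
interface SketchSegment {
  String getIdentifier();
}

// Mirrors the proposed SegmentFactory, with a String id in place of DataSegment.
interface SketchSegmentFactory {
  SketchSegment factorize(String segmentId, Path segmentDir) throws IOException;
}

final class SketchSegmentLoader {
  // Assumption: the descriptor carries a "type" field, as a Jackson-typed object might.
  private static final Pattern TYPE_FIELD = Pattern.compile("\"type\"\\s*:\\s*\"([^\"]+)\"");

  private final SketchSegmentFactory legacyFactory;
  // Hypothetical registry; with Jackson, registered subtypes would play this role.
  private final Map<String, SketchSegmentFactory> factoriesByType;

  SketchSegmentLoader(SketchSegmentFactory legacyFactory, Map<String, SketchSegmentFactory> factoriesByType) {
    this.legacyFactory = legacyFactory;
    this.factoriesByType = factoriesByType;
  }

  SketchSegment getSegment(String segmentId, Path segmentDir) throws IOException {
    SketchSegmentFactory factory = legacyFactory; // default when no descriptor is present
    Path descriptor = segmentDir.resolve("factory.json");
    if (Files.exists(descriptor)) {
      String json = new String(Files.readAllBytes(descriptor), StandardCharsets.UTF_8);
      Matcher m = TYPE_FIELD.matcher(json);
      if (!m.find() || !factoriesByType.containsKey(m.group(1))) {
        throw new IOException("Unknown or missing factory type in " + descriptor);
      }
      factory = factoriesByType.get(m.group(1));
    }
    return factory.factorize(segmentId, segmentDir);
  }
}

final class SegmentLoaderDemo {
  // Builds a throwaway segment directory, optionally with factory.json, and loads it.
  static String load(String descriptorJsonOrNull) {
    try {
      Path dir = Files.createTempDirectory("segment");
      if (descriptorJsonOrNull != null) {
        Files.write(dir.resolve("factory.json"), descriptorJsonOrNull.getBytes(StandardCharsets.UTF_8));
      }
      Map<String, SketchSegmentFactory> registry = new HashMap<>();
      registry.put("parquet", (id, d) -> () -> "parquet:" + id);
      SketchSegmentLoader loader = new SketchSegmentLoader((id, d) -> () -> "legacy:" + id, registry);
      return loader.getSegment("seg-1", dir).getIdentifier();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(load(null));                      // prints "legacy:seg-1"
    System.out.println(load("{\"type\": \"parquet\"}")); // prints "parquet:seg-1"
  }
}
```

The point of the sketch is that segments written before this change carry no factory.json and keep loading through the legacy path, while a new format only has to ship a descriptor plus a factory that the loader can resolve.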

** Querying

Different segment persistence types can expose and enable new and different functionality in varying ways. We want to enable queries that can take advantage of the specific benefits of any given persistence type while also providing methods to connect into functionality already implemented. We can enable this with a relatively simple interface change. Currently, the Segment interface is

public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();
}

I.e. it has two methods that allow you to "convert" the Segment into an object with specific semantics that queries know how to deal with:

  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();

I propose that we change the interface to be

public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  <T> T as(Class<T> clazz);
}

This essentially means that all of the places that currently call asQueryableIndex() would be able to call as(QueryableIndex.class) instead. This interface does potentially have external touch-points in that if anybody has implemented their own Query, they might be calling either asQueryableIndex() or asStorageAdapter() already. So, there is a decision to be made on whether we make the change backwards compatible from the beginning and then ultimately remove the two superfluous methods later or if we make the backwards incompatible change now.
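
As a rough illustration of what an implementation and a migrated call site could look like, here is a self-contained sketch. The QueryableIndex and StorageAdapter types below are one-method stand-ins rather than the real Druid interfaces, getDataInterval() is left out to avoid the Joda dependency, LegacyStyleSegment and SegmentAsDemo are hypothetical names, and returning null for an unsupported class is an assumed contract that the proposal does not specify.

```java
import java.io.Closeable;

// One-method stand-ins for the Druid types named in the proposal.
interface QueryableIndex {
  int getNumRows();
}

interface StorageAdapter {
  int getNumRows();
}

interface Segment extends Closeable {
  String getIdentifier();

  // Assumed contract: returns null when the segment cannot be viewed as clazz.
  <T> T as(Class<T> clazz);
}

// Hypothetical segment that supports the two views the current interface exposes.
final class LegacyStyleSegment implements Segment {
  private final String id;
  private final QueryableIndex index;

  LegacyStyleSegment(String id, QueryableIndex index) {
    this.id = id;
    this.index = index;
  }

  @Override
  public String getIdentifier() {
    return id;
  }

  @Override
  public <T> T as(Class<T> clazz) {
    if (clazz == QueryableIndex.class) {
      return clazz.cast(index);
    }
    if (clazz == StorageAdapter.class) {
      StorageAdapter adapter = index::getNumRows; // adapter view backed by the index
      return clazz.cast(adapter);
    }
    return null; // unsupported view
  }

  @Override
  public void close() {
    // nothing to release in this sketch
  }
}

final class SegmentAsDemo {
  // A call site that previously would have used segment.asQueryableIndex().
  static int rowCount(Segment segment) {
    QueryableIndex index = segment.as(QueryableIndex.class);
    if (index == null) {
      throw new IllegalStateException("Segment " + segment.getIdentifier() + " has no QueryableIndex view");
    }
    return index.getNumRows();
  }

  public static void main(String[] args) {
    Segment segment = new LegacyStyleSegment("seg-1", () -> 42);
    System.out.println(rowCount(segment)); // prints 42
  }
}
```

Using clazz.cast keeps the implementation free of unchecked casts, and callers must now handle the case where a segment does not offer the view they ask for.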

I think I'm a fan of making the backwards incompatible change now, because that should limit how many people are adversely affected by it. If we make it later, then everyone who has implemented their own storage format must update their implementation to not have the methods anymore.

Then again, if all we are doing is removing methods from an interface in the future, I think that would be a compile-time incompatibility, but not a runtime incompatibility. I.e. after we remove the methods, I think it's still possible for an implementation that was compiled assuming those methods exist to continue to work. But, I'm not sure about that. If that's the case, then the backwards-incompatible removal of the methods from the interface is actually a relatively low-risk change and only potentially adversely affects people who have implemented their own queries in terms of the removed methods.

I believe that these changes are sufficient to enable extensions to create their own persistence formats and even expose their own interfaces for querying, if required to access some specific properties of the new persistence format. This could be leveraged for many different uses, from building connectors to ORC or Parquet-based data to trying something completely new and different while still leveraging Druid's ingestion, segment-management and query-routing functionality.
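
One way an extension might expose its own query interface while existing queries keep working is sketched below. Everything here is hypothetical and chosen only for illustration: ExtSegment is a cut-down version of the proposed interface, RowScanner stands in for a generic view such as StorageAdapter, and PrecomputedSum is an invented capability that only the new format offers.

```java
import java.util.Arrays;
import java.util.List;

// Cut-down version of the proposed Segment interface.
interface ExtSegment {
  <T> T as(Class<T> clazz);
}

// Generic view that every format in this sketch supports.
interface RowScanner {
  List<Long> values();
}

// Hypothetical capability that only the new format offers.
interface PrecomputedSum {
  long sum();
}

// An existing-style segment: it can only be scanned.
final class ScanOnlySegment implements ExtSegment {
  private final List<Long> rows;

  ScanOnlySegment(List<Long> rows) {
    this.rows = rows;
  }

  @Override
  public <T> T as(Class<T> clazz) {
    if (clazz == RowScanner.class) {
      RowScanner scanner = () -> rows;
      return clazz.cast(scanner);
    }
    return null;
  }
}

// A new format that stores an aggregate alongside the rows.
final class SummarizedSegment implements ExtSegment {
  private final List<Long> rows;
  private final long total;

  SummarizedSegment(List<Long> rows) {
    this.rows = rows;
    long t = 0;
    for (long v : rows) {
      t += v;
    }
    this.total = t; // computed once, as if at persist time
  }

  @Override
  public <T> T as(Class<T> clazz) {
    if (clazz == PrecomputedSum.class) {
      PrecomputedSum fast = () -> total;
      return clazz.cast(fast);
    }
    if (clazz == RowScanner.class) {
      RowScanner scanner = () -> rows;
      return clazz.cast(scanner);
    }
    return null;
  }
}

final class SumQuery {
  // Prefers the format-specific view and falls back to the generic scan.
  static String run(ExtSegment segment) {
    PrecomputedSum fast = segment.as(PrecomputedSum.class);
    if (fast != null) {
      return "precomputed:" + fast.sum();
    }
    RowScanner scanner = segment.as(RowScanner.class);
    if (scanner == null) {
      throw new UnsupportedOperationException("segment supports neither view");
    }
    long total = 0;
    for (long v : scanner.values()) {
      total += v;
    }
    return "scanned:" + total;
  }

  public static void main(String[] args) {
    List<Long> rows = Arrays.asList(1L, 2L, 3L);
    System.out.println(run(new ScanOnlySegment(rows)));   // prints "scanned:6"
    System.out.println(run(new SummarizedSegment(rows))); // prints "precomputed:6"
  }
}
```

The query never needs to know which concrete segment class it was handed, which is what lets a new format and the queries that exploit it ship together as an extension.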

In general, this will also improve the staying power of the Druid system as it will enable us to switch persistence formats when and if it is determined that the format implemented in the way-back-when is not keeping up with the current needs and trends of the space.
