Motivation
In auto compaction, the coordinator searches for segments to compact based on their byte size. This algorithm is currently stateless: at each coordinator run, it traverses all segments of all datasources from the latest to the oldest, compares their size against targetCompactionSizeBytes (this check is currently missing, which causes #8481), and issues a compaction task if it finds segments smaller than targetCompactionSizeBytes.
However, comparing the segment size against targetCompactionSizeBytes alone is not enough to tell whether a given segment requires further compaction, because the segment could have been created with various types of partitionsSpec and later compacted with one of them. (As of now, auto compaction supports only maxRowsPerSegment, maxTotalRows, and targetCompactionSizeBytes, but it should support all partitionsSpec types in the future.)
As of now, we have 3 partitionsSpec types: DynamicPartitionsSpec, HashedPartitionsSpec, and SingleDimensionPartitionsSpec.
- DynamicPartitionsSpec has maxRowsPerSegment and maxTotalRows.
- HashedPartitionsSpec has targetPartitionSize (target number of rows per segment), numShards, and partitionDimensions.
- SingleDimensionPartitionsSpec has targetPartitionSize (target number of rows per segment), maxPartitionSize (max number of rows per segment), and partitionDimensions.
In the coordinator, most of these configurations are not easy to use to search for segments which need compaction, because the number of rows in each segment is not available in the coordinator. And even if we had that in the coordinator, hash- or range-partitioned segments could have a totally different number of rows from targetPartitionSize, which makes it hard to tell whether a given segment needs compaction. Note that a segment doesn't need further compaction if it was already compacted with the same partitionsSpec. Auto compaction based on parallel indexing can also complicate things: in parallel indexing with DynamicPartitionsSpec, the last segment created by each task can have fewer rows than maxRowsPerSegment.
Proposed changes
To address this issue, I propose to store the state of auto compaction in the metadata store, so that auto compaction can search for compaction candidates from the segments which are not compacted yet.
DataSegment change
A new compactionPartitionsSpec field will be added to DataSegment. compactionPartitionsSpec will be populated only in the coordinator.
```java
public class DataSegment
{
  private final Integer binaryVersion;
  private final SegmentId id;
  @Nullable
  private final Map<String, Object> loadSpec;
  private final List<String> dimensions;
  private final List<String> metrics;
  private final ShardSpec shardSpec;
  @Nullable
  private final PartitionsSpec compactionPartitionsSpec;
  private final long size;
}
```
Changes in publishing segments
When a compaction task creates segments, it adds its partitionsSpec to them (in Appenderator.add()). Other task types don't add it.
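The publish-time behavior can be sketched as below. This is a minimal, self-contained illustration, not Druid's actual Appenderator API; SegmentMeta and publish() are hypothetical stand-ins.

```java
// Hypothetical sketch: only the compaction task carries its partitionsSpec
// through to the published segment; other task types leave the field null,
// which later marks the segment as a compaction candidate for the coordinator.
public class PublishSketch
{
  // Stand-in for the segment metadata written to the metadata store.
  record SegmentMeta(String id, String compactionPartitionsSpec) {}

  static SegmentMeta publish(String segmentId, boolean isCompactionTask, String taskPartitionsSpec)
  {
    // Non-compaction tasks publish segments with a null compactionPartitionsSpec.
    return new SegmentMeta(segmentId, isCompactionTask ? taskPartitionsSpec : null);
  }

  public static void main(String[] args)
  {
    System.out.println(publish("seg1", true, "dynamic").compactionPartitionsSpec());  // prints "dynamic"
    System.out.println(publish("seg2", false, "dynamic").compactionPartitionsSpec()); // prints "null"
  }
}
```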
Changes in the coordinator
SQLMetadataSegmentManager loads DataSegments and keeps them in memory, as it does now. Since PartitionsSpec values are not very diverse in general, interning them would be useful to reduce memory usage.
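The interning idea can be sketched with a canonicalizing map. DynamicSpec and SpecInterner below are hypothetical stand-ins for the real PartitionsSpec types; in practice, equal specs loaded for many segments would all point at one canonical instance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal stand-in for one PartitionsSpec type; records give us
// the value-based equals/hashCode that interning relies on.
record DynamicSpec(int maxRowsPerSegment, long maxTotalRows) {}

class SpecInterner
{
  private final Map<DynamicSpec, DynamicSpec> canonical = new ConcurrentHashMap<>();

  // Returns the canonical instance for any spec equal to a previously seen one.
  DynamicSpec intern(DynamicSpec spec)
  {
    return canonical.computeIfAbsent(spec, s -> s);
  }
}

public class InternExample
{
  public static void main(String[] args)
  {
    SpecInterner interner = new SpecInterner();
    DynamicSpec a = interner.intern(new DynamicSpec(5_000_000, 20_000_000L));
    DynamicSpec b = interner.intern(new DynamicSpec(5_000_000, 20_000_000L));
    // Equal specs resolve to the same canonical object, so memory is shared.
    System.out.println(a == b); // prints "true"
  }
}
```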
DruidCoordinatorSegmentCompactor checks whether the compactionPartitionsSpec is available for a given segment. If it's missing, the segment must be a new segment created by a non-compaction task, so it is a compaction candidate. If it's present, the coordinator compares the compactionPartitionsSpec with the partitionsSpec in the auto compaction configuration; the segment is a compaction candidate if the two differ.
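The candidate check described above reduces to a small predicate. The sketch below uses plain Objects for the specs rather than Druid's PartitionsSpec interface, and needsCompaction is a hypothetical name, not an actual coordinator method.

```java
import java.util.Objects;

public class CompactionCandidateCheck
{
  /**
   * A segment is a compaction candidate when it carries no
   * compactionPartitionsSpec (it was written by a non-compaction task), or
   * when the spec it was compacted with differs from the configured one.
   */
  static boolean needsCompaction(Object compactionPartitionsSpec, Object configuredSpec)
  {
    return compactionPartitionsSpec == null
           || !Objects.equals(compactionPartitionsSpec, configuredSpec);
  }

  public static void main(String[] args)
  {
    String configured = "dynamic(maxRowsPerSegment=5000000)";
    System.out.println(needsCompaction(null, configured));                   // prints "true": fresh segment
    System.out.println(needsCompaction("hashed(numShards=10)", configured)); // prints "true": different spec
    System.out.println(needsCompaction(configured, configured));             // prints "false": already compacted
  }
}
```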
Rationale
Dropped alternative
The coordinator could get the number of rows of each segment from the system schema to find compaction candidates. However, as mentioned in the motivation section, the number of rows alone is not enough to determine whether a given segment needs compaction.
Operational impact
There should be no operational impact.
Test plan
- Will add unit tests
- Will test in our internal cluster
Future work
In DataSegment, similar to compactionPartitionsSpec, there are a couple of fields which are loaded only in particular places and are null or some default value elsewhere. But even when they are null or default, each still costs 8 more bytes per segment. It would probably be worth splitting DataSegment into several classes, so that only the necessary fields are loaded and memory usage is reduced.