Today the column size for a given segment is limited to ~2GB because of the ByteBuffer Integer.MAX_VALUE limitation. At Flurry we started hitting this limit for complex columns like Theta sketches.
Proposal:
To work around this limit, we would like to propose splitting a column file into multiple column files when it exceeds the max limit. We will store metadata for the chunks while writing the column, and while reading we will use this metadata to reconstruct the column.
Keeping in mind backward compatibility, here are the changes:
- Version 3 of GenericIndexed will be introduced. This version stores the following new fields:
  - Number of file splits
  - Number of rows per split, stored as a power of 2 (all splits, except the “last” one, will contain the same number of rows). The power-of-2 constraint means a reader can locate any row with a shift and a mask, as sketched below.
  Columns that fit within the current 2GB limit will continue to use version 1.
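To make that concrete, here is a minimal, self-contained sketch (hypothetical names, not the final on-disk format or API) of how a version 3 reader could map a global row number to a split file and an offset within it:

public class SplitLookupSketch
{
  public static void main(String[] args)
  {
    final int logRowsPerSplit = 20;   // v3 header field: 2^20 rows in every split but the last
    final int rowNum = 5_000_000;     // global row index within the column

    // Because rows-per-split is a power of 2, locating a row is a shift and a mask.
    final int splitNum = rowNum >>> logRowsPerSplit;              // which split file
    final int rowInSplit = rowNum & ((1 << logRowsPerSplit) - 1); // offset inside that split

    System.out.println("row " + rowNum + " -> split " + splitNum + ", row " + rowInSplit);
  }
}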
- In order to do this, the serialization logic needs access to the FileSmoosher while serializing, so we propose that GenericIndexedWriter.writeToChannel goes
FROM
public void writeToChannel(WritableByteChannel channel)
TO
public void writeToChannel(WritableByteChannel channel, FileSmoosher smoosher)
Passing null for the FileSmoosher is considered valid and simply means that version 3 will not be an option, as sketched below.
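For illustration, a rough sketch of how the dispatch could look (the Smoosher interface and the write helpers here are stand-ins, not Druid's actual API):

import java.io.IOException;
import java.nio.channels.WritableByteChannel;

abstract class GenericIndexedWriterSketch
{
  interface Smoosher {}

  abstract long serializedSize();
  abstract void writeVersionOne(WritableByteChannel channel) throws IOException;
  abstract void writeVersionThree(WritableByteChannel channel, Smoosher smoosher) throws IOException;

  public void writeToChannel(WritableByteChannel channel, Smoosher smoosher) throws IOException
  {
    // A null smoosher is valid: it simply rules out the multi-file version 3 layout.
    if (smoosher == null || serializedSize() <= Integer.MAX_VALUE) {
      writeVersionOne(channel);             // fits in one file: keep the version 1 format
    } else {
      writeVersionThree(channel, smoosher); // header to 'channel', split files via the smoosher
    }
  }
}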
One of the places that calls GenericIndexedWriter.writeToChannel is GenericColumnSerializer.writeToChannel, which will require the same adjustment to its interface. So,
FROM
public void GenericColumnSerializer.writeToChannel(WritableByteChannel channel)
TO
public void GenericColumnSerializer.writeToChannel(WritableByteChannel channel, FileSmoosher smoosher)
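The change is mostly mechanical plumbing; e.g., a serializer implementation would simply thread the smoosher through to its underlying writer (a hypothetical sketch, with the field name writer being illustrative):

@Override
public void writeToChannel(WritableByteChannel channel, FileSmoosher smoosher) throws IOException
{
  // Forward the smoosher so the underlying GenericIndexedWriter can split if needed.
  writer.writeToChannel(channel, smoosher);
}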
Presently, complex columns are serialized using ComplexColumnSerializer, which is generic over all complex columns and requires that serialization be done via the ObjectStrategy. This has two implications:
- It makes it hard to support both versions at the same time
- It makes it impossible for a ComplexColumnSerde to control how it is serialized, sometimes resulting in sub-optimal disk layouts.
Therefore, we propose adding a new method to ComplexMetricSerde:
public GenericColumnSerializer getSerializer(IOPeon peon, String column)
{
  return new ComplexColumnSerializer(peon, column, this);
}
This means that, in order to go beyond the 2GB limit, ComplexMetricSerde implementations will need to provide their own implementation of this method. We will add a new class, ComplexColumnSerializerV2, to aid implementations that want to use the new functionality. It will be usable through an incantation like
@Override
public GenericColumnSerializer getSerializer(IOPeon peon, String metric)
{
  return ComplexColumnSerializerV2.createWithColumnSize(peon, metric, this.getObjectStrategy(), Integer.MAX_VALUE);
}
- Given how the code is currently written, the above changes will result in a FileSmoosher object being passed from IndexMerger to GenericIndexedWriter while it (the FileSmoosher) already has an OutputStream open.
The writeToChannel implementation is expected to create more files using the FileSmoosher, and will likely close them before the already-opened OutputStream has been closed. This means that FileSmoosher will also need to be updated to handle this sort of usage.
We propose to support this by having the FileSmoosher detect when it already has an OutputStream open and redirect newly opened OutputStreams to new files on the file system. When any of the open OutputStream objects are closed, they will also check whether any of the other files have been closed in the meantime. Any OutputStreams that have been closed will be copied onto the main smoosh file, and the extra underlying file will be cleaned up. This has the downside of introducing an extra copy of these files, but given that the strategy will only be used when absolutely necessary, it shouldn't result in any noticeable performance degradation during indexing. A toy model of this behavior is sketched below.
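The following self-contained toy model (not Druid's actual FileSmoosher; all names here are illustrative) shows the redirect-and-merge idea under those assumptions:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class SmoosherModel
{
  private final OutputStream mainOut;                      // the already-open smoosh output
  private final List<File> pendingSideFiles = new ArrayList<>();

  public SmoosherModel(File mainFile) throws IOException
  {
    this.mainOut = new FileOutputStream(mainFile, true);
  }

  // While mainOut is open, additional writers are redirected to side files.
  public OutputStream openSideWriter(String name) throws IOException
  {
    final File side = File.createTempFile(name, ".side");
    return new FileOutputStream(side)
    {
      @Override
      public void close() throws IOException
      {
        super.close();
        pendingSideFiles.add(side);  // mark this side file as finished
        mergeClosedSideFiles();      // fold any finished side files back in
      }
    };
  }

  // Copy each finished side file onto the main smoosh file, then delete it.
  // This is the extra copy the proposal accepts as a trade-off.
  private void mergeClosedSideFiles() throws IOException
  {
    for (File side : pendingSideFiles) {
      Files.copy(side.toPath(), mainOut);
      side.delete();
    }
    pendingSideFiles.clear();
  }
}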