
feat: Add configurable truncation for string columns #19146

Merged
aho135 merged 7 commits into apache:master from jaykanakiya:string-truncation
Mar 20, 2026

Contributor

@jaykanakiya commented Mar 12, 2026

Summary

Adds a configurable maximum string length for string dimension columns. Strings exceeding the limit are truncated during ingestion.

  • Global config: druid.indexing.formats.maxStringLength
  • Per-dimension override: maxStringLength field in the dimension spec

Release note

Added a new maxStringLength configuration for string dimensions that truncates values exceeding the specified length during ingestion. Can be set globally via druid.indexing.formats.maxStringLength or per-dimension in the ingestion spec.
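As a rough illustration of how the two settings described above could interact, here is a minimal Java sketch. The class and method names (`MaxStringLengthSketch`, `effectiveLimit`, `truncate`) are hypothetical and not taken from this PR. It assumes that a per-dimension value, when present, takes precedence over the global one, that 0 means no truncation (as documented at the time of this PR), and that length is counted in UTF-16 code units via `String.length()`.

```java
final class MaxStringLengthSketch
{
  private MaxStringLengthSketch() {}

  /** Per-dimension value wins when present; otherwise fall back to the global value. */
  static int effectiveLimit(Integer perDimensionLimit, int globalLimit)
  {
    return perDimensionLimit != null ? perDimensionLimit : globalLimit;
  }

  /**
   * Truncates value to at most limit chars.
   * Assumption: a limit of 0 (or less) is treated as "disabled".
   */
  static String truncate(String value, int limit)
  {
    if (value == null || limit <= 0 || value.length() <= limit) {
      return value;
    }
    // Note: substring counts UTF-16 code units, so a surrogate pair could be split.
    return value.substring(0, limit);
  }
}
```

For example, `truncate("abcdef", effectiveLimit(3, 5))` would use the per-dimension limit of 3, while `truncate("abcdef", effectiveLimit(null, 0))` would leave the value untouched.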


Key changed/added classes in this PR
  • DefaultColumnFormatConfig
  • StringDimensionSchema
  • StringDimensionHandler
  • StringDimensionIndexer

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added or updated version, license, or notice information in licenses.yaml
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment on lines +54 to +55
@JsonProperty("createBitmapIndex") Boolean createBitmapIndex,
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength
Member

Drive-by comment (I'll take a closer look at the rest of the PR later).

Instead of adding additional arguments here, I was hoping to deprecate these arguments in favor of a column format spec, similar to what was done for auto/json columns in #17762, which could serve as a reference for how this should be wired up. I was planning to move the existing createBitmapIndex and multiValueHandling into such a spec, but just haven't gotten to it yet. I think this would be much cleaner and less disruptive to call sites going forward. It also allows wiring up to IndexSpec, so that job-level defaults can be defined as a middle ground between per-column and system-wide settings.

Contributor Author

Thanks for taking a look @clintropolis. Adding something like StringCommonFormatColumnFormatSpec would make it cleaner, and it makes sense to consolidate the configs there. Since it seems like a bigger refactor, does it make sense to do it in a follow-up? Let me know what you think.

Member

My preference would be that this refactor is done before a release that contains it, so that we don't have to support both styles of configuration. Apologies, I didn't have a chance to do a review before it was merged.

*/
private static DimensionSchema.MultiValueHandling STRING_MV_MODE = DimensionSchema.MultiValueHandling.SORTED_ARRAY;
private static IndexSpec DEFAULT_INDEX_SPEC = IndexSpec.builder().build();
private static int MAX_STRING_LENGTH = 0;
Contributor

Would it make sense to set this to Integer max value? In case this is used elsewhere in the future, there wouldn't need to be explicit handling for 0 like you have in truncateIfNeeded.

Contributor Author

We are using NON_DEFAULT so that the default value is not serialized, and I think Jackson's default for an integer is 0. If we set the MAX_STRING_LENGTH default to Integer max, this value would be serialized for each dimension.
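To make the trade-off in this thread concrete, the following hypothetical Java sketch (not code from the PR; all names are illustrative) contrasts the two candidate defaults. With 0 as the "disabled" sentinel, the truncation helper needs an explicit check. With `Integer.MAX_VALUE`, the plain length comparison is enough, but, as the reply above notes, the default would then no longer coincide with the zero value that NON_DEFAULT skips during serialization.

```java
final class SentinelDefaultSketch
{
  static final int DISABLED_ZERO = 0;
  static final int DISABLED_MAX = Integer.MAX_VALUE;

  private SentinelDefaultSketch() {}

  /** Variant A: 0 means "no limit", so it must be special-cased. Assumes maxLength >= 0. */
  static String truncateWithZeroSentinel(String value, int maxLength)
  {
    if (value == null || maxLength == DISABLED_ZERO || value.length() <= maxLength) {
      return value;
    }
    return value.substring(0, maxLength);
  }

  /** Variant B: Integer.MAX_VALUE means "no limit"; no String can be longer than that. */
  static String truncateWithMaxSentinel(String value, int maxLength)
  {
    if (value == null || value.length() <= maxLength) {
      return value;
    }
    return value.substring(0, maxLength);
  }
}
```

Both variants behave the same for ordinary positive limits; they differ only in how "disabled" is represented.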

public SideEffectRegisterer initDimensionHandlerAndMvHandlingMode(DefaultColumnFormatConfig formatsConfig)
{
setStringMultiValueHandlingModeIfConfigured(formatsConfig.getStringMultiValueHandlingMode());
setMaxStringLengthIfConfigured(formatsConfig.getMaxStringLength());
Contributor

You can take a look at druid.indexing.formats.stringMultiValueHandlingMode in BuiltInTypesModuleTest. It would be good to have some test coverage for the new property.

Contributor Author

Added tests for this property.

this.stringMultiValueHandlingMode = validateMultiValueHandlingMode(stringMultiValueHandlingMode);
this.nestedColumnFormatVersion = nestedColumnFormatVersion;
this.indexSpec = indexSpec;
this.maxStringLength = maxStringLength;
Contributor

@aho135 commented Mar 19, 2026

Here we should validate that configured maxStringLength > 0, otherwise we can throw an exception or log that we are falling back to the default value

Contributor Author

Added validations.

);
}

private long verifyEncodedValues(
Contributor

Leaving a comment for us to revisit later:

I'm curious what the truncation behavior would be like for MVD

Contributor Author

I did have a chat with @abhishekrb19 about multi value strings and we decided to leave out truncation for those.

Contributor

Ahh okay. I see we're truncating for MVDs with a single value. It would be good to add a test case for that as well, if that's the intended behavior.
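The behavior discussed in this thread could look roughly like the sketch below. It is based only on the description above (rows with several values are left alone, rows with exactly one value are truncated) and is not the PR's implementation; `MultiValueTruncationSketch` and `truncateRow` are hypothetical names, and a limit of 0 or less is assumed to mean "disabled".

```java
import java.util.Collections;
import java.util.List;

final class MultiValueTruncationSketch
{
  private MultiValueTruncationSketch() {}

  /**
   * Truncates only when the row holds exactly one value.
   * Rows with zero or several values are returned unchanged.
   */
  static List<String> truncateRow(List<String> values, int maxLength)
  {
    if (values.size() != 1 || maxLength <= 0) {
      return values;
    }
    String only = values.get(0);
    if (only == null || only.length() <= maxLength) {
      return values;
    }
    return Collections.singletonList(only.substring(0, maxLength));
  }
}
```

A test for the single-value case would then assert that a one-element row is shortened while a two-element row with equally long values is returned as-is.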


@AfterEach
public void beforeEach()
@After
Contributor Author

I think there's a mix of JUnit 4 and JUnit 5 annotations, because of which AfterEach was not firing. Updated to use JUnit 4's annotation for cleanup.

Contributor

@aho135 left a comment

Thanks for the contribution @jaykanakiya! The changes look good to me. I'll let @clintropolis take a pass through as well. I think it would be fine to do the suggested refactoring in a follow up

Comment thread docs/configuration/index.md Outdated
|`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task to be used to store temporary files on disk. This config is generally intended for internal usage. Attempts to set it are very likely to be overwritten by the TaskRunner that executes the task, so be sure of what you expect to happen before directly adjusting this configuration parameter. The config is documented here primarily to provide an understanding of what it means if/when someone sees that it has been set. A value of -1 disables this limit. |-1|
|`druid.indexer.task.allowHadoopTaskExecution`|Conditional dictating if the cluster allows `index_hadoop` tasks to be executed. `index_hadoop` is deprecated, and defaulting to false will force cluster operators to acknowledge the deprecation and consciously opt in to using index_hadoop with the understanding that it will be removed in the future.|false|
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests served by a task's chat handler. Set to 0 to disable limiting.|0|
|`druid.indexing.formats.maxStringLength`|Maximum number of characters to store per string dimension value. Longer values are truncated during ingestion. Set to 0 to disable. Can be overridden per-dimension using `maxStringLength` in the [dimension object](../ingestion/ingestion-spec.md#dimension-objects).|0 (no truncation)|
Contributor

It would be good to mention that truncation does not apply to MVDs.

Contributor Author

I added the test and doc update in the next commit.

Comment thread docs/ingestion/ingestion-spec.md Outdated
Co-authored-by: aho135 <andrewho135@gmail.com>
@jaykanakiya changed the title from "Add configurable truncation for string columns" to "feat: Add configurable truncation for string columns" Mar 20, 2026
@aho135 merged commit 0c6a1da into apache:master Mar 20, 2026
63 of 65 checks passed
@github-actions (bot) added this to the 37.0.0 milestone Mar 20, 2026
|`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task to be used to store temporary files on disk. This config is generally intended for internal usage. Attempts to set it are very likely to be overwritten by the TaskRunner that executes the task, so be sure of what you expect to happen before directly adjusting this configuration parameter. The config is documented here primarily to provide an understanding of what it means if/when someone sees that it has been set. A value of -1 disables this limit. |-1|
|`druid.indexer.task.allowHadoopTaskExecution`|Conditional dictating if the cluster allows `index_hadoop` tasks to be executed. `index_hadoop` is deprecated, and defaulting to false will force cluster operators to acknowledge the deprecation and consciously opt in to using index_hadoop with the understanding that it will be removed in the future.|false|
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests served by a task's chat handler. Set to 0 to disable limiting.|0|
|`druid.indexing.formats.maxStringLength`|Maximum number of characters to store per string dimension value. Longer values are truncated during ingestion. Does not apply to multi-value string dimensions. Set to 0 to disable. Can be overridden per-dimension using `maxStringLength` in the [dimension object](../ingestion/ingestion-spec.md#dimension-objects).|0 (no truncation)|
Contributor

Should the default be -1? 0 seems counterintuitive.
Also, making it -1 would align with maxBytesInMemory, maxParseExceptions, etc.

@Nullable
private static Integer validateMaxStringLength(@Nullable Integer maxStringLength)
{
if (maxStringLength != null && maxStringLength <= 0) {
Contributor

The current documentation says to set 0 to disable, but this check would fail startup if an operator sets it to 0, right? Should this check be maxStringLength < 0?

If we do make the default be -1, then this check wouldn't be relevant anymore
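The mismatch raised here can be shown with a small hypothetical sketch. The class name, message text, and exception type are illustrative and not taken from the PR. It shows the `< 0` variant of the check, under which the documented value 0 ("disabled") is still accepted and only negative values are rejected.

```java
final class MaxStringLengthValidationSketch
{
  private MaxStringLengthValidationSketch() {}

  /** Accepts null (unset) and 0 (documented as "disabled"); rejects negative values. */
  static Integer validateMaxStringLength(Integer maxStringLength)
  {
    if (maxStringLength != null && maxStringLength < 0) {
      throw new IllegalArgumentException(
          "maxStringLength must be >= 0, got [" + maxStringLength + "]"
      );
    }
    return maxStringLength;
  }
}
```

With the `<= 0` check quoted in the diff, the same call with 0 would throw instead, which is the startup failure described above.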

Comment on lines +86 to +87

private String truncateIfNeeded(String value)
Contributor

Please add a short javadoc here

Suggested change
- private String truncateIfNeeded(String value)
+ @Nullable
+ private String truncateIfNeeded(@Nullable String value)

For the future, it would be helpful to have a counter when this truncation occurs, so we have visibility into data integrity. The counter can then be periodically emitted, similar to thrownAway/dropped event metrics, etc.
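One possible shape for such a counter is sketched below. This is an assumption-laden illustration, not part of the PR: `TruncationCounterSketch` and `getAndResetTruncatedCount` are hypothetical names, the metric emission itself is left out, and a limit of 0 or less is assumed to mean "disabled".

```java
import java.util.concurrent.atomic.AtomicLong;

final class TruncationCounterSketch
{
  private final int maxLength;
  // Thread-safe count of values that were actually shortened.
  private final AtomicLong truncatedCount = new AtomicLong();

  TruncationCounterSketch(int maxLength)
  {
    this.maxLength = maxLength;
  }

  /** Returns the value unchanged unless it exceeds maxLength; counts each truncation. */
  String truncateIfNeeded(String value)
  {
    if (value == null || maxLength <= 0 || value.length() <= maxLength) {
      return value;
    }
    truncatedCount.incrementAndGet();
    return value.substring(0, maxLength);
  }

  /** Returns the count accumulated since the last call and resets it, for periodic emission. */
  long getAndResetTruncatedCount()
  {
    return truncatedCount.getAndSet(0);
  }
}
```

A periodic monitor could call `getAndResetTruncatedCount()` and emit the result alongside the existing thrownAway/dropped style metrics mentioned above.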

abhishekrb19 pushed a commit that referenced this pull request Mar 23, 2026
Follow up to #19146


Updated the default value for druid.indexing.formats.maxStringLength from 0 to null. This change also includes the corresponding documentation update.
4 participants