[Rust][DataFusion] Add support for Dictionary types in data fusion #26169

@asfimport

Description

We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of rows).

Using a StringArray is very expensive because the same string value is copied over and over again. The DictionaryArray was designed to handle exactly this situation: rather than repeating each string, it stores integer indexes into a dictionary of distinct values, so only the integers repeat.
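To make the idea concrete, here is a minimal sketch of dictionary encoding in plain Rust using only the standard library. It is not Arrow's implementation: the `DictColumn` type and its methods are hypothetical names invented for illustration. The point is only that each distinct string is stored once and every row holds a small integer key (or a null).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified dictionary-encoded string column
/// (illustration only; this is not Arrow's DictionaryArray).
/// `keys[i]` is an index into `values`, or `None` for a null row.
#[derive(Debug, Default)]
struct DictColumn {
    keys: Vec<Option<i32>>,
    values: Vec<String>,
    // Reverse map from string value to its key, used while building.
    lookup: HashMap<String, i32>,
}

impl DictColumn {
    fn new() -> Self {
        Self::default()
    }

    /// Appends a row and returns the key used for it.
    /// A string seen before reuses its existing key.
    fn append(&mut self, s: &str) -> i32 {
        if let Some(&k) = self.lookup.get(s) {
            self.keys.push(Some(k));
            return k;
        }
        // First occurrence: store the string once and assign the next key.
        let k = self.values.len() as i32;
        self.values.push(s.to_string());
        self.lookup.insert(s.to_string(), k);
        self.keys.push(Some(k));
        k
    }

    /// Appends a null row; nothing is added to the dictionary.
    fn append_null(&mut self) {
        self.keys.push(None);
    }

    /// Returns the string at `row`, or `None` if the row is null
    /// or out of bounds.
    fn get(&self, row: usize) -> Option<&str> {
        self.keys
            .get(row)
            .copied()
            .flatten()
            .map(|k| self.values[k as usize].as_str())
    }
}
```

With this sketch, appending "one", a null, "three", and "one" again yields four keys but only two stored strings, which is the saving that matters when millions of rows share a handful of distinct values.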

Sadly, DataFusion does not support processing on DictionaryArray types for several reasons.

This test (to be added to arrow/rust/datafusion/tests/sql.rs) shows what I would like to be possible:

#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
    // ensure that data fusion can operate on dictionary types
    // Use StringDictionary (32 bit indexes = keys)
    let field_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));


    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
    let values_builder = StringBuilder::new(10);
    let mut builder = StringDictionaryBuilder::new(
        keys_builder, values_builder
    );

    builder.append("one")?;
    builder.append_null()?;
    builder.append("three")?;
    let array = Arc::new(builder.finish());

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![array],
    )?;

    let table = MemTable::new(schema, vec![vec![data]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));


    // Basic SELECT
    let sql = "SELECT * FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\nNULL\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // basic filtering
    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // filtering with constant
    let sql = "SELECT * FROM test WHERE d1 = 'three'";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"three\"".to_string();
    assert_eq!(expected, actual);

    // Expression evaluation
    let sql = "SELECT concat(d1, '-foo') FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
    assert_eq!(expected, actual);

    // aggregation
    let sql = "SELECT COUNT(d1) FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "2".to_string();
    assert_eq!(expected, actual);


    Ok(())
}

However, it errors immediately:

{code}

---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'assertion failed: (left == right)
left: "\"one\"\nNULL\n\"three\"",
right: "?<q>\nNULL\n</q>?"', datafusion/tests/sql.rs:989:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

{code}

This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.

Reporter: Andrew Lamb / @alamb
Assignee: Andrew Lamb / @alamb

Note: This issue was originally created as ARROW-10159. Please see the migration documentation for further details.
