Description
We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of rows).
Using a StringArray is very expensive because the same string value is copied over and over again. The DictionaryArray was designed to handle exactly this situation: rather than repeating each string, it stores each distinct string once in a dictionary and repeats only integer indexes into that dictionary.
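To make the layout concrete, here is a minimal sketch of dictionary encoding using only the Rust standard library. The names `DictColumn`, `encode`, and `decode` are hypothetical and exist only for this illustration; this is not Arrow's DictionaryArray implementation, just the same idea in miniature (one copy of each distinct string, plus one integer key per row).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a dictionary-encoded string column.
/// NOT Arrow's DictionaryArray; it only illustrates the layout idea.
#[derive(Debug, Default)]
struct DictColumn {
    /// Each distinct string is stored exactly once.
    values: Vec<String>,
    /// One integer key per row; `None` marks a null row.
    keys: Vec<Option<i32>>,
}

/// Build a dictionary-encoded column from plain (nullable) strings.
fn encode(rows: &[Option<&str>]) -> DictColumn {
    let mut col = DictColumn::default();
    // Maps a string to the key it was assigned on first sight.
    let mut index: HashMap<String, i32> = HashMap::new();
    for &row in rows {
        let key = match row {
            None => None,
            Some(s) => {
                // Key that a new value would receive.
                let next = col.values.len() as i32;
                let k = *index.entry(s.to_string()).or_insert(next);
                if k == next {
                    // First occurrence: store the string once.
                    col.values.push(s.to_string());
                }
                Some(k)
            }
        };
        col.keys.push(key);
    }
    col
}

/// Expand the keys back into the original row values.
fn decode(col: &DictColumn) -> Vec<Option<&str>> {
    col.keys
        .iter()
        .map(|k| k.map(|i| col.values[i as usize].as_str()))
        .collect()
}

fn main() {
    let rows = [Some("one"), None, Some("three"), Some("one"), Some("one")];
    let col = encode(&rows);
    println!("values = {:?}", col.values);
    println!("keys   = {:?}", col.keys);
    // Decoding reproduces the original rows.
    assert_eq!(decode(&col), rows.to_vec());
}
```

With millions of rows and only a handful of distinct strings, `values` stays tiny while `keys` grows by one small integer per row, which is the saving described above.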
Sadly, DataFusion does not currently support processing DictionaryArray types, for several reasons.
This test (to be added to arrow/rust/datafusion/tests/sql.rs) shows what I would like to be possible:
#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
    // ensure that data fusion can operate on dictionary types
    // Use StringDictionary (32 bit indexes = keys)
    let field_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));

    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
    let values_builder = StringBuilder::new(10);
    let mut builder = StringDictionaryBuilder::new(
        keys_builder, values_builder
    );
    builder.append("one")?;
    builder.append_null()?;
    builder.append("three")?;
    let array = Arc::new(builder.finish());

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![array],
    )?;
    let table = MemTable::new(schema, vec![vec![data]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));

    // Basic SELECT
    let sql = "SELECT * FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\nNULL\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // basic filtering
    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // filtering with constant
    let sql = "SELECT * FROM test WHERE d1 = 'three'";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"three\"".to_string();
    assert_eq!(expected, actual);

    // Expression evaluation
    let sql = "SELECT concat(d1, '-foo') FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
    assert_eq!(expected, actual);

    // aggregation
    let sql = "SELECT COUNT(d1) FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "2".to_string();
    assert_eq!(expected, actual);

    Ok(())
}
However, it errors immediately:
{code}
---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'assertion failed: (left == right)
left: "\"one\"\nNULL\n\"three\"",
right: "?<q>\nNULL\n</q>?"', datafusion/tests/sql.rs:989:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
{code}
This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.
Reporter: Andrew Lamb / @alamb
Assignee: Andrew Lamb / @alamb
Subtasks:
- [Rust] Improve documentation of DictionaryType
- [Rust] Support display of DictionaryArrays in pretty printing
- [Rust] [DataFusion] Add DictionaryArray coercion support
- [Rust] Add support for DictionaryArray types to cast kernels
- [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel
- [Rust] Support display of DictionaryArrays in sql.rs
Note: This issue was originally created as ARROW-10159. Please see the migration documentation for further details.