[C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns #19735

@asfimport

Description

For many datasets, dictionary encoding every string and binary column can drastically lower memory usage and, as a result, improve analytics performance.

One difficulty of dictionary encoding in multithreaded conversions is that ideally you end up with one dictionary at the end. So you have two options:

  • Implement a concurrent hashing scheme. For low-cardinality dictionaries, the overhead from mutex contention will not be meaningful; for high-cardinality dictionaries it can be more of a problem.

  • Hash each chunk separately, then normalize the per-chunk dictionaries into one at the end.

My guess is that a crude concurrent hash table, with a mutex to protect mutations and resizes, is going to outperform the second option.
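To make the two options concrete, here is a minimal sketch of both. Everything in it is hypothetical and for illustration only: `SharedDictionary`, `EncodedChunk`, `EncodeChunk` and `Normalize` are made-up names, not Arrow APIs, and columns are modelled as plain `std::vector<std::string>`. For simplicity the shared table takes the mutex on lookups as well as inserts, since `std::unordered_map` is not safe to read while another thread mutates it.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Option 1 (hypothetical): one dictionary shared by all conversion threads.
class SharedDictionary {
 public:
  // Returns the index of `value`, inserting it if it has not been seen yet.
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    index_.emplace(value, id);
    return id;
  }

  // Copy of the dictionary values, in index order.
  std::vector<std::string> values() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, int32_t> index_;
  std::vector<std::string> values_;
};

// Option 2 (hypothetical): each chunk gets a private dictionary, no locking.
struct EncodedChunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

EncodedChunk EncodeChunk(const std::vector<std::string>& column) {
  EncodedChunk out;
  std::unordered_map<std::string, int32_t> index;
  for (const auto& value : column) {
    auto inserted =
        index.emplace(value, static_cast<int32_t>(out.dictionary.size()));
    if (inserted.second) out.dictionary.push_back(value);
    out.indices.push_back(inserted.first->second);
  }
  return out;
}

// Final single-threaded pass: merge the per-chunk dictionaries into one and
// rewrite every chunk's indices to point into the unified dictionary.
std::vector<std::string> Normalize(std::vector<EncodedChunk>& chunks) {
  std::vector<std::string> unified;
  std::unordered_map<std::string, int32_t> index;
  for (auto& chunk : chunks) {
    std::vector<int32_t> remap(chunk.dictionary.size());
    for (size_t i = 0; i < chunk.dictionary.size(); ++i) {
      auto inserted = index.emplace(chunk.dictionary[i],
                                    static_cast<int32_t>(unified.size()));
      if (inserted.second) unified.push_back(chunk.dictionary[i]);
      remap[i] = inserted.first->second;
    }
    for (auto& idx : chunk.indices) {
      assert(idx >= 0 && static_cast<size_t>(idx) < remap.size());
      idx = remap[idx];
    }
    chunk.dictionary.clear();  // indices now refer to `unified`
  }
  return unified;
}
```

With option 1, every worker thread calls `GetOrInsert` on the same `SharedDictionary` and the indices are final as soon as they are written. With option 2, workers call `EncodeChunk` independently and `Normalize` runs once after they finish, at the cost of a second pass over all indices.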

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-3408. Please see the migration documentation for further details.
