[Java] Support linear dictionary encoder

In many scenarios, the distribution of dictionary entries is highly skewed: a few dictionary entries occur much more frequently than the others. If we sort the dictionary in non-increasing order of entry frequency and compare each value to be encoded against the entries starting from the beginning of the dictionary, we get the following benefits:

1) We need no extra memory space or data structure.
2) The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario is right (a highly skewed dictionary distribution), it outperforms both search-based and hash-table-based encoders.
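
The idea above can be illustrated with a minimal, self-contained sketch. It is not the code from the linked pull request and does not use Arrow's vector classes; the class name `LinearDictionaryEncoder` and the methods `fromSample`, `encode`, and `encodeAll` are hypothetical, and plain `String` values stand in for dictionary entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a linear dictionary encoder; not the Arrow API. */
final class LinearDictionaryEncoder {
    // Dictionary entries, ordered from most frequent to least frequent.
    private final String[] dictionary;

    LinearDictionaryEncoder(String[] dictionarySortedByFrequency) {
        this.dictionary = dictionarySortedByFrequency.clone();
    }

    /** Builds a dictionary from sample data, sorted in non-increasing order of frequency. */
    static LinearDictionaryEncoder fromSample(List<String> sample) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : sample) {
            counts.merge(value, 1, Integer::sum);
        }
        List<String> entries = new ArrayList<>(counts.keySet());
        // Most frequent first; ties are broken by natural order so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new LinearDictionaryEncoder(entries.toArray(new String[0]));
    }

    /** Scans from the start of the dictionary; returns the matching index, or -1 if absent. */
    int encode(String value) {
        for (int i = 0; i < dictionary.length; i++) {
            if (dictionary[i].equals(value)) {
                return i; // for skewed data this usually happens within the first few entries
            }
        }
        return -1;
    }

    /** Encodes every value with the linear scan above. */
    int[] encodeAll(List<String> values) {
        int[] indices = new int[values.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = encode(values.get(i));
        }
        return indices;
    }
}
```

The only state is the sorted dictionary itself, which matches benefit 1. The worst case is still a full scan of the dictionary, so the approach pays off only when the distribution is skewed enough that most lookups stop early.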


**Reporter**: [Liya Fan](https://issues.apache.org/jira/browse/ARROW-6933) / @liyafan82
**Assignee**: [Liya Fan](https://issues.apache.org/jira/browse/ARROW-6933) / @liyafan82
#### PRs and other links:
- [GitHub Pull Request #5692](https://github.com/apache/arrow/pull/5692)

<sub>**Note**: *This issue was originally created as [ARROW-6933](https://issues.apache.org/jira/browse/ARROW-6933). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
