Reindex categorical codes from Polars by lbittarello · Pull Request #387 · Quantco/tabmat

Luca Bittarello (lbittarello) · 2024-09-11T17:22:08Z

When loading Parquet data, I realised that Polars doesn't necessarily start categorical codes at zero, which causes the kernel to crash in, e.g., transpose_matvec_complex. This PR adds reindexing for safety.

Jan Tilly (jtilly)

Changelog?

Does this need a test (maybe Martin Stancsics (@stanmart) can contribute one)? Seems like a subtle bug that's easy to re-introduce.

Martin Stancsics (stanmart) · 2024-09-12T06:41:11Z

AFAICT _extract_codes_and_categories still relies on the fact that the pl.Series.get_categories() returns category labels such that their codes are ordered (increasing). I would expect that to be the case, but then I would have also expected category codes to have no gaps. Do you know of a way to make sure it is true (without comparing physical values and labels, which is probably quite expensive)? A cursory search told me polars considers the codes an implementation detail and do not really want to expose the mapping.

Martin Stancsics (stanmart) · 2024-09-12T07:30:05Z

Here's a mwe to reproduce the non-consecutive codes without loading/writing data:

import polars as pl

with pl.StringCache():
    s1 = pl.Series(["beagle", "poodle", "labrador"], dtype=pl.Categorical)
    s2 = pl.Series(["labrador", "boxer", "beagle"], dtype=pl.Categorical)

print("s1 categories: ", s1.cat.get_categories().to_numpy())
print("s1 codes: ", s1.to_physical().to_numpy())

print("s2 categories: ", s2.cat.get_categories().to_numpy())
print("s2 codes: ", s2.to_physical().to_numpy())

Output:

s1 categories:  ['beagle' 'poodle' 'labrador']
s1 codes:  [0 1 2]
s2 categories:  ['labrador' 'boxer' 'beagle']
s2 codes:  [2 3 0]

~~The good news is that get_categories() indeed seems to return the categories such that their codes are increasing, even when the observation with the lowest code comes last.~~

It seems that get_categories returns labels in order of their apparence, and not their category codes. I'll turn it into a test case.

Martin Stancsics (stanmart) · 2024-09-12T07:37:55Z

It will be an issue, though:

_extract_codes_and_categories(s2)

(array([1, 2, 0]), array(['labrador', 'boxer', 'beagle'], dtype=object))

If we reconstruct the categorical from this, it will be ['boxer', 'beagle', 'labrador'], which is incorrect.

Martin Stancsics (stanmart)

Need to handle polars.Series.get_categories() returning category labels such that their codes are not necessarily in an increasing order.

Luca Bittarello (lbittarello) · 2024-09-12T08:32:59Z

I implemented a different fix, which gets around the string cache by localising the series. It allows us to reconstitute both arrays in Martin's example. (My hunch is that Polars is using some sort of string caching when reading data from disk, which is why I thought that there might be gaps in the first place, but I did naively expect the categories to be in the right order. Thanks for catching that, Martin Stancsics (@stanmart).)

Changelog?

Polars support hasn't made it into any release yet. The existing log entry (under new features) should be enough.

* Add dependencies to pixi.toml * Pixi-ize pre-commit * Add pixi tasks * Update CI * Fix build dependencies * update lockfile * Fix doctest * Try to fix readthedocs * Use latest pixi on conda-forge * Find some minimum versions * Bump minimum formulaic version * Find minimum numpy version * Make polars a test dependency * Update lockfile * Fix typing issues * Fix benchmarks * Update contributing docs * Make ruff happy * Remove unnecessary pre-commit option from CI * first try * Added deprecation, docstring * replace from_pandas and from_polars * keep sorting * add narwhals to conda recipe * bump minimum narwhals version * added narwhals to setup.py * Changelog * Fix categoricals with non-numpy-or-pandas input * Fix categoricals from numpy/list input * Remove unnecessary import * Merge fix from #387 * Bump minimum narwhals version * Update tests * Remove unnecessary argument * Simplify `_extract_codes_and_categories` * Make the check work with the new changes * Import narwhals' stable v1 API --------- Co-authored-by: Martin Stancsics <martin.stancsics@quantco.com>

Reindex categorical codes from Polars

f86b9aa

Luca Bittarello (lbittarello) requested review from Marc-Antoine Schmidt (MarcAntoineSchmidtQC), Jan Tilly (jtilly) and Uwe L. Korn (xhochy) as code owners September 11, 2024 17:22

Jan Tilly (jtilly) approved these changes Sep 11, 2024

View reviewed changes

Martin Stancsics (stanmart) requested changes Sep 12, 2024

View reviewed changes

Localize

db89cdd

Add test for non-contiguous polars categoricals

9a56ea8

Martin Stancsics (stanmart) approved these changes Sep 12, 2024

View reviewed changes

Luca Bittarello (lbittarello) merged commit e288b47 into main Sep 12, 2024

Luca Bittarello (lbittarello) deleted the polars-4 branch September 12, 2024 09:48

Martin Stancsics (stanmart) added a commit that referenced this pull request Sep 13, 2024

Merge fix from #387

1d58498

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reindex categorical codes from Polars#387

Reindex categorical codes from Polars#387
Luca Bittarello (lbittarello) merged 3 commits intomainfrom
polars-4

Luca Bittarello (lbittarello) commented Sep 11, 2024

Uh oh!

Jan Tilly (jtilly) left a comment

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024 •

edited

Loading

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024 •

edited

Loading

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024

Uh oh!

Martin Stancsics (stanmart) left a comment

Uh oh!

Luca Bittarello (lbittarello) commented Sep 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Luca Bittarello (lbittarello) commented Sep 11, 2024

Uh oh!

Jan Tilly (jtilly) left a comment

Choose a reason for hiding this comment

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Martin Stancsics (stanmart) commented Sep 12, 2024

Uh oh!

Martin Stancsics (stanmart) left a comment

Choose a reason for hiding this comment

Uh oh!

Luca Bittarello (lbittarello) commented Sep 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Martin Stancsics (stanmart) commented Sep 12, 2024 •

edited

Loading

Martin Stancsics (stanmart) commented Sep 12, 2024 •

edited

Loading