`from_polars` by lbittarello · Pull Request #370 · Quantco/tabmat

Luca Bittarello (lbittarello) · 2024-06-17T09:36:01Z

This PR adds basic support for Polars.

On the surface, it adds a from_polars function that mirrors from_pandas, except that it does not accept object columns, which are painful to work with in Polars. Because accessor methods differ between Pandas and Polars, it also restructures CategoricalMatrix so that it stores the categories and indices instead of a categorical array, which should also lower its memory footprint.

This PR also streamlines from_pandas and collects tests into a dedicated file.

I haven't extended the formula interface; that is left to a future PR.
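
As a rough illustration of what storing categories and integer indices (rather than a categorical array) means, here is a minimal pure-Python sketch. It is not tabmat's implementation: the function name `factorize` and the use of -1 as the code for missing values are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def factorize(values: Sequence[Optional[str]]) -> Tuple[List[str], List[int]]:
    """Split values into sorted unique categories and integer codes.

    Missing values (None) get the code -1 (an assumption of this sketch).
    """
    categories = sorted({v for v in values if v is not None})
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [-1 if v is None else position[v] for v in values]
    return categories, codes


values = ["b", "a", None, "b"]
categories, codes = factorize(values)

# The two pieces are enough to rebuild the original vector.
restored = [None if c < 0 else categories[c] for c in codes]
```

Only the short list of categories and one integer per row need to be kept, which is where a memory saving could come from.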

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2024-06-17T12:46:50Z

I'm trying to fix the build error with NumPy 2.0. In the meantime, you are missing the dependency in setup.py (line 160).

Christian Bourjau (cbourjau)

Interesting to see some Pandas/Polars compatibility in action! I left a few comments as ideas to possibly decrease code duplication, if you haven't yet considered them.

Christian Bourjau (cbourjau)

I suppose we are breaking new ground by supporting Polars, Pandas and NumPy in here. I think/hope we might still find a slightly easier-to-maintain approach. Here are some thoughts:

I think some of the difficulties originate from having the mapping of different backend libraries intertwined with the business logic of this library. Things might get easier if the mapping is more cleanly separated behind wrapper classes.

The CategoricalMatrix takes a cat_vec as an argument. Right now, the __init__ method maps all possible input types to a disassembled representation: a tuple of NumPy arrays plus a dtype (namely indices, categories, and _input_dtype). The cat property then reassembles these with a similar mapping. Instead, one may consider wrapping the cat_vec in a class that exposes exactly the data that we might want to derive from cat_vec as properties/functions. These functions would encapsulate the necessary mapping logic. Something like:

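from __future__ import annotations

import numpy as np
import pandas as pd
import polars as pl


class CatVec:
    # Wraps the raw input and hides the backend-specific accessors.
    _wrapped: np.ndarray | pd.Series | pl.Series

    @property
    def categories(self) -> np.ndarray:
        # get categories from `self._wrapped` on the fly
        ...

    @property
    def codes(self) -> np.ndarray: ...

    @property
    def shape(self) -> tuple[int, ...]: ...

    def contains_missing(self) -> bool: ...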

The CategoricalMatrix.indices, CategoricalMatrix.shape, etc. members could be replaced by properties that simply call through to the (privately) stored CatVec instance. Might this work better?
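
To make the suggestion concrete, the sketch below shows one way such a wrapper and the call-through properties might fit together. It is deliberately backend-free (it wraps a plain Python sequence and returns lists, not NumPy arrays), and every name other than `CatVec` is hypothetical; real Pandas or Polars inputs would need their own mapping logic inside the properties.

```python
from typing import Any, List, Sequence, Tuple


class CatVec:
    """Hypothetical wrapper exposing a uniform view of a categorical vector."""

    def __init__(self, wrapped: Sequence[Any]) -> None:
        self._wrapped = wrapped

    @property
    def categories(self) -> List[Any]:
        # Derived from the wrapped data on the fly; None counts as missing.
        return sorted({v for v in self._wrapped if v is not None})

    @property
    def codes(self) -> List[int]:
        lookup = {cat: i for i, cat in enumerate(self.categories)}
        # Missing values are not in the lookup and map to -1.
        return [lookup.get(v, -1) for v in self._wrapped]

    @property
    def shape(self) -> Tuple[int, ...]:
        return (len(self._wrapped),)

    def contains_missing(self) -> bool:
        return any(v is None for v in self._wrapped)


class CategoricalMatrixSketch:
    """Stand-in for CategoricalMatrix that only delegates to CatVec."""

    def __init__(self, cat_vec: Sequence[Any]) -> None:
        self._cat_vec = CatVec(cat_vec)

    @property
    def indices(self) -> List[int]:
        return self._cat_vec.codes

    @property
    def shape(self) -> Tuple[int, int]:
        # One row per observation, one column per category.
        return (self._cat_vec.shape[0], len(self._cat_vec.categories))


matrix = CategoricalMatrixSketch(["b", "a", None, "b"])
```

The matrix class then contains no backend-specific branches; supporting another input type would only touch the wrapper.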

Christian Bourjau (cbourjau) · 2024-06-24T21:06:33Z

    def __init__(
        self,
-        cat_vec: Union[list, np.ndarray, pd.Categorical],
+        cat_vec,


mypy might actually be useful here. It is quite good with union types and narrowing them in if-statements these days.

Luca Bittarello (lbittarello) · 2024-06-25

mypy doesn't play nice with optional imports. I tried conditioning imports on type hinting to no avail.

Christian Bourjau (cbourjau) · 2024-06-25

Using TYPE_CHECKING didn't work? I think it is reasonable to assume both Pandas and Polars are installed in the dev environment, isn't it?

Luca Bittarello (lbittarello) · 2024-06-25

> Using TYPE_CHECKING didn't work?

No. :\ I may have done it wrong though.
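
For reference, the TYPE_CHECKING pattern discussed in this thread usually looks something like the sketch below. It is a generic illustration, not the code in this PR: `backend_of` is a hypothetical helper, and checking `sys.modules` at runtime is just one assumed way to avoid importing optional dependencies.

```python
import sys
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by the type checker; never executed at runtime, so the
    # module still imports when Pandas or Polars is not installed.
    import pandas as pd
    import polars as pl


def backend_of(cat_vec: "list | pd.Categorical | pl.Series") -> str:
    """Name the backend of `cat_vec` without importing optional packages."""
    # An object can only be a Pandas/Polars instance if that library has
    # already been imported, so looking it up in sys.modules is enough.
    pandas_module = sys.modules.get("pandas")
    if pandas_module is not None and isinstance(cat_vec, pandas_module.Categorical):
        return "pandas"
    polars_module = sys.modules.get("polars")
    if polars_module is not None and isinstance(cat_vec, polars_module.Series):
        return "polars"
    return "python"


result = backend_of(["a", "b", "a"])
```

The annotation is written as a string so that it is never evaluated at runtime; a type checker with both libraries installed in the dev environment can still resolve it.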

Luca Bittarello (lbittarello) · 2024-06-25T08:05:48Z

I appreciate that the current code is messy and wrapper classes would help streamline it, but it'll look better once we are past the transition phase.

We currently have a cat attribute that returns a pandas.Categorical type, regardless of the input type. This PR transformed it into a property, which returns a polars.Series if the input is one and a pandas.Categorical otherwise. We use the _input_dtype attribute to keep track of the input type.

At some point, we will do away with the cat property. Once that happens, we'll also be able to remove the _Categorical helper class and the _input_dtype attribute. We'll only need to extract the categories and indices from the input vector, which is straightforward.

If your concern is maintenance, it's all temporary, so it might not warrant a lot of infrastructure.

Christian Bourjau (cbourjau) · 2024-06-25T08:47:53Z

> I appreciate that the current code is messy and wrapper classes would help streamline it, but it'll look better once we are past the transition phase.

As you prefer.

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) · 2024-07-03T09:59:09Z

-        )
-        indices.append(dense_mxidx)
+    if dense_columns:
+        matrices.append(_dense_matrix(df, dense_columns, dtype))


The problem in glum comes from here. It should be df[dense_columns]. However, selecting by column name doesn't work here because names may be duplicated, which is why we previously had logic that used the column index.
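
The pitfall with duplicate names can be shown without any data frame library. The sketch below is a toy model (a list of names next to a list of columns, all hypothetical) and not the tabmat constructor: selecting by name picks up every column that shares the name, while selecting by position returns exactly the requested columns.

```python
# A toy "data frame": parallel lists of column names and column data.
names = ["x", "x", "y"]
columns = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Suppose only the second "x" column and "y" are dense.
dense_positions = [1, 2]

# Selecting by name also matches the first "x" column.
wanted_names = {names[i] for i in dense_positions}
by_name = [col for name, col in zip(names, columns) if name in wanted_names]

# Selecting by position is unambiguous.
by_position = [columns[i] for i in dense_positions]
```

Here `by_name` ends up with three columns and `by_position` with the intended two.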

Luca Bittarello (lbittarello) requested review from Marc-Antoine Schmidt (MarcAntoineSchmidtQC) and Jan Tilly (jtilly) June 17, 2024 09:36

Luca Bittarello (lbittarello) requested a review from Uwe L. Korn (xhochy) as a code owner June 17, 2024 09:36

Christian Bourjau (cbourjau) reviewed Jun 17, 2024

Comment thread src/tabmat/categorical_matrix.py Outdated

Comment thread src/tabmat/constructor.py

Comment thread src/tabmat/constructor.py

Comment thread tests/test_constructor.py Outdated

Comment thread tests/test_constructor.py Outdated

Comment thread src/tabmat/constructor.py

Luca Bittarello (lbittarello) force-pushed the polars-1 branch 4 times, most recently from f5ad514 to 873a556 Compare June 18, 2024 14:56

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) linked an issue Jun 18, 2024 that may be closed by this pull request: Create SplitMatrix from polars data frame #329 (closed)

Matthias Schmidtblaicher (MatthiasSchmidtblaicherQC) reviewed Jun 19, 2024

Comment thread src/tabmat/constructor.py Outdated

Christian Bourjau (cbourjau) reviewed Jun 24, 2024

Jan Tilly (jtilly) approved these changes Jun 28, 2024

Luca Bittarello (lbittarello) added 13 commits June 28, 2024 10:24

8c44035  Environment
2532dc9  Categorical matrix
207c548  Constructor
7be43d0  Tests
f80a977  Change log
fd872c1  Patch
f1728f3  Dependency
530b519  Helpers
90e43cb  Simplify tests
9100ddc  It's all optional
7b21296  Patch
e2d059e  Docstrings [skip ci]
f83e918  Helper function

Luca Bittarello (lbittarello) force-pushed the polars-1 branch from a3b6555 to f83e918 Compare June 28, 2024 09:24

Luca Bittarello (lbittarello) merged commit 06cddf9 into main Jul 2, 2024

Luca Bittarello (lbittarello) deleted the polars-1 branch July 2, 2024 07:07

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) mentioned this pull request Jul 3, 2024: Daily run failure: Unit tests Quantco/glum#816 (closed)

Marc-Antoine Schmidt (MarcAntoineSchmidtQC) reviewed Jul 3, 2024

This was referenced Sep 12, 2024:

How to avoid any unnecessary copies from numpy -> tabmat / glum, including categoricals? Quantco/glum#838 (closed)

Use narwhals to support Polars, cuDF, Modin, etc. #388 (merged)

Martin Stancsics (stanmart) mentioned this pull request Dec 19, 2025:

Add polars/narwhals support to the formula interface #502 (merged)
