
from_polars #370

Merged
Luca Bittarello (lbittarello) merged 13 commits into main from polars-1
Jul 2, 2024

Conversation

@lbittarello
Member

This PR adds basic support for Polars.

On the surface, it adds a from_polars function that mirrors from_pandas, except that it doesn't allow object columns, which are painful to work with in Polars. Because accessor methods differ between Pandas and Polars, it also restructures CategoricalMatrix so that it stores categories instead of a categorical array, which should also lower its memory footprint.
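As a rough illustration of the storage change described above, the following NumPy-only sketch splits a vector into an array of unique categories and an array of integer indices. The helper name `split_categories` is hypothetical and not part of tabmat, and the actual implementation in this PR may differ.

```python
import numpy as np


def split_categories(values):
    """Split a vector into unique categories and integer indices (hypothetical helper)."""
    # np.unique sorts the distinct values and, with return_inverse=True,
    # also returns the position of each element within that sorted array.
    categories, indices = np.unique(np.asarray(values), return_inverse=True)
    return categories, indices.astype(np.int32)


categories, indices = split_categories(["b", "a", "b", "c"])

# The original vector can be rebuilt from the two arrays when needed.
rebuilt = categories[indices]
```

Keeping only these two arrays avoids any dependency on a backend-specific categorical type.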

This PR also streamlines from_pandas and collects tests into a dedicated file.

I haven't extended the formula interface, which will be left to a future PR.

@MarcAntoineSchmidtQC
Member

Trying to fix the build error with numpy 2.0. In the meantime, you are missing the dependency in setup.py (line 160).


Interesting to see some Pandas/Polars compatibility in action! I left a few comments as ideas to possibly decrease code duplication, if you haven't yet considered them.

Comment thread src/tabmat/categorical_matrix.py Outdated
Comment thread src/tabmat/constructor.py
Comment thread src/tabmat/constructor.py
Comment thread tests/test_constructor.py Outdated
Comment thread tests/test_constructor.py Outdated
Comment thread src/tabmat/constructor.py
Comment thread src/tabmat/constructor.py Outdated

@cbourjau Christian Bourjau (cbourjau) left a comment


I suppose we are breaking new ground by supporting Polars, pandas and NumPy in here. I think/hope we might still find a slightly easier-to-maintain approach. Here are some thoughts:

I think some of the difficulties originate from having the mapping of different backend libraries intertwined with the business logic of this library. Things might get easier if the mapping is more cleanly separated behind wrapper classes. The CategoricalMatrix takes a cat_vec as an argument. Right now, the __init__ method maps all possible input types to the disassembled representation of a tuple of NumPy arrays+dtype (namely indices, categories, and _input_dtype). The cat property then reassembles these with a similar mapping. Instead, one may consider wrapping the cat_vec in a class that exposes exactly the data that we might want to derive from cat_vec as properties/functions. These functions would encapsulate the necessary mapping logic. Something like:

import numpy as np
import pandas as pd
import polars as pl


class CatVec:
    # the wrapped input, kept in whatever form the caller passed in
    _wrapped: np.ndarray | pd.Series | pl.Series

    @property
    def categories(self) -> np.ndarray:
        # get categories from `self._wrapped` on the fly
        ...

    @property
    def codes(self) -> np.ndarray: ...

    @property
    def shape(self) -> tuple[int, ...]: ...

    def contains_missing(self) -> bool: ...

The CategoricalMatrix.index, CategoricalMatrix.shape, etc. members could be replaced by properties that simply call through to the (privately) stored CatVec instance. Might this work better?
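For readers who want to see the call-through idea end to end, here is a minimal runnable sketch. It assumes NumPy-compatible input only, so the pandas and Polars branches are left out. `CategoricalMatrixSketch` is a hypothetical stand-in rather than tabmat's CategoricalMatrix, and handling of missing values is omitted.

```python
import numpy as np


class CatVec:
    """Wrap a categorical vector and derive everything else on demand."""

    def __init__(self, wrapped):
        # Assumption: only NumPy-compatible input is handled in this sketch;
        # pandas and Polars branches would live inside the properties below.
        self._wrapped = np.asarray(wrapped)

    @property
    def categories(self) -> np.ndarray:
        # Sorted unique values of the wrapped vector.
        return np.unique(self._wrapped)

    @property
    def codes(self) -> np.ndarray:
        # Position of each element within the sorted categories.
        return np.searchsorted(self.categories, self._wrapped)

    @property
    def shape(self) -> tuple:
        return self._wrapped.shape


class CategoricalMatrixSketch:
    """Hypothetical stand-in that only delegates to a private CatVec."""

    def __init__(self, cat_vec):
        self._cat_vec = CatVec(cat_vec)

    @property
    def indices(self) -> np.ndarray:
        return self._cat_vec.codes

    @property
    def shape(self) -> tuple:
        # One row per observation, one column per category.
        return (self._cat_vec.shape[0], len(self._cat_vec.categories))


mat = CategoricalMatrixSketch(["x", "y", "x"])
```

The matrix class holds no backend-specific state of its own, so supporting another input type would only touch `CatVec`.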

Comment thread src/tabmat/categorical_matrix.py Outdated
Comment thread src/tabmat/categorical_matrix.py
 def __init__(
     self,
-    cat_vec: Union[list, np.ndarray, pd.Categorical],
+    cat_vec,

mypy might actually be useful here. It is quite good with union types and narrowing them in if-statements these days.

Member Author


mypy doesn't play nice with optional imports. I tried conditioning imports on type hinting to no avail.


Using TYPE_CHECKING didn't work? I think it is reasonable to assume both Pandas and Polars are installed in the dev environment, isn't it?

Member Author


> Using TYPE_CHECKING didn't work?

No. :\ I may have done it wrong though.
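For context, the TYPE_CHECKING pattern discussed in this thread usually looks something like the sketch below. The helper `backend_of` is hypothetical and not part of tabmat. The sketch shows only the general shape of the pattern and does not establish whether it resolves the mypy problem the author ran into.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only evaluated by the type checker, so neither package
    # needs to be installed at runtime.
    import pandas as pd
    import polars as pl


def backend_of(cat_vec: Union[list, "pd.Categorical", "pl.Series"]) -> str:
    """Return the top-level package that defines the input's type (hypothetical helper)."""
    if isinstance(cat_vec, list):
        # The type checker narrows cat_vec to list inside this branch.
        return "builtins"
    # Inspect the module name instead of calling isinstance, so that
    # pandas and Polars never have to be imported here.
    return type(cat_vec).__module__.partition(".")[0]


backend = backend_of(["a", "b", "a"])
```

The quoted annotations keep the function definition valid even when the optional packages are missing.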

@lbittarello
Member Author

I appreciate that the current code is messy and wrapper classes would help streamline it, but it'll look better once we are past the transition phase.

We currently have a cat attribute that returns a pandas.Categorical type, regardless of the input type. This PR transformed it into a property, which returns a polars.Series if the input is one and a pandas.Categorical otherwise. We use the _input_dtype attribute to keep track of the input type.

Eventually, we will do away with the cat property. At that point, we'll also be able to remove the _Categorical helper class and the _input_dtype attribute. We'll only need to extract the categories and indices from the input vector, which is straightforward.

If your concern is maintenance, it's all temporary, so it might not warrant a lot of infrastructure.

@cbourjau

> I appreciate that the current code is messy and wrapper classes would help streamline it, but it'll look better once we are past the transition phase.

As you prefer

Comment thread src/tabmat/constructor.py
)
indices.append(dense_mxidx)
if dense_columns:
    matrices.append(_dense_matrix(df, dense_columns, dtype))


The problem in glum comes from here. It should be df[dense_columns]. However, selecting by column name doesn't work here because of potential duplicate names. This is why we had logic that used the column index instead.
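The duplicate-name issue mentioned here can be reproduced with plain pandas. The data frame below is made up for illustration and is unrelated to the glum case itself.

```python
import pandas as pd

# Two columns share the name "x", which pandas allows.
df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["x", "x", "y"])

# Selecting by name returns every column called "x".
by_name = df[["x"]]

# Selecting by position returns exactly the requested column.
by_position = df.iloc[:, [0]]
```

Here `by_name` has two columns while `by_position` has one, which is why positional selection is the safer choice when names may repeat.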



Development

Successfully merging this pull request may close these issues.

Create SplitMatrix from polars data frame

5 participants