Skip to content

Use narwhals to support Polars, cuDF, Modin, etc.#388

Merged
Marc-Antoine Schmidt (MarcAntoineSchmidtQC) merged 40 commits intomainfrom
narwhals
Oct 25, 2024
Merged

Use narwhals to support Polars, cuDF, Modin, etc.#388
Marc-Antoine Schmidt (MarcAntoineSchmidtQC) merged 40 commits intomainfrom
narwhals

Conversation

@MarcAntoineSchmidtQC
Copy link
Copy Markdown
Member

Checklist

  • Added a CHANGELOG.rst entry

Comment thread pixi.lock
- conda: https://conda.anaconda.org/conda-forge/linux-64/binutils_linux-64-2.40-hb3c18ed_1.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.8-h4bc722e_7.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/c-ares-1.33.1-heb4867d_0.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/c-ares-1.32.3-h4bc722e_0.conda
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why it's asking to downgrade here

@MarcAntoineSchmidtQC Marc-Antoine Schmidt (MarcAntoineSchmidtQC) marked this pull request as ready for review September 12, 2024 18:33
@stanmart
Copy link
Copy Markdown
Collaborator

Martin Stancsics (stanmart) commented Sep 12, 2024

I think this will raise an exception if neither pandas nor polars is installed (which is now a posibility):

else:
indices, categories = pd.factorize(cat_vec, sort=True)

We could replace it with some numpy-based solution, which is probably not very efficient, but that's probably fine for now. E.g.,

    else:
        categories, indices = np.unique(cat_vec.to_numpy(), return_inverse=True)

@stanmart
Copy link
Copy Markdown
Collaborator

Same here, but the solution might be less straightforward for the non-pandas-or-polars case:

return pd.Categorical.from_codes(self.indices, categories=self.categories)

Maybe we could say that the property is deprecated and is only there for backwards compatibility, so we are not implementing it for non-pandas (or polars) input?

@stanmart
Copy link
Copy Markdown
Collaborator

I don't think we need this: anymore

try:
import polars as pl
except ImportError:
pl = None # type: ignore

Copy link
Copy Markdown
Collaborator

@stanmart Martin Stancsics (stanmart) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! All that remains is fixing these couple of issues in categorical_matrix.py.

It might also be nice to have a test for from_df which is not polars or pandas. Pyarrow is test dependency because of polars already; would it be okay if I added a test for pyarrow dataframes?

@stanmart
Copy link
Copy Markdown
Collaborator

I made the following updates to fix the issues above:

Deconstructing the cat argument of CatagoricalMatrix.__init__:

  • When the constructor of CategoricalMatrix gets a pandas or polars input, it uses Luca Bittarello (@lbittarello)'s solution
  • In any other case, it uses a pure numpy fallback. It is necessary as we cannot assume that pandas will be available. It is not super efficient when the input is already a (e.g. pyarrow) categorical, as we convert it into a string vector, then reconstruct the incdices/categories representation again. However, narwhals does not have a way to access the underlying representation of a categorical, and I'm not too keen on adding more custom cases besides pandas and polars.

The cat property

  • When the CategoricalMatrix was created based on a series that narwhals supports, use narwhals.new_series to create a categorical series of the same type. Special case pandas for performance reasons and differing behavior, as before.
  • When the CategoricalMatrix is created from a list, numpy array or similar, just return a dictionary with indices and categories. We cannot return a categorical as it is not guaranteed that any dataframe library is available.
  • As Luca Bittarello (@lbittarello) mentioned in the discussion of from_polars #370, cat is deprecated anyways, so it is just a temporary solution, and the extra complexity can be removed when we release the next major version.

@stanmart
Copy link
Copy Markdown
Collaborator

Marc-Antoine Schmidt (@MarcAntoineSchmidtQC), if we don't want to add new functionality to the deprecated cat property then let me know and I'll remove that part. I think the more important addition is handling non-pandas-or-polars categoricals when constructing the CategoricalMatrix, which is an orthogonal issue.

@stanmart
Copy link
Copy Markdown
Collaborator

Perhaps one final question: all three matrix types have an unpack method that returns the container storing the data. For CategoricalMatrix it used to be a pandas.Categorical, so it simply returned .cat(). We should probably either deprecate this function for the CategoricalMatrix class or show a FutureWarning and from the next breaking release return (indices, categories) instead.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, amazing, seeing how far you got with Narwhals independently here really warms my heart 🤗

If you're happy with this, may I suggest to use import narwhals.stable.v1 as nw instead of import narwhals as nw? This will future-proof you against breaking changes (if we need to make them), see https://narwhals-dev.github.io/narwhals/backcompat/ for an explanation

Well done here, and please always feel free to ping us from Narwhals if you have any questions / requests / comments 🙏

@stanmart
Copy link
Copy Markdown
Collaborator

Thanks so much for the review and the fantastic library Marco Edward Gorelli (@MarcoGorelli)! 🙏 Using the v1 API is a great idea. The only thing I ran into is that there does not seem to be a narwhals.stable.v1.dtypes.NumericType counterpart to narwhals.dtypes.NumericType. Is there an easy way to refer to any numeric type under the v1 API?

@MarcoGorelli
Copy link
Copy Markdown

thanks! yup, .is_numeric should work:

In [10]: import narwhals.stable.v1 as nw

In [11]: nw.from_native(pl.Series([1,2,3]), series_only=True).dtype.is_numeric()
Out[11]: True

In [12]: nw.from_native(pl.Series(['foo', 'bar']), series_only=True).dtype.is_numeric()
Out[12]: False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants