Feature engineering on polars and pandas dataframes for machine learning!
tubular implements pre-processing steps for tabular data commonly used in machine learning pipelines.
The transformers are compatible with scikit-learn Pipelines. Each has a transform method to apply the pre-processing step to data and a fit method to learn the relevant information from the data, if applicable.
The transformers in tubular are written in narwhals, so they are agnostic between pandas and polars dataframes and use the chosen (pandas/polars) API under the hood.
There are a variety of transformers to assist with:
- capping
- dates
- imputation
- mapping
- categorical encoding
- numeric operations
Here is a simple example of applying capping to two columns:
import polars as pl
from tubular.capping import CappingTransformer

# cap column "a" to the range [10, 20] and column "b" to the range [1, 3]
transformer = CappingTransformer(
    capping_values={"a": [10, 20], "b": [1, 3]},
)
test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
transformer.transform(test_df)
# ->
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 10  ┆ 3   ┆ 1   │
# │ 15  ┆ 2   ┆ 2   │
# │ 18  ┆ 3   ┆ 3   │
# │ 20  ┆ 1   ┆ 4   │
# └─────┴─────┴─────┘
We are currently in the process of rolling out support for some new features:
- to/from json methods for our transformers to allow json storage
- polars lazyframe support
You can track our progress here:
| transformer | polars_compatible | pandas_compatible | jsonable | lazyframe_compatible |
|---|---|---|---|---|
| AggregateColumnsOverRowTransformer | ✔️ | ✔️ | ❌ | ✔️ |
| AggregateRowsOverColumnTransformer | ✔️ | ✔️ | ❌ | ✔️ |
| ArbitraryImputer | ✔️ | ✔️ | ✔️ | ✔️ |
| BetweenDatesTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| CappingTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| DateDifferenceTransformer | ✔️ | ✔️ | ✔️ | ✔️ |
| DatetimeComponentExtractor | ✔️ | ✔️ | ✔️ | ✔️ |
| DatetimeInfoExtractor | ✔️ | ✔️ | ✔️ | ✔️ |
| DatetimeSinusoidCalculator | ✔️ | ✔️ | ✔️ | ❌ |
| DifferenceTransformer | ✔️ | ✔️ | ✔️ | ✔️ |
| GroupRareLevelsTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| MappingTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| MeanImputer | ✔️ | ✔️ | ✔️ | ❌ |
| MeanResponseTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| MedianImputer | ✔️ | ✔️ | ✔️ | ❌ |
| ModeImputer | ✔️ | ✔️ | ✔️ | ❌ |
| NullIndicator | ✔️ | ✔️ | ✔️ | ✔️ |
| OneDKmeansTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| OneHotEncodingTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| OutOfRangeNullTransformer | ✔️ | ✔️ | ✔️ | ❌ |
| RatioTransformer | ✔️ | ✔️ | ✔️ | ✔️ |
| SetValueTransformer | ✔️ | ✔️ | ✔️ | ✔️ |
| ToDatetimeTransformer | ✔️ | ✔️ | ✔️ | ✔️ |
The easiest way to get tubular is directly from PyPI:
pip install tubular
The documentation for tubular can be found on readthedocs.
Instructions for building the docs locally can be found in docs/README.
We utilise doctest to keep valid usage examples in the docstrings of transformers in the package, so please see these for getting started!
For bugs and feature requests please open an issue.
The test framework we are using for this project is pytest. To build the package locally and run the tests, follow the steps below.
First, clone the repo and move to the root directory:
git clone https://github.com/azukds/tubular.git
cd tubular
Next, install tubular and development dependencies:
pip install . -r requirements-dev.txt
Finally, run the test suite with pytest:
pytest
tubular is under active development, and we're super excited if you're interested in contributing!
See the CONTRIBUTING file for the full details of our working practices.
