SparkFeatureKit is a production-quality Python library for feature engineering, schema validation, and data quality profiling. All transforms are implemented on pandas DataFrames for local use and testing, with optional PySpark support for large-scale production workloads.
Installation
pip install sparkfeaturekit
# With PySpark support (quoted so shells such as zsh do not expand the brackets):
pip install "sparkfeaturekit[spark]"
# Development extras:
pip install "sparkfeaturekit[dev]"
Rolling aggregations (mean, sum, min, max, std, count)
Cumulative (transforms.cumulative)

| Function | Description |
| --- | --- |
| `cumulative_sum(df, col)` | Expanding cumulative sum |
| `cumulative_count(df, col)` | Expanding non-null count |
| `cumulative_mean(df, col)` | Expanding mean |

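
To make the three descriptions above concrete, here is a minimal sketch of what the cumulative transforms are described as computing, written in plain pandas. It does not call SparkFeatureKit itself; the `amount` column and the NaN handling shown are assumptions for illustration, and the library's actual behavior may differ.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"amount": [10.0, None, 5.0, 20.0]})

# Plain-pandas equivalents of the descriptions (assumed semantics).
cum_sum = df["amount"].cumsum()                # running total; NaN rows stay NaN
cum_count = df["amount"].expanding().count()   # non-null values seen so far
cum_mean = df["amount"].expanding().mean()     # mean of non-null values seen so far

print(pd.DataFrame({"sum": cum_sum, "count": cum_count, "mean": cum_mean}))
```

Note that `cumsum` skips the missing value when accumulating but leaves NaN in that row, while the expanding count and mean simply ignore it.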
Imputation (transforms.impute)

| Function | Description |
| --- | --- |
| `impute_mean(df, col)` | Fill NaN with column mean |
| `impute_median(df, col)` | Fill NaN with column median |
| `impute_mode(df, col)` | Fill NaN with most frequent value |
| `impute_constant(df, col, value)` | Fill NaN with a constant |
| `forward_fill(df, col)` | Carry forward the last valid value |

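
The imputation strategies listed above can be sketched in plain pandas as follows. The `*_sketch` functions and the `temp` column are hypothetical names for illustration; they mirror the documented `(df, col)` signatures but are not the library's implementation.

```python
import pandas as pd


def impute_mean_sketch(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = df.copy()  # leave the caller's DataFrame untouched
    out[col] = out[col].fillna(out[col].mean())
    return out


def impute_mode_sketch(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = df.copy()
    modes = out[col].mode()  # sorted; empty when the column is all NaN
    if not modes.empty:
        out[col] = out[col].fillna(modes.iloc[0])
    return out


def forward_fill_sketch(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = df.copy()
    out[col] = out[col].ffill()  # a leading NaN has nothing to carry forward
    return out


readings = pd.DataFrame({"temp": [1.0, None, 3.0, 3.0, None]})
mean_filled = impute_mean_sketch(readings, "temp")
mode_filled = impute_mode_sketch(readings, "temp")
ffilled = forward_fill_sketch(readings, "temp")
```

Each helper copies its input first, which matches the "immutable inputs" principle this README states for all transforms.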
Interaction (transforms.interaction)

| Function | Description |
| --- | --- |
| `ratio_feature(df, num_col, den_col, output_col)` | `num / (den + epsilon)` (safe division) |
| `product_feature(df, col1, col2, output_col)` | Element-wise product |
| `difference_feature(df, col1, col2, output_col)` | `col1 - col2` |

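
As an illustration of the `num / (den + epsilon)` formula above, here is a minimal pandas sketch. The function name `ratio_feature_sketch`, the default `epsilon=1e-9`, and the column names are assumptions; the README does not state the value the library actually uses.

```python
import pandas as pd


def ratio_feature_sketch(df, num_col, den_col, output_col, epsilon=1e-9):
    """Return a copy of df with output_col = num / (den + epsilon)."""
    out = df.copy()  # the input DataFrame is not modified
    out[output_col] = out[num_col] / (out[den_col] + epsilon)
    return out


events = pd.DataFrame({"clicks": [10.0, 0.0], "views": [100.0, 0.0]})
with_ctr = ratio_feature_sketch(events, "clicks", "views", "ctr")
print(with_ctr)
```

With a zero denominator the result is a finite `0.0` rather than NaN or infinity. An additive epsilon only guards denominators at or near zero from above; a denominator equal to `-epsilon` would still divide by zero.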
Design Principles
Pandas-first, PySpark-ready: All transforms work with pandas DataFrames for testing and local use. PySpark is an optional heavy dependency imported lazily.
Immutable inputs: Every transform returns a copy; the input DataFrame is never modified.
Graceful edge cases: Constant columns return zeros rather than raising; NaN inputs propagate as NaN; the logarithm of zero is guarded by an offset.
Pydantic v2 validation: Schema and quality objects use Pydantic for runtime type safety.
Logging, not printing: All feedback goes through Python's logging module.
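
Two of the principles above, lazy optional imports and logging instead of printing, can be sketched with the standard library alone. The helper `load_optional` and the logger name are hypothetical; how SparkFeatureKit actually defers its PySpark import is not described here.

```python
import importlib
import logging

logger = logging.getLogger("featurekit_sketch")  # hypothetical logger name


def load_optional(module_name):
    """Import a module on first use; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Report through logging so the caller controls verbosity.
        logger.info("Optional dependency %r is not installed", module_name)
        return None


# Nothing heavy is imported until this call runs.
pyspark = load_optional("pyspark")
if pyspark is None:
    logger.info("Falling back to the pandas code path")
```

Deferring the import this way keeps `import` time fast for pandas-only users and lets the package work without the optional dependency installed.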