A small, reliable Python 3.x pipeline for binary classification on tabular data.
Inputs: one CSV with target, numeric + categorical features.
Outputs: a trained model + validation metrics/plots, all written to disk.
- Python 3.9+ (tested on 3.12)
- OS: Linux
- Packages listed in `requirements.txt`
```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -r requirements.txt
```

To run the whole pipeline:
```bash
python3 src/main.py \
  --data_path data/example.csv \
  --output_path artifacts/run1 \
  --config configs/default.yaml
```

To generate the example dataset:

```bash
python3 scripts/make_data.py
```

To train only:

```bash
python3 scripts/train.py \
  --data_path data/example.csv \
  --output_dir artifacts/run1 \
  --config configs/default.yaml
```
To evaluate a saved model:

```bash
python3 scripts/eval.py \
  --model_path artifacts/run1/model.joblib \
  --data_path data/example.csv \
  --output_dir artifacts/run1_eval
```
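The commands above pass `configs/default.yaml`. The keys below are purely illustrative guesses at what such a config might contain (the actual schema is defined by the repo's YAML file, not by this sketch):

```yaml
# Hypothetical example -- see configs/default.yaml for the real schema.
seed: 42
target: label
model:
  name: RandomForestClassifier
search:
  n_iter: 20
  cv_folds: 5
  scoring: roc_auc
```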
Project layout:

```
.
├── README.md
├── requirements.txt
├── artifacts
│   └── run1
├── configs
│   └── default.yaml
├── data
│   └── example.csv
├── scripts
│   ├── eval.py
│   ├── make_data.py
│   ├── show_artifacts.py
│   └── train.py
├── src
│   ├── __init__.py
│   ├── data.py
│   ├── evaluate.py
│   ├── main.py
│   ├── model.py
│   ├── preprocess.py
│   └── utils.py
└── tests
    └── test_pipeline.py
```
The artifacts directory contains the outputs of the training pipeline. Each run will generate a new subdirectory (e.g., run1, run2, etc.).
- `best_params.json`: Best hyperparameters found during cross-validation.
- `confusion_matrix.png`: Confusion matrix of the model on the test set.
- `cv_results.csv`: Cross-validation results.
- `metrics.json`: Performance metrics of the model on the test set.
- `model.joblib`: Trained model.
- `pr_curve.png`: Precision-recall curve of the model on the test set.
- `roc_curve.png`: ROC curve of the model on the test set.
- `schema.json`: Schema of the data used for training.
To run the tests, execute

```bash
pytest -q -s
```

or, with verbose logs for the end-to-end test:

```bash
pytest -q -s tests/test_pipeline.py -k test_end_to_end -vv
```

To inspect a run's artifacts:

```bash
python3 scripts/show_artifacts.py --run_dir artifacts/run1 --open_plots
```

- One CSV with a binary target, <= 100 columns, <= 100k rows.
- Numeric features may have nulls/outliers.
- Categorical features have <= 5 unique non-null values.
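The assumptions above can be sanity-checked up front. This is a hypothetical helper, not part of the repo's `src/data.py`; the function name and thresholds are illustrative:

```python
import pandas as pd

def check_assumptions(df: pd.DataFrame, target: str, max_cardinality: int = 5) -> None:
    """Raise ValueError if the CSV violates the stated input assumptions."""
    if df[target].dropna().nunique() != 2:
        raise ValueError("target must be binary")
    if df.shape[0] > 100_000 or df.shape[1] > 100:
        raise ValueError("at most 100k rows and 100 columns")
    for col in df.select_dtypes(include=["object", "category"]).columns:
        if col != target and df[col].nunique(dropna=True) > max_cardinality:
            raise ValueError(f"categorical column {col!r} has too many levels")

# Nulls are fine -- only the binary target and categorical cardinality are enforced.
df = pd.DataFrame({
    "label": [0, 1, 0, None],
    "amount": [1.0, None, 3.5, 2.0],
    "kind": ["a", "b", "a", None],
})
check_assumptions(df, target="label")  # passes silently
```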
- Numeric: SimpleImputer(strategy="median") -> RobustScaler (outlier-resistant)
- Categorical: SimpleImputer(strategy="most_frequent") -> OneHotEncoder(handle_unknown="ignore")
- Implemented via ColumnTransformer in a single sklearn Pipeline.
- Optional SelectFromModel with L1-penalized LogisticRegression after OHE (drops weak one-hot columns).
- One estimator per run (YAML): RandomForestClassifier.
- RandomizedSearchCV, StratifiedKFold, scoring = ROC AUC.
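The search setup above can be sketched as follows. The parameter distributions and sizes here are placeholders, not the values from `configs/default.yaml`:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the real CSV-derived features.
X, y = make_classification(n_samples=200, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),   # sampled per candidate
        "max_depth": [None, 5, 10],
    },
    n_iter=5,
    scoring="roc_auc",                       # rank candidates by ROC AUC
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
```

`StratifiedKFold` keeps the class ratio stable across folds, which matters for ROC AUC on imbalanced targets; `search.best_params_` and `search.cv_results_` correspond to the `best_params.json` and `cv_results.csv` artifacts.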
- Metrics + ROC/PR/CM plots.
- Artifacts are written to disk for inspection and reuse.
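A rough sketch of how the evaluation metrics behind `metrics.json` and the confusion-matrix plot could be computed (synthetic data and a default 0.5 threshold stand in for the real pipeline's choices):

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # positive-class probabilities

metrics = {
    "roc_auc": roc_auc_score(y_te, scores),          # ROC curve summary
    "average_precision": average_precision_score(y_te, scores),  # PR curve summary
}
cm = confusion_matrix(y_te, scores >= 0.5)       # basis for confusion_matrix.png
metrics_json = json.dumps(metrics, indent=2)     # what metrics.json would hold
```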
- Simplicity & DRY: single entrypoint (src/main.py) reuses the existing scripts; minimal moving parts.
- Reproducible: configured via YAML; seeds configurable.
- Deployable: entire pipeline serialized as model.joblib.
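Because the serialized artifact is the entire sklearn `Pipeline`, a deployment only needs `joblib.load` plus raw feature rows. A self-contained sketch (a tiny stand-in pipeline instead of the repo's real `model.joblib`):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X, y = make_classification(n_samples=100, random_state=0)

# Preprocessing + estimator in one object, as in the real pipeline.
pipe = Pipeline([
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)                 # what training does at the end of a run

restored = joblib.load(path)            # what a deployment does
proba = restored.predict_proba(X[:5])[:, 1]  # positive-class probabilities
```

Scoring raw rows through the restored object reruns imputation/scaling/encoding, so no separate preprocessing code is needed at serving time.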