
MLOps Takehome Assignment

A small, reliable Python 3.x pipeline for binary classification on tabular data. Input: a single CSV containing the binary target plus numeric and categorical features. Outputs: a trained model and validation metrics/plots, all written to disk.


1) Environment Setup

Requirements

  • Python 3.9+ (tested on 3.12)
  • OS: Linux
  • Packages listed in requirements.txt

Create & activate a virtualenv, install deps

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -r requirements.txt

2) Run the pipeline

To run the whole pipeline

python3 src/main.py \
  --data_path data/example.csv \
  --output_path artifacts/run1 \
  --config configs/default.yaml

a) To only create the data

python3 scripts/make_data.py

b) To only train the model

python3 scripts/train.py \
  --data_path data/example.csv \
  --output_dir artifacts/run1 \
  --config configs/default.yaml

c) To only evaluate the model

python3 scripts/eval.py \
  --model_path artifacts/run1/model.joblib \
  --data_path data/example.csv \
  --output_dir artifacts/run1_eval

3) Project Structure

├── README.md
├── requirements.txt
├── artifacts
│   └── run1
├── configs
│   └── default.yaml
├── data
│   └── example.csv
├── scripts
│   ├── eval.py
│   ├── make_data.py
│   ├── show_artifacts.py
│   └── train.py
├── src
│   ├── __init__.py
│   ├── data.py
│   ├── evaluate.py
│   ├── main.py
│   ├── model.py
│   ├── preprocess.py
│   └── utils.py
└── tests
    └── test_pipeline.py

4) Artifacts

The artifacts directory contains the outputs of the training pipeline. Each run will generate a new subdirectory (e.g., run1, run2, etc.).

  • best_params.json: Best hyperparameters found during cross-validation.
  • confusion_matrix.png: Confusion matrix of the model on the test set.
  • cv_results.csv: Cross-validation results.
  • metrics.json: Performance metrics of the model on the test set.
  • model.joblib: Trained model.
  • pr_curve.png: Precision-recall curve of the model on the test set.
  • roc_curve.png: ROC curve of the model on the test set.
  • schema.json: Schema of the data used for training.
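Because the model artifact is a standard joblib dump, it can be reloaded and reused outside the pipeline. A minimal sketch of the round-trip (the toy data, the `model_demo.joblib` filename, and the plain LogisticRegression stand in for the real Pipeline saved at `artifacts/run1/model.joblib`):

```python
import joblib
from sklearn.linear_model import LogisticRegression

# Toy stand-in: the real artifact is the full preprocessing + model Pipeline.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)

# Persist and reload the same way model.joblib is produced and consumed.
joblib.dump(clf, "model_demo.joblib")
reloaded = joblib.load("model_demo.joblib")
print(list(reloaded.predict([[0.5], [2.5]])))
```

Since the serialized object carries its preprocessing along, the reloaded artifact accepts raw feature rows directly.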

5) Testing

To run the tests, execute

pytest -q -s

or, to run only the end-to-end test with verbose output

pytest -q -s tests/test_pipeline.py -k test_end_to_end -vv

6) Check the resulting artifacts

python3 scripts/show_artifacts.py --run_dir artifacts/run1 --open_plots

7) Methodology

Assumptions

  • One CSV with a binary target, <= 100 columns, <= 100k rows.

  • Numeric features may have nulls/outliers.

  • Categorical features have <= 5 unique non-null values.

Preprocessing

  • Numeric: SimpleImputer(median) -> RobustScaler (outlier-resistant)

  • Categorical: SimpleImputer(most_frequent) -> OneHotEncoder(handle_unknown="ignore")

  • Implemented via ColumnTransformer in a single sklearn Pipeline.

Feature Selection

  • Optional SelectFromModel with L1-penalized LogisticRegression after OHE (drops weak one-hot columns).
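A sketch of that optional selection step, on synthetic data where only the first two of six features carry signal (the data and the `C=0.1` penalty strength are illustrative assumptions, not the pipeline's actual defaults):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 are informative

# L1 penalty drives coefficients of uninformative columns toward zero;
# SelectFromModel then drops the columns whose coefficients are (near) zero.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape[1], "features kept of", X.shape[1])
```

After one-hot encoding this prunes rare or uninformative category indicator columns the same way.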

Model Selection

  • One estimator per run, chosen via the YAML config: RandomForestClassifier.

  • RandomizedSearchCV, StratifiedKFold, scoring = ROC AUC.
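The search setup can be sketched like this (the synthetic dataset, the parameter ranges, and `n_iter=5` are illustrative; the real values live in `configs/default.yaml`):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)

# Randomized search samples hyperparameter candidates from distributions;
# StratifiedKFold keeps the class balance identical across folds.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=5,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_params_` and `search.cv_results_` correspond to the `best_params.json` and `cv_results.csv` artifacts described above.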

Validation & Reporting

  • Metrics plus ROC, precision-recall, and confusion-matrix plots on the held-out test set.

  • Artifacts are written to disk for inspection and reuse.
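A minimal sketch of how such a metrics artifact is produced (the metric names, the `metrics_demo.json` filename, and the synthetic data are assumptions for illustration; the pipeline's actual `metrics.json` keys may differ):

```python
import json
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Score the held-out split and persist the metrics as JSON for later inspection.
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
metrics = {
    "roc_auc": roc_auc_score(y_te, proba),
    "accuracy": accuracy_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
with open("metrics_demo.json", "w") as f:
    json.dump(metrics, f, indent=2)
print(metrics)
```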

Principles

  • Simplicity & DRY: single entrypoint (src/main.py) reuses the existing scripts; minimal moving parts.

  • Reproducible: configured via YAML; seeds configurable.

  • Deployable: entire pipeline serialized as model.joblib.
