
MLOps Takehome Assignment

A small, reliable Python 3.x pipeline for binary classification on tabular data. Input: a single CSV containing the binary target plus numeric and categorical features. Outputs: a trained model and validation metrics/plots, all written to disk.


1) Environment Setup

Requirements

  • Python 3.9+ (tested on 3.12)
  • OS: Linux
  • Packages listed in requirements.txt

Create & activate a virtualenv, install deps

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pip
python3 -m pip install -r requirements.txt

2) Run the pipeline

To run the whole pipeline

python3 src/main.py \
  --data_path data/example.csv \
  --output_path artifacts/run1 \
  --config configs/default.yaml

a) To only create the data

python3 scripts/make_data.py

b) To only train the model

python3 scripts/train.py \
  --data_path data/example.csv \
  --output_dir artifacts/run1 \
  --config configs/default.yaml

c) To only evaluate the model

python3 scripts/eval.py \
  --model_path artifacts/run1/model.joblib \
  --data_path data/example.csv \
  --output_dir artifacts/run1_eval

3) Project Structure

├── README.md
├── requirements.txt
├── artifacts
│   └── run1
├── configs
│   └── default.yaml
├── data
│   └── example.csv
├── scripts
│   ├── eval.py
│   ├── make_data.py
│   ├── show_artifacts.py
│   └── train.py
├── src
│   ├── __init__.py
│   ├── data.py
│   ├── evaluate.py
│   ├── main.py
│   ├── model.py
│   ├── preprocess.py
│   └── utils.py
└── tests
    └── test_pipeline.py

4) Artifacts

The artifacts directory contains the outputs of the training pipeline. Each run will generate a new subdirectory (e.g., run1, run2, etc.).

  • best_params.json: Best hyperparameters found during cross-validation.
  • confusion_matrix.png: Confusion matrix of the model on the test set.
  • cv_results.csv: Cross-validation results.
  • metrics.json: Performance metrics of the model on the test set.
  • model.joblib: Trained model.
  • pr_curve.png: Precision-recall curve of the model on the test set.
  • roc_curve.png: ROC curve of the model on the test set.
  • schema.json: Schema of the data used for training.
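Because the model artifact is a standard joblib dump, it can be reloaded and reused outside the pipeline. A minimal sketch of the round-trip (the toy data, the `model_demo.joblib` filename, and the plain LogisticRegression stand in for the real Pipeline saved at `artifacts/run1/model.joblib`):

```python
import joblib
from sklearn.linear_model import LogisticRegression

# Toy stand-in: the real artifact is the full preprocessing + model Pipeline.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)

# Persist and reload the same way model.joblib is produced and consumed.
joblib.dump(clf, "model_demo.joblib")
reloaded = joblib.load("model_demo.joblib")
print(list(reloaded.predict([[0.5], [2.5]])))
```

Since the serialized object carries its preprocessing along, the reloaded artifact accepts raw feature rows directly.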

5) Testing

To run the tests, execute

pytest -q -s

or, to run only the end-to-end test with verbose output

pytest -q -s tests/test_pipeline.py -k test_end_to_end -vv

6) Check the resulting artifacts

python3 scripts/show_artifacts.py --run_dir artifacts/run1 --open_plots

7) Methodology

Assumptions

  • One CSV with a binary target, <= 100 columns, <= 100k rows.

  • Numeric features may have nulls/outliers.

  • Categorical features have <= 5 unique non-null values.

Preprocessing

  • Numeric: SimpleImputer(median) -> RobustScaler (outlier-resistant)

  • Categorical: SimpleImputer(most_frequent) -> OneHotEncoder(handle_unknown="ignore")

  • Implemented via ColumnTransformer in a single sklearn Pipeline.

Feature Selection

  • Optional SelectFromModel with L1-penalized LogisticRegression after OHE (drops weak one-hot columns).
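A sketch of that optional selection step, on synthetic data where only the first two of six features carry signal (the data and the `C=0.1` penalty strength are illustrative assumptions, not the pipeline's actual defaults):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 are informative

# L1 penalty drives coefficients of uninformative columns toward zero;
# SelectFromModel then drops the columns whose coefficients are (near) zero.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape[1], "features kept of", X.shape[1])
```

After one-hot encoding this prunes rare or uninformative category indicator columns the same way.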

Model Selection

  • One estimator per run, chosen via the YAML config: RandomForestClassifier.

  • RandomizedSearchCV, StratifiedKFold, scoring = ROC AUC.
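The search setup can be sketched like this (the synthetic dataset, the parameter ranges, and `n_iter=5` are illustrative; the real values live in `configs/default.yaml`):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)

# Randomized search samples hyperparameter candidates from distributions;
# StratifiedKFold keeps the class balance identical across folds.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=5,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_params_` and `search.cv_results_` correspond to the `best_params.json` and `cv_results.csv` artifacts described above.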

Validation & Reporting

  • Metrics plus ROC, precision-recall, and confusion-matrix plots on the held-out test set.

  • Artifacts are written to disk for inspection and reuse.
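A minimal sketch of how such a metrics artifact is produced (the metric names, the `metrics_demo.json` filename, and the synthetic data are assumptions for illustration; the pipeline's actual `metrics.json` keys may differ):

```python
import json
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Score the held-out split and persist the metrics as JSON for later inspection.
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
metrics = {
    "roc_auc": roc_auc_score(y_te, proba),
    "accuracy": accuracy_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
with open("metrics_demo.json", "w") as f:
    json.dump(metrics, f, indent=2)
print(metrics)
```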

Principles

  • Simplicity & DRY: single entrypoint (src/main.py) reuses the existing scripts; minimal moving parts.

  • Reproducible: configured via YAML; seeds configurable.

  • Deployable: entire pipeline serialized as model.joblib.
