Scikit pipeline optimization by vile319 · Pull Request #42 · Gleghorn-Lab/Protify

vile319 · 2026-02-19T20:30:46Z

Changes
main.py:
When --scikit_model_name is specified, skips LazyPredict and goes directly to tuning and training that model
Added --scikit_model_args flag to pass pre-specified hyperparameters as a JSON string (skips hyperparameter tuning entirely when provided)
Added the missing log_metrics() call in run_scikit_scheme so results are written to TSV and plots are generated
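
A minimal sketch of how the two flags could decide which stages run. This is not the code in main.py: build_parser and plan_scikit_run are hypothetical names, the --scikit_n_iter default is assumed, and rejecting --scikit_model_args without --scikit_model_name is an assumption about the intended behavior.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Only the flags mentioned in this PR; defaults here are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--scikit_model_name", type=str, default=None)
    parser.add_argument("--scikit_model_args", type=str, default=None)
    parser.add_argument("--scikit_n_iter", type=int, default=10)
    return parser


def plan_scikit_run(args: argparse.Namespace) -> dict:
    """Return which stages would run for the given flags (hypothetical helper)."""
    model_args = None
    if args.scikit_model_args:
        model_args = json.loads(args.scikit_model_args)
        if not isinstance(model_args, dict):
            raise ValueError("--scikit_model_args must be a JSON object")
        if args.scikit_model_name is None:
            # Assumption: hyperparameters only make sense for a named model.
            raise ValueError("--scikit_model_args requires --scikit_model_name")
    return {
        # LazyPredict is only needed when no model was named.
        "run_lazy_predict": args.scikit_model_name is None,
        # Tuning is skipped when hyperparameters are supplied up front.
        "run_tuning": model_args is None,
        "model_name": args.scikit_model_name,
        "model_args": model_args or {},
    }
```

With no flags both stages run, with only a model name LazyPredict is skipped, and with both flags the model is trained directly on the supplied hyperparameters.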

lazy_predict.py:
Precompute preprocessing once instead of refitting StandardScaler/Imputer per model
Added n_jobs=-1 to parallelizable models (RandomForest, etc.)
Removed slow models from LazyPredict (SVC, NuSVC, AdaBoost, KNeighbors, DecisionTree, LDA/QDA, etc.)
Fixed how XGBoost/LightGBM are added to the model dictionaries
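
The preprocessing change can be illustrated as follows. This is a sketch under assumptions, not the code in lazy_predict.py: preprocess_once and screen_models are hypothetical names, the imputer strategy is assumed to be the mean, and the three models are just examples of candidates.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def preprocess_once(X_train, X_test):
    # Fit the imputer and scaler a single time on the training split.
    prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
    return prep.fit_transform(X_train), prep.transform(X_test)


def screen_models(X_train, y_train, X_test, y_test):
    # Every candidate reuses the same preprocessed arrays instead of
    # wrapping each model in its own imputer/scaler pipeline.
    X_train_p, X_test_p = preprocess_once(X_train, X_test)
    models = {
        "RandomForestClassifier": RandomForestClassifier(
            n_estimators=50, n_jobs=-1, random_state=0
        ),
        "ExtraTreesClassifier": ExtraTreesClassifier(
            n_estimators=50, n_jobs=-1, random_state=0
        ),
        # Not every estimator benefits from n_jobs; left at its default here.
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train_p, y_train)
        scores[name] = model.score(X_test_p, y_test)  # accuracy on the test split
    return scores
```

The saving grows with the number of candidate models, since the imputer and scaler are fitted once rather than once per model.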

scikit_classes.py:
Fixed --scikit_model_name CLI arg mapping to model_name
Refactored _calculate_metrics() to use the shared metrics.py functions via EvalPrediction, returning: accuracy, f1, precision, recall, mcc, roc_auc, pr_auc (classification) and mse, rmse, r_squared, mae, spearman_rho, pearson_rho (regression)
All code paths (run_specific_model, find_best_classifier, find_best_regressor) now use _calculate_metrics
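
For reference, the listed metric keys can be produced roughly like this. The PR routes through the shared metrics.py functions via EvalPrediction; this sketch substitutes direct scikit-learn and SciPy calls instead. The function names, the binary-only classification case, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    matthews_corrcoef,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_prob, threshold=0.5):
    # Assumption: binary labels and positive-class probabilities.
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "f1": float(f1_score(y_true, y_pred)),
        "precision": float(precision_score(y_true, y_pred)),
        "recall": float(recall_score(y_true, y_pred)),
        "mcc": float(matthews_corrcoef(y_true, y_pred)),
        "roc_auc": float(roc_auc_score(y_true, y_prob)),
        "pr_auc": float(average_precision_score(y_true, y_prob)),
    }


def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    rho_s, _ = spearmanr(y_true, y_pred)
    rho_p, _ = pearsonr(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r_squared": float(r2_score(y_true, y_pred)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "spearman_rho": float(rho_s),
        "pearson_rho": float(rho_p),
    }
```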

New Usage
Full pipeline (LazyPredict → best model → hyperparameter tuning):
python main.py --model_names ESMC-600 --data_names gold-ppi --use_scikit --n_jobs -1

Skip LazyPredict, tune XGBoost:
python main.py --model_names ESMC-600 --data_names gold-ppi --use_scikit --scikit_model_name XGBClassifier --scikit_n_iter 10

Skip LazyPredict and hyperparameter tuning, and use known hyperparameters directly:
python main.py --model_names ESMC-600 --data_names gold-ppi --use_scikit --scikit_model_name XGBClassifier --scikit_model_args '{"n_estimators": 500, "max_depth": 7, "learning_rate": 0.1}'

…g for scikit probe
- Add --scikit_model_args flag to skip hyperparameter tuning with known params
- Add _calculate_metrics() in ScikitProbe returning accuracy, f1, mcc, precision, recall, roc_auc, pr_auc (classification) and mse, rmse, r_squared, mae, spearman_rho, pearson_rho (regression)
- Fix log_metrics() call missing from run_scikit_scheme so plots are generated
- Matches metrics output of neural probe for consistency

vile319 added 2 commits February 19, 2026 13:24

refactor: use shared compute_single_label_classification_metrics / compute_regression_metrics in scikit probe

dd450f4

vile319 wants to merge 2 commits into Gleghorn-Lab:main from vile319:scikit-pipeline-optimization
