Prometheus example with torch model #98

Open
rz4 wants to merge 4 commits into main from example-prometheus-torch

rz4 commented Apr 13, 2026

Summary

Example harness for a torch implementation of Prometheus' TemporalPredict model.
A script is provided to reproduce the model in torch and compare it against the baseline.
The reproduced torch model runs on APEIRON.

Motivation & Context

  • The baseline Prometheus model was provided as a Keras model.
  • The Keras model is not directly compatible with APEIRON.

Approach

  • Write the model implementation in Keras, referencing the npm1_pwr_model.keras model file.
  • Train the Keras model from scratch, then compare it with the base model.
  • Write the model implementation in torch.
  • Train the torch model from scratch, then compare it with the base model.
  • Select an early checkpoint from training as the starting model for APEIRON.
  • Adjust the APEIRON config to detect drift.
  • Run APEIRON and compare the checkpoint with the base Prometheus model.
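
For reviewers who want to see the shape of the torch port without opening the example script, here is a minimal sketch that reproduces the layer list printed in the training log (TorchTemporalModel with three LSTMs, two dropouts, and a linear head). The layer names and sizes come from that log; the forward wiring, the constructor arguments, and the per-time-step output are assumptions inferred from the logged shapes X=(N, 10, 12) and Y=(N, 10, 1), and may differ from the script in this PR.

```python
import torch
from torch import nn


class TorchTemporalModel(nn.Module):
    """Sketch of a stacked-LSTM regressor matching the logged layer list."""

    def __init__(self, n_features: int = 12, dropout: float = 0.1) -> None:
        super().__init__()
        # Layer names and sizes mirror the module repr in the training log.
        self.lstm1 = nn.LSTM(n_features, 128, batch_first=True)
        self.drop1 = nn.Dropout(p=dropout)
        self.lstm2 = nn.LSTM(128, 64, batch_first=True)
        self.drop2 = nn.Dropout(p=dropout)
        self.lstm3 = nn.LSTM(64, 32, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ASSUMPTION: layers are applied in declaration order.
        # nn.LSTM returns (output, (h_n, c_n)); only the per-step output is used.
        out, _ = self.lstm1(x)
        out = self.drop1(out)
        out, _ = self.lstm2(out)
        out = self.drop2(out)
        out, _ = self.lstm3(out)
        # nn.Linear acts on the last dimension, giving one value per time step.
        return self.head(out)


model = TorchTemporalModel().eval()
with torch.no_grad():
    y = model(torch.zeros(4, 10, 12))  # (batch, window, features)
print(tuple(y.shape))  # (4, 10, 1)
```

Returning the full sequence rather than only the last step is what makes the output line up with the logged target shape (N, 10, 1).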

Screenshots / Logs (optional)

Comparing the APEIRON-trained torch model with the baseline

test02_2025-07-23

Training Log (Torch Reproduction)

Loading data ...
  23 training files, 4 test files
TorchTemporalModel(
  (lstm1): LSTM(12, 128, batch_first=True)
  (drop1): Dropout(p=0.1, inplace=False)
  (lstm2): LSTM(128, 64, batch_first=True)
  (drop2): Dropout(p=0.1, inplace=False)
  (lstm3): LSTM(64, 32, batch_first=True)
  (head): Linear(in_features=32, out_features=1, bias=True)
)

--- Case 0 [2025-02-27] (torch): X=(12029, 10, 12), Y=(12029, 10, 1) (stats from 1 case(s)) ---
  Epoch   1/1  train_loss=0.3614  val_loss=0.1466
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_00.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_00.stats.json

=== Evaluation [torch_case_00] (torch) ===
  test 0 [2025-03-20]: R2=-0.9407  MAE=119793.776
  test 1 [2025-05-12]: R2=-11.7010  MAE=228090.305
  test 2 [2025-07-23]: R2=-11.0850  MAE=226998.640
  test 3 [2025-09-18]: R2=-13.6690  MAE=230475.408

--- Case 1 [2025-03-12] (torch): X=(12405, 10, 12), Y=(12405, 10, 1) (stats from 2 case(s)) ---
  Epoch   1/1  train_loss=0.0236  val_loss=0.0054
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_01.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_01.stats.json

=== Evaluation [torch_case_01] (torch) ===
  test 0 [2025-03-20]: R2=-0.2049  MAE=76539.053
  test 1 [2025-05-12]: R2=-0.1590  MAE=28893.670
  test 2 [2025-07-23]: R2=-0.1835  MAE=29299.993
  test 3 [2025-09-18]: R2=-0.1579  MAE=24972.951

--- Case 2 [2025-03-19] (torch): X=(14681, 10, 12), Y=(14681, 10, 1) (stats from 3 case(s)) ---
  Epoch   1/1  train_loss=0.0439  val_loss=0.0078
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_02.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_02.stats.json

=== Evaluation [torch_case_02] (torch) ===
  test 0 [2025-03-20]: R2=0.9812  MAE=5532.674
  test 1 [2025-05-12]: R2=0.4989  MAE=14322.213
  test 2 [2025-07-23]: R2=0.3201  MAE=32789.465
  test 3 [2025-09-18]: R2=0.1602  MAE=34926.435

...

--- Case 19 [2025-09-02] (torch): X=(6416, 10, 12), Y=(6416, 10, 1) (stats from 20 case(s)) ---
  Epoch   1/1  train_loss=0.0010  val_loss=0.0012
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_19.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_19.stats.json

=== Evaluation [torch_case_19] (torch) ===
  test 0 [2025-03-20]: R2=0.9714  MAE=3851.500
  test 1 [2025-05-12]: R2=0.9887  MAE=2989.728
  test 2 [2025-07-23]: R2=0.9926  MAE=2802.097
  test 3 [2025-09-18]: R2=0.9316  MAE=4084.549

--- Case 20 [2025-09-16] (torch): X=(18274, 10, 12), Y=(18274, 10, 1) (stats from 21 case(s)) ---
  Epoch   1/1  train_loss=0.0007  val_loss=0.0006
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_20.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_20.stats.json

=== Evaluation [torch_case_20] (torch) ===
  test 0 [2025-03-20]: R2=0.9760  MAE=4034.054
  test 1 [2025-05-12]: R2=0.9908  MAE=2303.845
  test 2 [2025-07-23]: R2=0.9974  MAE=1943.545
  test 3 [2025-09-18]: R2=0.9867  MAE=2424.433

--- Case 21 [2025-09-17] (torch): X=(31086, 10, 12), Y=(31086, 10, 1) (stats from 22 case(s)) ---
  Epoch   1/1  train_loss=0.0007  val_loss=0.0010
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_21.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_21.stats.json

=== Evaluation [torch_case_21] (torch) ===
  test 0 [2025-03-20]: R2=0.9723  MAE=5257.041
  test 1 [2025-05-12]: R2=0.9895  MAE=2249.922
  test 2 [2025-07-23]: R2=0.9983  MAE=1910.335
  test 3 [2025-09-18]: R2=0.9844  MAE=2494.421

--- Case 22 [2025-09-25] (torch): X=(7377, 10, 12), Y=(7377, 10, 1) (stats from 23 case(s)) ---
  Epoch   1/1  train_loss=0.0335  val_loss=0.0300
  checkpoint -> ./output/prometheus_torch/checkpoints/torch_case_22.pt
  stats      -> ./output/prometheus_torch/checkpoints/torch_case_22.stats.json

=== Evaluation [torch_case_22] (torch) ===
  test 0 [2025-03-20]: R2=0.9100  MAE=23678.136
  test 1 [2025-05-12]: R2=0.6486  MAE=37530.078
  test 2 [2025-07-23]: R2=0.6880  MAE=36466.123
  test 3 [2025-09-18]: R2=0.6278  MAE=36732.376

Saved final torch model to ./output/prometheus_torch/reproduced_prometheus.pt
Saved final stats sidecar to ./output/prometheus_torch/reproduced_prometheus.stats.json
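
The negative R2 values for the early checkpoints in the log above can look surprising. The sketch below assumes the logged R2 is the standard coefficient of determination and MAE is the mean absolute error in the target's original units; the PR does not state which implementation the script uses, and the helper names here are hypothetical. Under that definition, R2 drops below zero whenever the model predicts worse than a constant equal to the mean of the test targets.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score(y_true, y_true)             # exact predictions
baseline = r2_score(y_true, [2.5] * 4)         # always predict the mean
reversed_r2 = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
reversed_mae = mae(y_true, [4.0, 3.0, 2.0, 1.0])

print(perfect, baseline, reversed_r2, reversed_mae)  # 1.0 0.0 -3.0 2.0
```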

API / CLI Changes

  • foo.bar(x: int) -> str (new)
  • baz(qux: PathLike) (removed strict: bool)

Breaking Changes

  • None

Performance (optional)

N/A

Security & Privacy

  • No secrets committed
  • Input validation added where needed

Dependencies

  • TensorFlow

Testing Plan

N/A

Documentation

The README in the examples folder explains how to set up the data and reproduce the model in torch.

Checklist

  • Code formatted (Ruff) → ruff format --check
  • Lint passes (Ruff) → ruff check .
  • Types pass (mypy/pyright) → mypy src
  • Tests pass (pytest) → pytest -q
  • Backward compatibility considered
  • Adequate comments for tricky parts
  • CI green

Risk & Rollback Plan

Probably not needed at this early stage.

Notes for Reviewers
