Update FLAIR results (relMASE 0.838, relCRPS 0.587) #126
TakatoHonda wants to merge 4 commits into SalesforceAIResearch:main from
Conversation
Hey @TakatoHonda, thanks for the new script, it is very easy to use! One issue: I tried to replicate the results for some datasets and I am getting some differences. I wonder if there is a seed I would need to set, or whether some randomness is expected naturally? I followed the exact steps you shared above. Here are the results I got for the subset of datasets I tried (I also added the submitted MASE for ease of comparison): Any suggestions or insight about this gap?
The previous CSV had the correct aggregate relMASE=0.838 and relCRPS=0.587 claimed in the PR description, but the per-config MASE/CRPS rows were broken in ~48/97 configs: the MAE/MSE values are correct, but the gluonts seasonal_error denominator was computed under a contaminated evaluation environment, producing apparent MASE values that are either too low or too high depending on the config. This commit replaces the CSV with one freshly regenerated in an isolated worktree. The aggregate relMASE (0.8384) and relCRPS (0.5871) are unchanged; they match the PR description and independent reproductions (reviewer @cuthalionn).

Verified:
- The reviewer's 10 reported configs match ours to 4 decimals
- Two independent re-runs produce bit-identical MASE/CRPS values
- Aggregate relMASE=0.8384, relCRPS=0.5871 match the PR claim

Generated with gift_eval_reproduction.py pinned to flaircast==0.6.0 (PyPI), gluonts==0.15.1, pandas==2.2.3, numpy<2, N_SAMPLES=200.
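For context on why a corrupted denominator skews MASE while leaving MAE intact: MASE is the forecast MAE divided by the in-sample seasonal naive error of the training series. A minimal sketch with numpy only — the function names are illustrative, not the gluonts API:

```python
import numpy as np

def seasonal_error(train: np.ndarray, season_length: int) -> float:
    # Mean absolute in-sample difference at the seasonal lag.
    # This is the denominator that was corrupted in the old CSV;
    # the numerator (forecast MAE) was never affected.
    diffs = np.abs(train[season_length:] - train[:-season_length])
    return float(diffs.mean())

def mase(actual: np.ndarray, forecast: np.ndarray,
         train: np.ndarray, season_length: int) -> float:
    mae = float(np.abs(actual - forecast).mean())
    return mae / seasonal_error(train, season_length)
```

Because the scaling is a pure ratio, any error in `seasonal_error` shifts MASE up or down per config without touching MAE/MSE, which is exactly the pattern in the broken rows.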
Hi @cuthalionn, thanks for running the reproduction. That gap flagged a real issue, and it isn't randomness: the script is deterministic (seed = md5 of item_id) and your reproduction is correct.

What broke was the submission itself. The CSV I uploaded had the right aggregate relMASE/relCRPS, but about 48/97 of the per-config MASE rows and 57/97 of the CRPS rows had corrupted seasonal-error denominators from gluonts. The per-item MAE/MSE values were fine, so the predictions were right; only the MASE scaling was inconsistent. My best guess is environment contamination during the original evaluation run.

I re-ran the full 97 configs in an isolated git worktree (flaircast==0.6.0, gluonts==0.15.1, pandas==2.2.3, numpy<2, N_SAMPLES=200) and pushed the regenerated CSV as 3288fa4. Two independent re-runs produced bit-identical MASE/CRPS. Every one of the 10 configs you posted matches the new CSV to 4 decimals:

The aggregate numbers (relMASE=0.8384, relCRPS=0.5871) stay the same; they were computed on a separate normalization path that wasn't affected. Sorry for the confusion, and thanks for catching it.
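The "seed = md5 of item_id" determinism mentioned above can be sketched as follows. This is an illustration of the scheme, not the script's actual helper; the function name and the modulus choice are assumptions:

```python
import hashlib
import numpy as np

def rng_for_item(item_id: str) -> np.random.Generator:
    # Derive a stable seed from the item_id so every run, on every
    # machine, draws the same sample paths for the same series.
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    seed = int(digest, 16) % (2**32)
    return np.random.default_rng(seed)

# Same item_id -> bit-identical sample stream across runs.
a = rng_for_item("m4_hourly/H1").normal(size=3)
b = rng_for_item("m4_hourly/H1").normal(size=3)
```

This is why a correct reproduction should match to the last bit rather than merely to a tolerance.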
The script previously labeled gm(raw MASE) / gm(raw CRPS) as relMASE / relCRPS, which is wrong — the PR body claim (relMASE=0.838, relCRPS=0.587) is computed after dividing each row by the Seasonal Naive baseline. Running the old script printed relMASE=1.17 instead of 0.84, breaking reproducibility.

- Keeps the raw geometric means but relabels them MASE_gm/CRPS_gm
- Adds proper relMASE/relCRPS by fetching the Seasonal Naive results from the gift-eval repo and dividing each per-config value
- Updates the docstring's expected results to 0.8384 / 0.5871

Reported via reviewer feedback on SalesforceAIResearch/gift-eval#126.
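The relabeling boils down to: join per-config scores against the Seasonal Naive baseline, divide row-wise, then take the geometric mean of the ratios (rather than the geometric mean of the raw scores). A sketch assuming both frames share a `config` key — column names are illustrative, not the script's actual schema:

```python
import numpy as np
import pandas as pd

def rel_geomean(model: pd.DataFrame, baseline: pd.DataFrame,
                metric: str) -> float:
    # Align model and Seasonal Naive scores per config, divide
    # row-wise, then take the geometric mean of the ratios.
    merged = model.merge(baseline, on="config", suffixes=("", "_naive"))
    ratios = merged[metric] / merged[f"{metric}_naive"]
    return float(np.exp(np.log(ratios).mean()))
```

Note the order of operations matters: gm(model) / gm(baseline) only equals gm(model / baseline) when both frames cover exactly the same configs, which is why mislabeling the raw geomean produced 1.17 instead of 0.84.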
Second bug, same review — sorry about this. Looking at the script itself, I noticed the final print block labels gm(raw MASE) / gm(raw CRPS) as relMASE / relCRPS, which doesn't match how the PR-body numbers are computed. Fix pushed to TakatoHonda/FLAIR@d1d2cb4:
After the fix the script prints relMASE=0.8384 and relCRPS=0.5871, which matches the PR body and the CSV at 3288fa4.
The per-config 'domain' column in the previous CSV used different labels than the rest of the gift-eval leaderboard (e.g. results/seasonal_naive uses 'Sales' for car_parts, 'Econ/Fin' for m4_*, 'Healthcare' for covid_deaths, 'Transport' for m_dense). Mismatched labels would split domain-wise aggregates between FLAIR and every other model. This commit aligns all 97 rows with the domain values in results/seasonal_naive/all_results.csv. MASE / CRPS and every other numeric column are unchanged (the float-representation differences in this diff are ULP-level round-trip noise from re-writing the file; the stored float64 values are identical).

Affected configs:
- car_parts, restaurant, hierarchical_sales: Retail -> Sales
- m4_daily/hourly/monthly/quarterly/weekly/yearly: Finance -> Econ/Fin
- m_dense (4 configs): Finance -> Transport
- covid_deaths/D/short: Nature -> Healthcare
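The alignment described above can be sketched as a lookup against the reference CSV, leaving every numeric column untouched. Column and function names here are illustrative assumptions, not the actual leaderboard schema:

```python
import pandas as pd

def align_domains(flair: pd.DataFrame,
                  reference: pd.DataFrame) -> pd.DataFrame:
    # Take the canonical domain label per dataset from the
    # seasonal_naive reference frame; datasets absent from the
    # reference keep their existing label. Numeric columns pass
    # through unchanged.
    canonical = reference.set_index("dataset")["domain"]
    out = flair.copy()
    out["domain"] = out["dataset"].map(canonical).fillna(out["domain"])
    return out
```

Writing the frame back through pandas round-trips the float64 values exactly in memory; only the decimal text representation in the diff can differ, which is the ULP-level noise the commit message refers to.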
Summary
Update FLAIR (v0.6.0) on GIFT-Eval. Follows up on PR #122 with the reproduction script you asked for.
replication_code_available: Yes
code_link: examples/gift_eval_reproduction.py

Reproduction
```shell
pip install flaircast
git clone https://github.com/TakatoHonda/FLAIR.git
export GIFT_EVAL=/path/to/gift-eval-data
python FLAIR/examples/gift_eval_reproduction.py
```

What changed since PR #122
Package: `pip install flaircast` (v0.6.0)
GitHub: https://github.com/TakatoHonda/FLAIR