
Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting

TL;DR Traditional forecasting models optimize only for accuracy, ignoring an important practical issue: predictions that fluctuate significantly from day to day undermine confidence in production. This paper introduces the AC score, a metric that balances accuracy and temporal stability, achieving a 91% reduction in forecast volatility while improving multi-step prediction accuracy by up to 26%.

Paper Overview#

Title: Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting

Authors: C. Ma, G. Pomazkin, G.P. Saggese, P. Smith

Publication: arXiv preprint arXiv:2601.10863, 2026

Links: arXiv:2601.10863

Code: causify-ai/beyond_accuracy

  • Novel evaluation metric: Introduces "forecast accuracy and coherence score" (AC score) for assessing probabilistic forecasts across multiple time horizons, combining both prediction accuracy and temporal consistency in a single measure
  • Stability-aware approach: Ensures forecasts remain stable as the starting point shifts, addressing the critical problem of forecast volatility that affects decision-making in production systems
  • Empirical validation: Demonstrates a 91.1% reduction in out-of-sample forecast volatility (vertical variance) on the M4 Hourly benchmark while maintaining comparable or improved point forecast accuracy through differentiable optimization of seasonal ARIMA models

Abstract#

Traditional time series forecasting methods optimize for accuracy alone. This objective neglects temporal consistency: how consistently a model predicts the same future event as the forecast origin changes.

We introduce the forecast accuracy and coherence score (forecast AC score for short) for measuring the quality of probabilistic multi-horizon forecasts in a way that accounts for both multi-horizon accuracy and stability. Our score additionally allows user-specified weights to balance accuracy and consistency requirements.
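The paper gives the score's precise definition; as a rough illustration of the idea, the sketch below combines a MAPE-style accuracy term with a "vertical" stability term (the variance, across forecast origins, of predictions for the same target time), weighted by user-specified coefficients. The function name, array layout, and exact functional form here are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def ac_score_sketch(forecasts, actuals, w_acc=0.5, w_stab=0.5):
    """Illustrative accuracy-and-coherence score (not the paper's exact form).

    forecasts: shape (n_origins, horizon); forecasts[o, h] is the prediction
      made at origin o for the actual value at time o + h + 1.
    actuals: observed series of length >= n_origins + horizon.
    """
    n_origins, horizon = forecasts.shape
    # Accuracy term: MAPE over all (origin, horizon) pairs.
    targets = np.array([actuals[o + 1:o + 1 + horizon] for o in range(n_origins)])
    acc = np.mean(np.abs((forecasts - targets) / targets))
    # Stability term: variance, across origins, of the predictions that
    # target the same future time ("vertical variance").
    stab_terms = []
    for t in range(1, n_origins + horizon):
        preds = [forecasts[o, t - o - 1] for o in range(n_origins)
                 if 0 <= t - o - 1 < horizon]
        if len(preds) > 1:
            stab_terms.append(np.var(preds))
    stab = np.mean(stab_terms) if stab_terms else 0.0
    # User-specified weights trade off accuracy against consistency.
    return w_acc * acc + w_stab * stab
```

A perfectly accurate and perfectly consistent set of forecasts scores zero under this sketch; either kind of error pushes the score up.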

As an example application, we implement the score as a differentiable objective function for training seasonal auto-regressive integrated moving average (SARIMA) models and evaluate it on the M4 Hourly benchmark dataset.
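The paper's SARIMA parameterization and exact loss are not reproduced here. As a simplified stand-in, the sketch below fits a single AR(1) coefficient by gradient descent on a combined accuracy-plus-consistency loss, with a central finite-difference gradient in place of true automatic differentiation; the helper names (`ac_loss`, `fit_phi`) and the loss form are illustrative assumptions.

```python
import numpy as np

def multi_step_forecasts(series, phi, horizon):
    # Rolling-origin forecasts from an AR(1) recursion y_hat(t+h) = phi**h * y(t),
    # a toy stand-in for the paper's seasonal ARIMA model.
    n_origins = len(series) - horizon
    powers = phi ** np.arange(1, horizon + 1)
    return np.outer(series[:n_origins], powers)

def ac_loss(series, phi, horizon, w_stab=0.5):
    # Hypothetical accuracy-plus-consistency objective.
    F = multi_step_forecasts(series, phi, horizon)
    n_origins = len(F)
    targets = np.array([series[o + 1:o + 1 + horizon] for o in range(n_origins)])
    acc = np.mean((F - targets) ** 2)  # accuracy term (squared error)
    # Consistency term: adjacent origins' forecasts for the same target time
    # (F[o, h] and F[o + 1, h - 1] both predict time o + h + 1).
    stab = np.mean((F[:-1, 1:] - F[1:, :-1]) ** 2)
    return (1 - w_stab) * acc + w_stab * stab

def fit_phi(series, horizon, lr=0.01, steps=500):
    # Gradient descent; a finite-difference gradient stands in for autodiff.
    phi, eps = 0.5, 1e-6
    for _ in range(steps):
        grad = (ac_loss(series, phi + eps, horizon)
                - ac_loss(series, phi - eps, horizon)) / (2 * eps)
        phi -= lr * grad
    return phi
```

On a noiseless AR(1) series the fitted coefficient recovers the generating one, since both the accuracy and consistency terms vanish at the true value.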

Results demonstrate substantial improvements over traditional maximum likelihood estimation. Regarding stability, the AC-optimized model generated out-of-sample forecasts with 91.1% lower vertical variance than the MLE-fitted model. In terms of accuracy, the AC-optimized model achieved considerable gains at medium-to-long horizons: while one-step-ahead forecasts exhibited a 7.5% increase in MAPE, every subsequent horizon showed improved accuracy, with MAPE reductions of up to 26%. These results indicate that our metric successfully trains models to produce more stable and accurate multi-step forecasts at the cost of some degradation in one-step-ahead performance.
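A horizon-by-horizon MAPE comparison like the one reported above can be computed from rolling-origin forecasts along these lines; the array layout and helper names are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def mape_by_horizon(forecasts, actuals):
    # Per-horizon MAPE over rolling origins, where forecasts[o, h] is the
    # prediction made at origin o for the actual value at time o + h + 1.
    n_origins, horizon = forecasts.shape
    targets = np.array([actuals[o + 1:o + 1 + horizon] for o in range(n_origins)])
    return np.mean(np.abs((forecasts - targets) / targets), axis=0)

def relative_mape_change(mape_ac, mape_mle):
    # Negative values mean the AC-optimized model improved on the MLE baseline
    # at that horizon; a positive value at h = 1 with negatives beyond would
    # match the trade-off described above.
    return (mape_ac - mape_mle) / mape_mle
```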