Hey HN, I'm an ML engineer and I kept running into this annoying pattern: build a forecast model, get great R² and low RMSE, deploy it, watch it completely fail in production. It took me way too long to realize the issue: traditional metrics reward you for being numerically close to the actual value, but in real applications (trading, inventory, fraud detection) closeness doesn't matter. Getting the direction right matters.

Simple example: you predict a stock at $101 and the actual is $99. The error is tiny ($2), but you predicted UP, it went DOWN, and you lost money. RMSE says "great job!" Reality says you're broke.

So for my MSc thesis I built metrics that measure decision utility instead of statistical accuracy:
- FIS (Forecast Investment Score): directional accuracy + profit factor + gain-to-risk ratio
- CER (Confidence Efficiency Ratio): error normalized by price scale
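To make the stock example above concrete (and to sketch the kind of inputs FIS combines), here's a toy in Python. The numbers, the long/short trade rule, and the CER normalization are my illustrative stand-ins, not the exact thesis formulas:

```python
import math

last_price = 100.0
actuals = [99.0, 102.0, 98.0, 103.0]
preds_a = [101.0, 99.5, 100.5, 99.0]   # numerically close, wrong direction every time
preds_b = [95.0, 108.0, 94.0, 110.0]   # larger errors, right direction every time

def rmse(preds, actuals):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))

def directional_accuracy(preds, actuals, last):
    # Fraction of forecasts that call the up/down move from `last` correctly
    return sum((p > last) == (a > last) for p, a in zip(preds, actuals)) / len(preds)

def profit_factor(preds, actuals, last):
    # Simplified trade rule: go long if the forecast says up, short if down;
    # profit factor = gross gains / gross losses on the resulting P&L
    pnl = [(a - last) if p > last else (last - a) for p, a in zip(preds, actuals)]
    gains = sum(x for x in pnl if x > 0)
    losses = -sum(x for x in pnl if x < 0)
    return gains / losses if losses else float("inf")

def cer(preds, actuals):
    # "Error normalized by price scale": MAE over mean absolute price is one
    # plausible reading; the thesis may normalize differently
    mae = sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)
    return mae / (sum(abs(a) for a in actuals) / len(actuals))

# Model A wins on RMSE but loses money on every trade; model B is the opposite
print(rmse(preds_a, actuals), directional_accuracy(preds_a, actuals, last_price))
print(rmse(preds_b, actuals), directional_accuracy(preds_b, actuals, last_price))
```

Model A's RMSE is roughly half of model B's, yet its directional accuracy is 0% and its profit factor is 0, which is exactly the failure mode MSE-style selection hides.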
I validated it across equities and crypto. Models selected by FIS achieved 250% higher alpha and 4,141% better risk-adjusted returns than models selected by MSE. Same models, just a different selection criterion.

Then I got annoyed at a second problem: jumping straight to modeling without understanding the dataset. Wrong target selected, leakage in features, random splits on time-series data, etc. You only find out after wasting GPU time. So I built a second layer that diagnoses datasets before training:
- Detects time series vs. tabular data automatically
- Calculates a Dataset Health Score (leakage risk, signal quality, missingness)
- Recommends concrete models, with a validation strategy and feature engineering
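For a feel of what those checks involve, here's a stripped-down sketch of that kind of diagnostic pass. This is not the actual quantsynth.org implementation; the 0.99 correlation threshold and the specific checks are illustrative:

```python
import pandas as pd

def quick_diagnostics(df: pd.DataFrame, target: str) -> dict:
    report = {}
    # Time series vs. tabular: is any column datetime-typed?
    report["is_time_series"] = any(
        pd.api.types.is_datetime64_any_dtype(df[c]) for c in df.columns
    )
    # Leakage risk: numeric features almost perfectly correlated with the target
    num = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    corr = num.corrwith(df[target]).abs()
    report["leakage_suspects"] = corr[corr > 0.99].index.tolist()
    # Missingness: overall fraction of missing cells
    report["missing_frac"] = float(df.isna().mean().mean())
    return report
```

The real health score folds in more than this (signal quality, for one), but even a pass this simple catches the "feature is a transformed copy of the target" leak before any GPU time gets burned.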
The whole thing is live at quantsynth.org. Two main use cases:
- Upload predictions + actuals → see if they would have actually worked (not just how close they were)
- Upload a raw dataset → get a diagnostic report + model recommendations before training
I built it solo while working full-time because these problems were driving me crazy. I'm not trying to sell anything, just curious whether this resonates with other people or if I'm solving problems only I have. The evaluation side works in any domain where direction matters (trading, inventory, clinical decisions); the dataset diagnostics work for both time-series and tabular data. Would love feedback, especially on:
- Does the "great metrics but terrible production" problem happen to you?
- What other domains besides trading need decision-aligned evaluation?
- Are automated dataset diagnostics useful, or just noise?
Happy to answer questions about the methodology.