I've been improving RobustDiff for #165, but as I've done so I've found that the optimization results are actually getting worse. The degradation starts when I decouple `qr_ratio` into `log_q` and `log_r`, because the absolute scale of the two, not just their ratio, matters relative to `huberM`. Suddenly the method has more expressive power and can find lower-cost solutions, but those solutions have worse RMSE and $R^2$ against the true underlying derivative. I've realized $L(\Phi)$ isn't itself robust, so the hyperparameter optimization is favoring solutions that bend in the direction of outliers.
The solution? Pop a Huber loss into the RMSE evaluation, to get something we might call "Root Mean Huber Error" or "Robust Root Mean Error". That gives us another parameter to pick: the point where the Huber switches from quadratic to linear. If we normalize the inputs by their standard deviation, we can choose that parameter as some number of sigma, like 2 or 3, which would treat ~95% or ~99.7% of points as inliers, assuming a Gaussian distribution. This will change the scale of the first term in the loss function, so we'll have to rebalance it against the total variation smoothing term.
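A minimal sketch of what that metric could look like, in plain Python. The names `huber` and `robust_rmse` are hypothetical, not existing functions in the codebase. It assumes the textbook Huber function and that "the inputs" means the residuals, normalized by their own sample standard deviation. The factor of 2 and the rescale by sigma are my own choices, so that the result falls back to ordinary RMSE when no residual exceeds the threshold.

```python
import math
import statistics


def huber(r, M):
    """Textbook Huber function: quadratic for |r| <= M, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= M else M * (a - 0.5 * M)


def robust_rmse(estimate, reference, M=2.0):
    """Hypothetical "Root Mean Huber Error".

    Residuals are normalized by their standard deviation (an assumption),
    so M is expressed in units of sigma, e.g. 2 or 3.
    """
    residuals = [e - t for e, t in zip(estimate, reference)]
    sigma = statistics.pstdev(residuals)
    if sigma == 0.0:
        return 0.0  # all residuals identical; nothing to scale by
    mean_huber = sum(huber(r / sigma, M) for r in residuals) / len(residuals)
    # 2 * huber(r) == r**2 in the quadratic region, so with no outliers
    # (and zero-mean residuals) this equals the ordinary RMSE.
    return sigma * math.sqrt(2.0 * mean_huber)
```

One caveat with this sketch: the sample standard deviation is itself inflated by outliers, so a robust scale estimate such as the MAD might be a better normalizer.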
The TV term is potentially problematic too: if the optimizer is allowed to independently drive down the Huber parameter of RobustDiff's process term, it might favor approximating the 1-norm to make the solution artificially sparse. I've run into this in other cases too, and I disallow order 1 for TVR on most datasets because it can "hack" the loss function.
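To make the 1-norm concern concrete, here is the limit in question, assuming the textbook Huber function with threshold $M$ (RobustDiff's own definition may differ by a constant factor):

$$
\rho_M(r) =
\begin{cases}
\tfrac{1}{2} r^2, & |r| \le M \\
M\left(|r| - \tfrac{1}{2} M\right), & |r| > M
\end{cases}
\qquad\Longrightarrow\qquad
\lim_{M \to 0^+} \frac{\rho_M(r)}{M} = |r|
$$

So as the threshold shrinks, the penalty on the process residuals becomes a scaled 1-norm, the same kind of sparsity-promoting penalty that TV applies.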