Statistics: Linear Regression Error

Surprisingly, there is enough information in the five aid coefficients to compute the actual RMS error between the estimated output and the actual output, without having to store the individual values of the sets and compute the error from them.

It takes a huge amount of math to get there, but it's a job that needs to be done only once.

1. Error Estimation

I want to estimate the error of the linear regression. It is possible to do so from the aid coefficients alone, without using the original set values. This means dropping from 2N memory slots to just 5.
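As a sketch of the starting point (the symbols here are my own): call the regression gain G and bias B, and write the aid coefficients as the sample count N together with the running sums S_x = sum of x, S_y = sum of y, S_xx = sum of x^2, S_xy = sum of x*y, S_yy = sum of y^2. Expanding the square in the RMS error definition gives

E_{rms}^2 = \frac{1}{N}\sum_i \left(y_i - G x_i - B\right)^2 = \frac{1}{N}\left(S_{yy} - 2G\,S_{xy} - 2B\,S_y + G^2 S_{xx} + 2GB\,S_x + N B^2\right)

Every set value enters only through the accumulated sums, which is what makes the memory saving possible.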



I can go further. The function still depends on gain and bias; I want to express it as a function of just the aid coefficients. This is gruesome.
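Presumably the gruesome part is substituting the least-squares solutions for gain and bias, which in the same notation (with the shorthand D = N S_{xx} - S_x^2) read

G = \frac{N S_{xy} - S_x S_y}{D}, \qquad B = \frac{S_y S_{xx} - S_x S_{xy}}{D} = \frac{S_y - G\,S_x}{N}

Plugging these into the expanded square above leaves an expression in the aid coefficients only, at the price of a pile of nested fractions.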



I use Excel to make sure this is still correct; it takes less time if I catch errors early on.
Next, I eliminate the dependence on the intermediate coefficients and develop the individual products.
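As an example of what developing a product means here (still in my shorthand), the squared numerator of the gain expands as

(N S_{xy} - S_x S_y)^2 = N^2 S_{xy}^2 - 2N\,S_x S_y S_{xy} + S_x^2 S_y^2

and every other product of gain, bias and sums expands the same way, which is where the term count explodes.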



I inject the developed products into the numerator. In retrospect, I could have extracted Sum YY from the argument, as that is the only place where it appears.
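Indeed, over the common denominator N D^2, Sum YY multiplies only the D^2 term, so its whole contribution reduces to

\frac{S_{yy}\,D^2}{N\,D^2} = \frac{S_{yy}}{N}

and it could sit on its own in front of the rest of the fraction.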



I use Excel to verify the formula is still correct and to catch mistakes.
The next step is to collect and pack the coefficients.
Partial fraction decomposition is too taxing with so many terms, and Wolfram just gives up.

I do it the old-fashioned way, searching for terms that share common elements and, when grouped, show the same structure as the denominator or a part of it, allowing me to unpack the fraction cleanly with fewer terms and operations.
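For reference, one compact form this kind of grouping can land on, in the shorthand used above, is

E_{rms}^2 = \frac{(N S_{xx} - S_x^2)(N S_{yy} - S_y^2) - (N S_{xy} - S_x S_y)^2}{N^2\,(N S_{xx} - S_x^2)}

where the factor N S_{xx} - S_x^2 is exactly the regression denominator D, the kind of shared structure the grouping hunts for. Dimensionally the numerator carries units of x^2 y^2 and the denominator x^2, leaving y^2 under the square root.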





I inject the final developed argument into the original definition of the error.



I use dimensional analysis to check whether the function has physical meaning.
I use Excel to verify that the formula is still correct.
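The same kind of verification can be sketched in a few lines of Python (the data and variable names below are my own, just a made-up noisy line): compute the RMS error once from the aid coefficients and once directly from the residuals, and check that the two agree.

import math
import random

# A made-up noisy linear set: y = 2x + 1 plus Gaussian noise.
random.seed(1)
xs = [i * 0.1 for i in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

# Aid coefficients: nothing but accumulated sums.
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)

# RMS error straight from the aid coefficients.
d = n * sxx - sx * sx
rms_from_sums = math.sqrt((d * (n * syy - sy * sy) - (n * sxy - sx * sy) ** 2) / (n * n * d))

# Reference: RMS error computed from the individual residuals.
gain = (n * sxy - sx * sy) / d
bias = (sy - gain * sx) / n
rms_direct = math.sqrt(sum((y - (gain * x + bias)) ** 2 for x, y in zip(xs, ys)) / n)

print(rms_from_sums, rms_direct)  # the two values agree to rounding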

Next, I find the error as a function of the alternate set of aid coefficients, the averages.
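Writing the averages as m_x = S_x/N, m_y = S_y/N, m_{xx} = S_{xx}/N, m_{xy} = S_{xy}/N, m_{yy} = S_{yy}/N (my symbols again), the same expression collapses to

E_{rms}^2 = \left(m_{yy} - m_y^2\right) - \frac{\left(m_{xy} - m_x m_y\right)^2}{m_{xx} - m_x^2}

which reads as the variance of the output minus the part of it explained by the input.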



2. Meaning of the Error

The error has the same units as the output and is referred to a Gaussian distribution.



The estimated RMS error plays the role of one sigma: the number of sigmas is the multiplier applied to it. The center of the distribution is the trend line computed through linear regression.

The error lines can be drawn just by applying an offset to the trend line. 95% of the values (blue) are expected to lie between +2 RMS errors (gray) and -2 RMS errors (gray) from the trend line (black).
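In other words, with gain G and bias B from the regression, the band is

y_\pm(x) = G x + B \pm k\,E_{rms}

with k = 2 for the roughly 95% coverage of a Gaussian.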



I applied the linear regression formulae to a noisy line with equal amounts of linear dependence and noise in it. Just this once, I counted the number of samples within 0.5, 1.0, 1.5 and 2.0 sigma to see if things work out. They do!
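A sketch of that kind of check in Python (the set below is my own construction, not necessarily the one used here): fit a noisy line through the aid coefficients and count the fraction of samples within each multiple of the estimated RMS error, to be compared against the Gaussian expectations of roughly 38%, 68%, 87% and 95%.

import math
import random

# A made-up noisy line: unit slope plus Gaussian noise.
random.seed(2)
xs = [i * 0.01 for i in range(1000)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)

d = n * sxx - sx * sx
gain = (n * sxy - sx * sy) / d
bias = (sy - gain * sx) / n
rms = math.sqrt((d * (n * syy - sy * sy) - (n * sxy - sx * sy) ** 2) / (n * n * d))

# Fraction of samples within k estimated RMS errors of the trend line.
for k in (0.5, 1.0, 1.5, 2.0):
    inside = sum(abs(y - (gain * x + bias)) <= k * rms for x, y in zip(xs, ys))
    print(k, inside / n)  # expect roughly 0.38, 0.68, 0.87, 0.95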

The effective error diameter is four times the RMS error.

3. Recap



4. Conclusions

From the aid coefficients it's possible to compute the RMS error of the estimated output of the linear regression. 95% of the set values are expected to be found within a diameter of 4 times the RMS error, centered on the trend line.
This error allows for a much more robust correlation FOM.
