Statistics: Linear Regression


Linear Regression

1 Description

Linear Regression is a tool that computes the best-fit line, according to a linear law, between an input set X and an output set Y. The two sets must be the same size.

2 Definitions

Linear Regression estimates two parameters from the sets: Bias and Gain. Bias has the unit of measure of the output set Y. Gain has the unit of measure of the output set over that of the input set. The coefficients are computed so as to minimize the RMS (root mean square) error between the output estimate and the actual output.
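In symbols (using a for Bias and b for Gain as my own notation, since the original formulas are not shown), the model and the minimized error can be written as:

```latex
\hat{y}_i = a + b \, x_i
\qquad
E_{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```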




3 Aid Coefficients

Linear Regression can be defined so that it uses only five aid coefficients, making it computationally and memory friendly: only five elements are needed to make the calculations, regardless of the size of the sets.
Four of the coefficients are the same ones used to compute the statistics of a set (average, RMS average, variance and standard deviation). The one additional aid coefficient is the accumulation of the mixed products, and is used to compute the covariance.



The coefficients are computed by accumulating the two sets in just one pass, making their computation both stream compatible and fast. All very attractive properties.
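As a sketch of the idea (the class and field names below are my own, not from the original), a single-pass, stream-compatible accumulator might look like:

```python
# One-pass accumulation of the five aid coefficients, plus the sample count.
# Illustrative names: sum_x, sum_y, sum_xx, sum_yy, sum_xy.
class Accumulator:
    def __init__(self):
        self.n = 0
        self.sum_x = 0.0   # accumulation of the inputs
        self.sum_y = 0.0   # accumulation of the outputs
        self.sum_xx = 0.0  # accumulation of the squared inputs
        self.sum_yy = 0.0  # accumulation of the squared outputs
        self.sum_xy = 0.0  # accumulation of the mixed products

    def update(self, x, y):
        # Stream compatible: each sample is folded in, then can be discarded.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_yy += y * y
        self.sum_xy += x * y

acc = Accumulator()
for x, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:
    acc.update(x, y)
```

Each sample updates all five accumulators at once, so a single pass over the stream is enough.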

Rather than using the accumulators, I can compute the average of the accumulators over the size of the sets and use those averages as aid coefficients. Depending on the software implementation, one may be better than the other. You only need one set of aid coefficients: either the accumulators or the averages.




4 Variance, Covariance and Standard Deviation

Variance is defined as the average of the squares minus the square of the average. The covariance is defined the same way, simply using the mixed products of input and output. Covariance measures the sensitivity of the output to the input.
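A minimal sketch of those two definitions (the function names are mine, chosen for illustration):

```python
def mean(values):
    return sum(values) / len(values)

def variance(xs):
    # Average of the squares minus the square of the average.
    return mean([x * x for x in xs]) - mean(xs) ** 2

def covariance(xs, ys):
    # Same definition, using the mixed products of input and output.
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data following y = 2x + 1
```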





5 Linear Regression

Gain is defined as the covariance over the variance of the input. Bias is computed so that the average of the output estimate equals the average of the actual output set.
Linear regression cannot be computed if the variance of the input set is zero, i.e. if all the inputs are equal.
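Putting the definitions together (a sketch with my own function names, including a guard for the zero-variance case):

```python
def mean(values):
    return sum(values) / len(values)

def linear_regression(xs, ys):
    mx, my = mean(xs), mean(ys)
    var_x = mean([x * x for x in xs]) - mx ** 2
    cov_xy = mean([x * y for x, y in zip(xs, ys)]) - mx * my
    if var_x == 0.0:
        # All inputs are equal: the regression is undefined.
        raise ValueError("zero input variance")
    gain = cov_xy / var_x
    # Bias makes the average estimate match the average output.
    bias = my - gain * mx
    return bias, gain

bias, gain = linear_regression([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])  # y = 2x + 1
```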



6 Recap

Linear Regression as a function of the accumulators
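Since the original formulas are not shown, here is one standard way to write Gain and Bias in terms of the raw accumulators (my own notation, consistent with the definitions above):

```latex
b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
\qquad
a = \frac{\sum y_i - b \sum x_i}{n}
```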


Linear Regression as a function of the averages
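In terms of the averaged aid coefficients, the same formulas can be written (again in my own notation) as:

```latex
b = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}
\qquad
a = \bar{y} - b\,\bar{x}
```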



7 Example

A sensor measures a distance in [m], saving the timestamp of each measure in [s].
Time is the independent variable, position the dependent one. I want to use linear regression to get the slope and the error of the approximation.
The aid parameters are accumulated from the sets. From there, the sets are no longer needed and everything else can be estimated from the aid parameters alone.
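As an illustration (the sensor data below is made up, not the original example), the whole chain from aid parameters to Bias, Gain and RMS error might look like:

```python
import math

# Hypothetical sensor data: timestamps [s] and measured distances [m].
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [0.1, 2.0, 4.1, 5.9, 8.0]  # roughly d = 2*t

n = len(ts)
sum_x = sum(ts)
sum_y = sum(ds)
sum_xx = sum(t * t for t in ts)
sum_yy = sum(d * d for d in ds)
sum_xy = sum(t * d for t, d in zip(ts, ds))
# From here on, the raw sets are no longer needed.

mx, my = sum_x / n, sum_y / n
var_x = sum_xx / n - mx ** 2
var_y = sum_yy / n - my ** 2
cov = sum_xy / n - mx * my

gain = cov / var_x      # slope [m/s]
bias = my - gain * mx   # intercept [m]
# RMS error of the estimate, from the aid parameters alone.
rms_err = math.sqrt(max(var_y - cov ** 2 / var_x, 0.0))
```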



The linear regression outputs are:
  • Bias of the trend line
  • Gain of the trend line
  • Error of the estimation
It is truly remarkable that the error can be computed without keeping the individual data points. The error allows a meaningful estimate of the goodness of the fit.
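One way to write this, consistent with the definitions above and using my own symbols (sigma_xy for the covariance), is:

```latex
\sigma_e^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}
```

Every term on the right-hand side is computable from the five aid coefficients, so the RMS error of the fit needs no second pass over the data.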

The data (blue) is plotted in an XY chart.



The trend line (black) has been computed using linear regression.

Around the trend line, the dark gray lines show the error range at +/- one sigma of confidence, which is the RMS error of the estimation. 68% of the measures are expected to fall within plus or minus one standard deviation.

The light gray lines show the confidence range at +/- half sigma, within which 38% of the samples are expected to fall. At +/- two sigma, the confidence grows to 95%.
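Those percentages follow from the normal distribution; assuming Gaussian residuals, they can be checked with the error function:

```python
import math

def confidence(k):
    # Probability that a Gaussian sample falls within +/- k standard deviations.
    return math.erf(k / math.sqrt(2.0))

half = confidence(0.5)  # about 0.38
one = confidence(1.0)   # about 0.68
two = confidence(2.0)   # about 0.95
```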

8 Conclusions


By using either the accumulators of the two sets, or the averages of those accumulators, it's possible to estimate the Linear Regression parameters: Bias, Gain and even the RMS error of the estimate.
