Statistics: Correlation


Correlation
Correlation is a tool that measures how strongly two sets of numbers vary together. High correlation means that a variation in one set largely accounts for the variation in the other; low correlation means the two sets are largely independent and unrelated to each other.

1 Definitions

Previously I defined aid coefficients to compute statistics of a set without keeping the whole set in memory for processing. These allow computation of the linear regression parameters between two sets, as well as the average, RMS average, and standard deviation metrics.
Accumulators:




Averages:



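The formulas themselves are not reproduced here. As a sketch of the idea, assuming the usual six running sums (my naming, not the article's notation), the accumulators and the averages derived from them can look like this:

```python
import math

class Accumulators:
    """Running sums for two paired sets; the sets themselves are never stored."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0    # sums of x and y
        self.sxx = self.syy = 0.0  # sums of squares
        self.sxy = 0.0             # sum of cross products

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    # Averages for the X set; the Y versions are symmetric.
    def mean_x(self):
        return self.sx / self.n

    def rms_x(self):
        return math.sqrt(self.sxx / self.n)

    def std_x(self):
        # population standard deviation recovered from the running sums
        return math.sqrt(self.sxx / self.n - self.mean_x() ** 2)
```

The point of the six-sum form is that sets of any size reduce to constant memory, and the regression and correlation formulas below can all be evaluated from the same sums.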

2 Pearson Correlation

The simplest correlation coefficient is the Pearson correlation. It is simple, but it has a few limitations: it only detects linear correlation, its sign depends on the slope of the relationship, and it won't show other kinds of causal relationships.



Pearson correlation is defined as the covariance divided by the product of the standard deviations: r = cov(X, Y) / (σX · σY).



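A minimal sketch of that definition, computed from the same running sums as the accumulators above (function and variable names are mine):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of the
    standard deviations, evaluated from six running sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    cov = sxy / n - (sx / n) * (sy / n)
    sdx = math.sqrt(sxx / n - (sx / n) ** 2)
    sdy = math.sqrt(syy / n - (sy / n) ** 2)
    return cov / (sdx * sdy)
```

Note the sign behavior: a perfect descending line gives -1, not +1, which is one of the limitations discussed above.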
3 Goodness of Pearson Correlation

I need to evaluate the performance of the correlation metric.
I generate two sets, each controlled by two parameters. The first parameter controls the gain of an RNG with range -0.5 to +0.5. The second parameter controls the gain of a linear ramp from 0 to +1.
By controlling the gains, I can generate sets with controlled levels of noise.
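A sketch of such a generator, assuming Y is the sum of a uniform noise term scaled by one gain and a 0-to-1 ramp scaled by the other (parameter names are my own; the article does not name them):

```python
import random

def make_set(noise_gain, line_gain, n=1000, seed=0):
    """Generate a test set: X is a linear ramp from 0 to 1; Y mixes
    uniform noise in -0.5..+0.5 scaled by noise_gain with the same
    ramp scaled by line_gain."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    ys = [noise_gain * (rng.random() - 0.5) + line_gain * x for x in xs]
    return xs, ys
```

With `noise_gain = 0` the set is a perfect line; with `line_gain = 0` it is pure noise; intermediate ratios give the controlled noise levels used below.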
Below are two examples of sets generated with differing parameters:



Above, two sets with differing parameters (blue), their regression lines (black), and the estimated error range at two sigma (gray).
Set one is fully random in both axes, with a range of 1 in both.
Set two is linear in X; its Y mixes 1 part of noise with 4 parts of linear component, so it resembles a line.
Of course I want the second set to have a much higher correlation than the first. A fully uncorrelated set like the first should have a correlation value of zero; a perfect set, where all values line up on a line, should have a correlation of one. Ideally, the transition between the two should be linear.
To evaluate the performance, I create a figure of merit (FOM) that measures how much of a generated sample is random and how much is linear. This gives me a definite reference to evaluate correlation against. I would like the FOM to be zero if the sets are fully random and one if they are fully linear.
Below is the definition of the FOM:





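The formula itself is not reproduced here. One plausible definition, matching the requirement above (0 for fully random, 1 for fully linear, halfway for equal parts), is the linear gain's share of the total gain. This is purely my assumption, not the article's formula:

```python
def linearity_fom(noise_gain, line_gain):
    # Hypothetical FOM (an assumption): the fraction of the total
    # gain that is linear. 0 -> fully random, 1 -> fully linear.
    return line_gain / (noise_gain + line_gain)
```

Under this reading, the Random Ratio used later in §5.1 would simply be its complement.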
The next step is to generate many random sets with different parameters. I keep the source set linear and inject randomness only into the Y set, to keep things linear.
I used this chart to develop a correlation metric that scales linearly with my FOM, and is therefore better suited to evaluating the goodness of the linear regression.

4 Custom Linear Correlation

I can see that the correlation is better when the error range is small compared to the range of Y.
I have closed-form expressions for both. Using the sampled values with differing levels of randomness, I fine-tuned a closed-form correlation that behaves linearly with my randomness FOM.
Below are the definitions of the error, the standard deviations, and the correlation:




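The exact formulas are given above. As a sketch, one reading consistent with the later description in §5.2 ("comparing the Y spread of the line with the Y spread of the error and rooting it"), and with the 3%-versus-13% example there, is sqrt(1 - σ_err/σ_Y), where σ_err is the spread of the residuals about the regression line. This reading is my inference, not a transcription of the article's formula:

```python
import math

def linear_correlation(xs, ys):
    """Sketched Linear Correlation, assuming the form
    sqrt(1 - sigma_err / sigma_y): the Y spread of the regression
    residuals compared with the Y spread of the data, rooted."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    var_y = sum((y - my) ** 2 for y in ys) / n
    var_e = sum((y - (slope * x + intercept)) ** 2
                for x, y in zip(xs, ys)) / n
    return math.sqrt(1.0 - math.sqrt(var_e / var_y))
```

A perfect line gives 1 (zero residual spread), and pure noise drives the ratio toward 1 and the result toward 0, which is the desired behavior.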
5 Comparison between Pearson and Linear Correlation

I tested the random set generator at various settings, including large gains, to ensure the metric is independent of the magnitude of the signal.

5.1 Data Sets

Below, the full data points used to evaluate and build the new correlation metric:





Random Ratio is the dependent FOM I test the correlation against. Random Ratio = 0 means the signal is fully linear; Random Ratio = 1 means the signal is fully random. This FOM is meant to behave linearly, so Random Ratio = 0.5 means the signal was generated with equal amounts of random and linear components.


5.2 Example Difference

Pearson correlation is the formula found in the literature. I used positive sets to avoid having a negative Pearson coefficient. Linear Correlation is the new metric I generated.





In this example, Pearson correlation estimates this fit to be just 3% away from a perfect line, while Linear Correlation estimates it to be 13% away.
Clearly, this set should be much more than 3% away from linear. This holds across the board, with Pearson correlation overestimating the goodness of the fit and making it hard to differentiate between a poor set like this and a much better one.
Comparing the Y spread of the line with the Y spread of the error and rooting the result yields a much more meaningful metric. Linear Correlation behaves better even at a glance.
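To put a number on that difference, here is a quick side-by-side on a half-noisy set. The Linear Correlation form used here, sqrt(1 - σ_err/σ_Y), is my assumed reading of the description above, not the article's exact formula:

```python
import math
import random

def corr_metrics(xs, ys):
    """Return (Pearson r, sketched Linear Correlation) for paired sets."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    # residual spread about the fitted regression line
    var_e = sum((y - my - slope * (x - mx)) ** 2
                for x, y in zip(xs, ys)) / n
    lin = math.sqrt(1.0 - math.sqrt(var_e / (syy / n)))
    return r, lin

rng = random.Random(1)
xs = [i / 199 for i in range(200)]
ys = [x + 0.5 * (rng.random() - 0.5) for x in xs]  # linear ramp plus noise
r, lin = corr_metrics(xs, ys)
# For any imperfect fit the sketched Linear metric reports a worse
# (more pessimistic) figure than Pearson: lin < r.
```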



5.3 Full Range Difference

In the chart below I plot the Pearson (blue) and Linear (red) correlations for all the examples in the goodness data set. I applied a regression to the samples to estimate their linearity and error.




Pearson correlation is not linear: it behaves like a square root or a logarithm, flattening the difference between good fits. The bias is +8.7%, the gain is +1.054 [%/%] against the desired 1.000 [%/%], and the error diameter of a linear fit is 31.2 [%], which is particularly poor.
Even at a glance, Linear Correlation behaves much better, being closer to a line. The bias drops to a fifth, at 1.7 [%]; the gain is closer to one, at 1.029 [%/%]; and the error diameter halves to 16.9 [%], which is still poor but a much better fit.


6 Recap



7 Conclusions

Correlation measures how much of the change in the output set can be attributed to the input set.
Pearson correlation is non-linear and overestimates correlation; worse, it changes sign depending on the slope of the relationship.
Linear Correlation was devised to have better linearity and independence from the slope, and to more accurately reflect how much of the output variation can be attributed to the input according to a linear law.
This is actually needed in neural networks: my version of the learning algorithm uses advanced statistics to control the evolution of the neural network, and a more linear correlation tool is required to improve learning times.
