CORRELATION

When to use it
--------------

You have two variables (measured at least on an ordinal scale) and you
want to know how "related" they are, i.e., to what extent, when one
varies, the other also varies in a proportional manner (without
particular assumptions about whether they can be interpreted as
dependent vs. independent variable -- unlike in regression!)

For example:

- are the frequencies with which word A and word B occur with all
  content words in our corpus correlated? (= cosine measure)

- is there a correlation between the ratings assigned by judge A and
  judge B to a certain set of words on a 10-point scale?

- are log-likelihood and mutual information correlated?

- are log-likelihood values of collocations in corpus A more correlated
  with log-likelihood values of collocations in corpus B or with
  log-likelihood values of collocations in corpus C?

- is the perceived semantic transparency of a word (inversely)
  correlated with its frequency?

The math
--------

Covariance: given N data-points (x_1,y_1), (x_2,y_2), ..., (x_N,y_N):

COV = Sum(i->N) (x_i - AVG_x)(y_i - AVG_y)

(The usual 1/(N-1) normalization is omitted here, since it cancels out
when we compute the correlation below.)

Intuition about extreme cases:

Large positive covariance: if, whenever x_i is larger than the average
x value, y_i is larger than the average y value, and whenever x_i is
smaller than the x average, y_i is smaller than the y average, the
products will always be positive, and the covariance will be a large
positive value.

Large negative covariance: if, whenever x_i is larger than the average
x value, y_i is smaller than the average y value, and whenever x_i is
smaller than the x average, y_i is larger than the y average, the
products will always be negative, and the covariance will be a large
negative value.

Covariance close to zero: if sometimes the deviations of x_i and y_i
from their averages have the same sign (and thus their product is
positive) and sometimes they have opposite signs (and thus their
product is negative), the products will tend to cancel out when summed,
and the covariance will be close to 0.
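The three cases above can be checked with a short numeric sketch. This
is plain Python rather than the R used elsewhere in this handout, and
it implements the unnormalized sum-of-products definition of COV given
above; the data vectors are made up for illustration.

```python
def cov_sum(xs, ys):
    # Handout's definition: COV = Sum(i->N) (x_i - AVG_x)(y_i - AVG_y)
    # (no 1/(N-1) factor, which would cancel in the correlation anyway)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

x = [1, 2, 3, 4, 5]
print(cov_sum(x, [2, 4, 6, 8, 10]))   # y moves with x: 20.0 (large positive)
print(cov_sum(x, [10, 8, 6, 4, 2]))   # y moves against x: -20.0 (large negative)
print(cov_sum(x, [8, 2, 10, 4, 6]))   # no systematic relation: -2.0 (near zero)
```

The mixed-sign products in the last call mostly cancel, which is
exactly the "close to zero" case described above.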
Of course, size also matters: if, when the deviations of x_i and y_i
have the same sign, the two values are large, whereas when they have
opposite signs the values are small, the covariance will still be
large.

Incidentally, the variance is the covariance of a variable with itself
-- neat, isn't it?

Correlation is simply covariance divided by the product of the standard
deviations of the two variables:

SD_x = SQRT( Sum(i->N) (x_i - AVG_x)^2 )

COR = COV / (SD_x * SD_y)

Unlike covariance, which depends on the magnitude of x and y,
correlation always ranges between -1 (perfect inverse correlation) and
1 (perfect correlation), with 0 indicating no relation between x and y.

For reasons not discussed here, the square of the correlation
coefficient, r^2, can be interpreted as the "amount of variation in one
variable that can be explained by variation in the other variable". Pay
attention to whether a study reports r or r^2, since r^2 <= |r|.

Correlation with toy data in R:

> a <- (1:100)*3
> b <- (100:1)*10
> c <- rpois(100,50)
> cov(a,b)/(sd(a)*sd(b))
[1] -1
> cov(a,c)/(sd(a)*sd(c))
[1] -0.06243577
> cov(b,c)/(sd(b)*sd(c))
[1] 0.06243577
> cor(a,b)
[1] -1
> cor(a,c)
[1] -0.06243577
> cor(b,c)
[1] 0.06243577
> cor(data.frame(a,b,c))
            a           b           c
a  1.00000000 -1.00000000 -0.06243577
b -1.00000000  1.00000000  0.06243577
c -0.06243577  0.06243577  1.00000000

Plotting can be very instructive...

> plot(a,b)
> plot(a,c)

Notice the obvious relation to simple linear regression.

Spearman's correlation coefficient
----------------------------------

In many cases, we are willing to trust the differences in ranks, but
not the differences in magnitude between the values of our variables
(i.e., we believe that our measurements are too coarse to reliably
detect differences stronger than ordinal ones). For example, suppose
that we are comparing the log-likelihood values of the same
collocations as calculated on corpus a and corpus b:

           a       b
coll_1     20.33   19.33
coll_2     15.2    4.3
coll_3     2.1     5.2
...
Perhaps we are willing to assume that the fact that coll_2 is ranked
higher than coll_3 in corpus a and vice versa in corpus b is
meaningful, but the fact that the absolute differences in values among
the collocations differ between the two corpora is not something that
should be taken into account by the analysis (e.g., because of
differences in the absolute size of the two corpora, or because we do
not believe that collocativity should have interval-like properties).

Then we can replace the values with ranks, so that we preserve the
ordinal information but no longer take size differences into account:

           a   b
coll_1     1   1
coll_2     2   3
coll_3     3   2
...

Spearman's correlation coefficient is simply the correlation computed
on ranks instead of the original values:

> a <- rpois(100,50)
> b <- rpois(100,50)
> cor(a,b,method="spearman")
[1] 0.1132521
> cor(rank(a),rank(b))
[1] 0.1132521

A real life example:

> redata <- read.table("re_semtrasp_data",header=TRUE)
> cor(redata)
> cor(redata,method="spearman")

Caveats
-------

a) Stefan

Stefan hates correlations.

b) Meaningfulness of coefficients

There is no fixed rule on what counts as "highly correlated", although,
in most real-life examples I've seen, anything above .7 is pretty
impressive -- but of course this will depend a lot on the nature of
specific experiments (psycholinguists and phoneticians will often be
pretty happy with .4!)

Below, I'll show how classic hypothesis testing methods can be used to
assess the "significance" of a correlation. However, as I'll discuss,
this is often not so meaningful. A better approach is perhaps to put
the correlation coefficients for different variable couplings in the
same study in perspective (e.g., is the correlation coefficient between
subcorpora of the same domain systematically higher than the
correlation coefficient between subcorpora of different domains? is
log-likelihood more strongly correlated with human collocativity
judgments than mutual information? etc.)
Again, this relative approach has the problem that we still do not know
how large a difference must be to be considered meaningful (we could
use statistical tests to assess the significance of differences among
correlations, although we might run into independence-assumption
problems...)

c) Linearity assumption

The correlation coefficient measures linear relations. Non-linear
dependencies will not be captured. Try:

> x <- rnorm(100,0)
> y <- x^2
> cor(x,y)

What happened?

d) Ties

Massive ties are a problem for correlation coefficients, in particular
if we use ranks. This is a huge issue with corpus data because of
Zipf's law -- consider, e.g., comparisons of frequencies in different
corpora, collocations that occur 0 times in both corpora, etc. The most
obvious problem is that large numbers of uninterestingly low-frequency
tied values will inflate the correlation:

> x <- rpois(100,50)
> y <- rpois(100,50)
> cor(rank(x),rank(y))
[1] -0.02155369
> x2 <- c(x,rep(1,100))
> y2 <- c(y,rep(1,100))
> cor(rank(x2),rank(y2))
[1] 0.8545265

There are also technical problems (e.g., with the computation of
p-values).

e) Strong effect of a few large rank mismatches

> x <- c(1:100)
> y <- c(6:100,1:5)
> cor(x,y)
[1] 0.7149715

f) Causality (no!) and hidden "third" explanatory variables

Does low frequency "cause" word length? What is the relation between
semantic transparency and productivity? What did the US data about wine
drinking correlating with good health mean? (Hint: in the US, wine is
expensive...)

Significance of correlation (and its significance)
--------------------------------------------------

A simple example of "significance" testing for Spearman's correlation
coefficient. The general concept of hypothesis testing will be
discussed in future handouts, but the basic idea is that there is a
"true" underlying correlation between the two "populations" we are
comparing.
It is likely that the specific correlation we happened to obtain from
our sample is close to the true correlation, but it is highly unlikely
that it is exactly the same. Moreover, if we are unlucky, it could be
that we picked a number of unusual data points, such that the empirical
correlation is far away from the true correlation.

In the specific setting I am presenting here, we test the "null
hypothesis" that there is no relation between variables x and y, and
thus any combination of rankings of x and y is equally likely (given
this null hypothesis, if we sample N data-points a number of times and
compute the correlation coefficient each time, this will result in an
average correlation of 0, which is another way of saying that there is
no relation between the populations).

What is the probability of obtaining an empirical Spearman's
correlation coefficient of, say, 0.5 from a sample of N cases, under
the null hypothesis? The exact probability of obtaining a value as
large as 0.5 by chance can be computed by considering all possible rank
orders of y given a certain order of x, calculating the correlation
coefficient for all these possible orders (which, under the null
hypothesis, are equally likely), counting how many of these
coefficients are as large as 0.5, and dividing this quantity by the
number of possible orders (i.e., by the total number of correlation
coefficients we computed). This probability is the famous p-value,
indicating how likely the data we observed would be under the null
hypothesis.

Toy example:

x   y
1   1
2   3
3   2

AVG_x = AVG_y = 2

COV = (1-2)(1-2) + (2-2)(3-2) + (3-2)(2-2) = 1

SD_x = SD_y = SQRT( (1-2)^2 + (2-2)^2 + (3-2)^2 ) = SQRT(2)

COR = 1 / ( SQRT(2) * SQRT(2) ) = 0.5

How "good" is a correlation of 0.5 when N=3?
For a fixed x, we consider all N! = 6 permutations of y (we can fix x,
since the various correlation coefficients would be distributed in
exactly the same way if we considered any other order of x) and we
compute the corresponding correlation coefficients:

x   y1   y2   y3   y4   y5   y6
1   1    1    2    2    3    3
2   2    3    1    3    1    2
3   3    2    3    1    2    1

    1    0.5  0.5  -0.5 -0.5  -1

--> NB: the average is 0, as expected under the null hypothesis!

What is the probability that a value as large as 0.5 will occur by
chance?

One-tailed probability: we can safely assume that, if there is a
correlation (i.e., if the null hypothesis is false), this correlation
will be positive/negative -> the numerator of the p-value is given by
the number of correlation coefficients >= 0.5 (<= -0.5). Here, 3 of
the 6 coefficients are >= 0.5, so the one-tailed p-value is 3/6 = 0.5.

Two-tailed probability: we do not know the direction of the correlation
-> the numerator of the p-value is given by the number of correlation
coefficients whose absolute value is >= 0.5. Here, all 6 qualify, so
the two-tailed p-value is 1.

[Incidentally, this example also shows why correlation when the
variables have just a few rank-levels does not make much sense...]

This is an example of an exact/non-parametric test of significance: we
considered all possible outcomes of the experiment (in our case,
computing the correlation coefficient between x and y) under the null
hypothesis and counted the proportion of outcomes with values as large
as 0.5; given that under the null hypothesis all outcomes have equal
probability, this proportion is the exact probability of obtaining a
value as large as 0.5 under the null hypothesis.

Advantages of exact/non-parametric tests:

- easy to understand;
- (almost) assumption-free;
- accurate for small samples as well.

Main problem: computationally very expensive (typically used only for
small samples).

[BTW: we don't really count all the possible outcomes and the
corresponding correlation coefficients -- we can calculate these values
using formulas from combinatorics.]

(More common) alternative: asymptotic/parametric tests.

Asymptotic: for various distributions (including the count data
distributions linguists often deal with), the tests hold for N tending
to Inf.
Parametric: the tests assume that our sample comes from a population
with a certain distribution (e.g., the normal distribution) and use the
empirical data to estimate the parameters of that distribution.

(Asymptotic does not imply parametric, nor vice versa, although there
is a tendency for the two properties to co-occur.)

All else being equal, parametric tests tend to have more "power", i.e.,
they are less likely to accept the null hypothesis when it is false.

For standard correlation (aka Pearson's correlation), we use an
asymptotic/parametric method to compute p-values. An asymptotic method
is also used to compute the significance of Spearman's correlation for
high Ns.

Often, in estimation for significance testing, we find ourselves in a
win-win situation: for low Ns, asymptotic methods would not be
justified, but we can use exact tests; for high Ns, exact tests would
be too inefficient, but asymptotic tests become justified.

[A recent "third way" to significance testing: simulation (Monte Carlo
methods, the bootstrap): imho, very promising for linguistic problems.]

In R:

> cor.test(x,y)
> cor.test(x,y,method="spearman")
> cor.test(redata$sa,redata$mi)
> cor.test(redata$sa,redata$mi,method="spearman")

(Notice how the Spearman test complains that the p-value could be
incorrect because of ties.)

Reasons to be wary of significance testing:

- null hypotheses are often silly (how interesting is it that the
  correlation between frequency lists from two corpora is not 0???);

- thresholds for "interesting" results are arbitrary (psychology
  journals used to require p <= 0.05 to publish a paper);

- the assumptions necessary to compute the p-value are often
  unwarranted at best;

- significance levels are sometimes an excuse for social/behavioral
  scientists not to understand statistics!
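The exact test from the N=3 toy example can also be written out
directly. The following is a minimal Python sketch (the handout's own
code is in R); the `cor` function simply implements the COR formula
from "The math" section, applied to ranks:

```python
# Exact (non-parametric) significance test for Spearman's rho:
# fix the order of x, enumerate all N! rank orders of y, compute the
# correlation for each, and count how many are as large as the observed 0.5.
from itertools import permutations
from math import sqrt

def cor(xs, ys):
    # COR = COV / (SD_x * SD_y), with the handout's unnormalized sums
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mx) ** 2 for x in xs))
    sd_y = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

x = [1, 2, 3]  # ranks of x, held fixed
# Correlation on ranks for every possible order of y (rounded to remove
# floating-point noise, so that 0.4999... compares equal to 0.5).
coeffs = [round(cor(x, list(p)), 12) for p in permutations([1, 2, 3])]
print(sorted(coeffs))             # [-1.0, -0.5, -0.5, 0.5, 0.5, 1.0]
print(sum(coeffs) / len(coeffs))  # 0.0: average is 0 under the null hp

observed = 0.5  # rho for the toy data (y = 1, 3, 2)
p_one_tailed = sum(c >= observed for c in coeffs) / len(coeffs)
print(p_one_tailed)               # 3 of 6 coefficients >= 0.5 -> p = 0.5
```

The enumerated coefficients match the permutation table in the toy
example, and the same counting with abs(c) >= 0.5 gives the two-tailed
p-value of 1. For realistic N this brute-force enumeration explodes
(N! orders), which is exactly why exact tests are reserved for small
samples.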