CORRELATION

When to use it
--------------

You have two variables (measured at least on an ordinal scale) and you
want to know how "related" they are, i.e., to what extent, when one
varies, the other also varies in a proportional manner (without
particular assumptions about whether they can be interpreted as
dependent vs. independent variable -- unlike in regression!)

For example:

- are the frequencies with which word A and word B occur with all
  content words in our corpus correlated? (= cosine measure)

- is there a correlation between the ratings assigned by judge A and
  judge B to a certain set of words on a 10-point scale?

- are log-likelihood and mutual information correlated?

- are log-likelihood values of collocations in corpus A more correlated
  with log-likelihood values of collocations in corpus B or with
  log-likelihood values of collocations in corpus C?

- is the perceived semantic transparency of a word (inversely)
  correlated with its frequency?

The math
--------

Covariance: given N data-points (x_1,y_1), (x_2,y_2), ..., (x_N,y_N):

COV = Sum(i->N) (x_i - AVG_x)(y_i - AVG_y)

(The usual 1/(N-1) normalization is omitted here, since it cancels out
when we compute the correlation below.)

Intuition about extreme cases:

Large positive covariance: if, whenever x_i is larger than the average
x value, y_i is larger than the average y value, and whenever x_i is
smaller than the x average, y_i is smaller than the y average, the
products will always be positive, and the covariance will be a large
positive value.

Large negative covariance: if, whenever x_i is larger than the average
x value, y_i is smaller than the average y value, and whenever x_i is
smaller than the x average, y_i is larger than the y average, the
products will always be negative, and the covariance will be a large
negative value.

Covariance close to zero: if sometimes the deviations of x_i and y_i
from their averages have the same sign (and thus their product is
positive) and sometimes they have opposite signs (and thus their
product is negative), the products will tend to cancel out when summed,
and the covariance will be close to 0.
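The three cases above can be checked with a short numeric sketch. This
is plain Python rather than the R used elsewhere in this handout, and
it implements the unnormalized sum-of-products definition of COV given
above; the data vectors are made up for illustration.

```python
def cov_sum(xs, ys):
    # Handout's definition: COV = Sum(i->N) (x_i - AVG_x)(y_i - AVG_y)
    # (no 1/(N-1) factor, which would cancel in the correlation anyway)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

x = [1, 2, 3, 4, 5]
print(cov_sum(x, [2, 4, 6, 8, 10]))   # y moves with x: 20.0 (large positive)
print(cov_sum(x, [10, 8, 6, 4, 2]))   # y moves against x: -20.0 (large negative)
print(cov_sum(x, [8, 2, 10, 4, 6]))   # no systematic relation: -2.0 (near zero)
```

The mixed-sign products in the last call mostly cancel, which is
exactly the "close to zero" case described above.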
Of course, size also matters: if, when the deviations of x_i and y_i
have the same sign, the two values are large, whereas when they have
opposite signs the values are small, the covariance will still be
large.

Incidentally, the variance is the covariance of a variable with itself
-- neat, isn't it?

Correlation is simply covariance divided by the product of the standard
deviations of the two variables:

SD_x = SQRT( Sum(i->N) (x_i - AVG_x)^2 )

COR = COV / (SD_x * SD_y)

Unlike covariance, which depends on the magnitude of x and y,
correlation always ranges between -1 (perfect inverse correlation) and
1 (perfect correlation), with 0 indicating no relation between x and y.

For reasons not discussed here, the square of the correlation
coefficient, r^2, can be interpreted as the "amount of variation in one
variable that can be explained by variation in the other variable". Pay
attention to whether a study reports r or r^2, since r^2 <= |r|.

Correlation with toy data in R:

> a <- (1:100)*3
> b <- (100:1)*10
> c <- rpois(100,50)
> cov(a,b)/(sd(a)*sd(b))
[1] -1
> cov(a,c)/(sd(a)*sd(c))
[1] -0.06243577
> cov(b,c)/(sd(b)*sd(c))
[1] 0.06243577
> cor(a,b)
[1] -1
> cor(a,c)
[1] -0.06243577
> cor(b,c)
[1] 0.06243577
> cor(data.frame(a,b,c))
            a           b           c
a  1.00000000 -1.00000000 -0.06243577
b -1.00000000  1.00000000  0.06243577
c -0.06243577  0.06243577  1.00000000

Plotting can be very instructive...

> plot(a,b)
> plot(a,c)

Notice the obvious relation to simple linear regression.

Spearman's correlation coefficient
----------------------------------

In many cases, we are willing to trust the differences in ranks, but
not the differences in magnitude between the values of our variables
(i.e., we believe that our measurements are too coarse to reliably
detect differences stronger than ordinal ones). For example, suppose
that we are comparing the log-likelihood values of the same
collocations as calculated on corpus a and corpus b:

           a       b
coll_1     20.33   19.33
coll_2     15.2    4.3
coll_3     2.1     5.2
...
Perhaps we are willing to assume that the fact that coll_2 is ranked
higher than coll_3 in corpus a and vice versa in corpus b is
meaningful, but the fact that the absolute differences in values among
the collocations differ between the two corpora is not something that
should be taken into account by the analysis (e.g., because of
differences in the absolute size of the two corpora, or because we do
not believe that collocativity should have interval-like properties).

Then we can replace the values with ranks, so that we preserve the
ordinal information but no longer take size differences into account:

           a   b
coll_1     1   1
coll_2     2   3
coll_3     3   2
...

Spearman's correlation coefficient is simply the correlation computed
on ranks instead of the original values:

> a <- rpois(100,50)
> b <- rpois(100,50)
> cor(a,b,method="spearman")
[1] 0.1132521
> cor(rank(a),rank(b))
[1] 0.1132521

A real life example:

> redata <- read.table("re_semtrasp_data",header=TRUE)
> cor(redata)
> cor(redata,method="spearman")

Caveats
-------

a) Stefan

Stefan hates correlations.

b) Meaningfulness of coefficients

There is no fixed rule on what counts as "highly correlated", although,
in most real-life examples I've seen, anything above .7 is pretty
impressive -- but of course this will depend a lot on the nature of
specific experiments (psycholinguists and phoneticians will often be
pretty happy with .4!)

Below, I'll show how classic hypothesis testing methods can be used to
assess the "significance" of a correlation. However, as I'll discuss,
this is often not so meaningful. A better approach is perhaps to put
the correlation coefficients for different variable couplings in the
same study in perspective (e.g., is the correlation coefficient between
subcorpora of the same domain systematically higher than the
correlation coefficient between subcorpora of different domains? is
log-likelihood more strongly correlated with human collocativity
judgments than mutual information? etc.)
Again, this relative approach has the problem that we still do not know
how large a difference must be to be considered meaningful (we could
use statistical tests to assess the significance of differences among
correlations, although we might run into independence-assumption
problems...)

c) Linearity assumption

The correlation coefficient measures linear relations. Non-linear
dependencies will not be captured. Try:

> x <- rnorm(100,0)
> y <- x^2
> cor(x,y)

What happened?

d) Ties

Massive ties are a problem for correlation coefficients, in particular
if we use ranks. This is a huge issue with corpus data because of
Zipf's law -- consider, e.g., comparisons of frequencies in different
corpora, collocations that occur 0 times in both corpora, etc. The most
obvious problem is that large numbers of uninterestingly low-frequency
tied values will inflate the correlation:

> x <- rpois(100,50)
> y <- rpois(100,50)
> cor(rank(x),rank(y))
[1] -0.02155369
> x2 <- c(x,rep(1,100))
> y2 <- c(y,rep(1,100))
> cor(rank(x2),rank(y2))
[1] 0.8545265

There are also technical problems (e.g., with the computation of
p-values).

e) Strong effect of a few large rank mismatches

> x <- c(1:100)
> y <- c(6:100,1:5)
> cor(x,y)
[1] 0.7149715

f) Causality (no!) and hidden "third" explanatory variables

Does low frequency "cause" word length? What is the relation between
semantic transparency and productivity? What did the US data about wine
drinking correlating with good health mean? (Hint: in the US, wine is
expensive...)

Significance of correlation (and its significance)
--------------------------------------------------

A simple example of "significance" testing for Spearman's correlation
coefficient. The general concept of hypothesis testing will be
discussed in future handouts, but the basic idea is that there is a
"true" underlying correlation between the two "populations" we are
comparing.
It is likely that the specific correlation we happened to obtain from
our sample is close to the true correlation, but it is highly unlikely
that it is exactly the same. Moreover, if we are unlucky, it could be
that we picked a number of unusual data points, such that the empirical
correlation is far away from the true correlation.

In the specific setting I am presenting here, we test the "null
hypothesis" that there is no relation between variables x and y, and
thus any combination of rankings of x and y is equally likely (given
this null hypothesis, if we sample N data-points a number of times and
compute the correlation coefficient each time, this will result in an
average correlation of 0, which is another way of saying that there is
no relation between the populations).

What is the probability of obtaining an empirical Spearman's
correlation coefficient of, say, 0.5 from a sample of N cases, under
the null hypothesis? The exact probability of obtaining a value as
large as 0.5 by chance can be computed by considering all possible rank
orders of y given a certain order of x, calculating the correlation
coefficient for all these possible orders (which, under the null
hypothesis, are equally likely), counting how many of these
coefficients are as large as 0.5, and dividing this quantity by the
number of possible orders (i.e., by the total number of correlation
coefficients we computed). This probability is the famous p-value,
indicating how likely the data we observed would be under the null
hypothesis.

Toy example:

x   y
1   1
2   3
3   2

AVG_x = AVG_y = 2

COV = (1-2)(1-2) + (2-2)(3-2) + (3-2)(2-2) = 1

SD_x = SD_y = SQRT( (1-2)^2 + (2-2)^2 + (3-2)^2 ) = SQRT(2)

COR = 1 / ( SQRT(2) * SQRT(2) ) = 0.5

How "good" is a correlation of 0.5 when N=3?
For a fixed x, we consider all N! = 6 permutations of y (we can fix x,
since the various correlation coefficients would be distributed in
exactly the same way if we considered any other order of x) and we
compute the corresponding correlation coefficients:

x   y1   y2   y3   y4   y5   y6
1   1    1    2    2    3    3
2   2    3    1    3    1    2
3   3    2    3    1    2    1

    1    0.5  0.5  -0.5 -0.5  -1

--> NB: the average is 0, as expected under the null hypothesis!

What is the probability that a value as large as 0.5 will occur by
chance?

One-tailed probability: we can safely assume that, if there is a
correlation (i.e., if the null hypothesis is false), this correlation
will be positive/negative -> the numerator of the p-value is given by
the number of correlation coefficients >= 0.5 (<= -0.5). Here, 3 of
the 6 coefficients are >= 0.5, so the one-tailed p-value is 3/6 = 0.5.

Two-tailed probability: we do not know the direction of the correlation
-> the numerator of the p-value is given by the number of correlation
coefficients whose absolute value is >= 0.5. Here, all 6 qualify, so
the two-tailed p-value is 1.

[Incidentally, this example also shows why correlation when the
variables have just a few rank-levels does not make much sense...]

This is an example of an exact/non-parametric test of significance: we
considered all possible outcomes of the experiment (in our case,
computing the correlation coefficient between x and y) under the null
hypothesis and counted the proportion of outcomes with values as large
as 0.5; given that under the null hypothesis all outcomes have equal
probability, this proportion is the exact probability of obtaining a
value as large as 0.5 under the null hypothesis.

Advantages of exact/non-parametric tests:

- easy to understand;
- (almost) assumption-free;
- accurate for small samples as well.

Main problem: computationally very expensive (typically used only for
small samples).

[BTW: we don't really count all the possible outcomes and the
corresponding correlation coefficients -- we can calculate these values
using formulas from combinatorics.]

(More common) alternative: asymptotic/parametric tests.

Asymptotic: for various distributions (including the count data
distributions linguists often deal with), the tests hold for N tending
to Inf.
Parametric: the tests assume that our sample comes from a population
with a certain distribution (e.g., the normal distribution) and use the
empirical data to estimate the parameters of that distribution.

(Asymptotic does not imply parametric, nor vice versa, although there
is a tendency for the two properties to co-occur.)

All else being equal, parametric tests tend to have more "power", i.e.,
they are less likely to accept the null hypothesis when it is false.

For standard correlation (aka Pearson's correlation), we use an
asymptotic/parametric method to compute p-values. An asymptotic method
is also used to compute the significance of Spearman's correlation for
high Ns.

Often, in estimation for significance testing, we find ourselves in a
win-win situation: for low Ns, asymptotic methods would not be
justified, but we can use exact tests; for high Ns, exact tests would
be too inefficient, but asymptotic tests become justified.

[A recent "third way" to significance testing: simulation (Monte Carlo
methods, the bootstrap): imho, very promising for linguistic problems.]

In R:

> cor.test(x,y)
> cor.test(x,y,method="spearman")
> cor.test(redata$sa,redata$mi)
> cor.test(redata$sa,redata$mi,method="spearman")

(Notice how the Spearman test complains that the p-value could be
incorrect because of ties.)

Reasons to be wary of significance testing:

- null hypotheses are often silly (how interesting is it that the
  correlation between frequency lists from two corpora is not 0???);

- thresholds for "interesting" results are arbitrary (psychology
  journals used to require p <= 0.05 to publish a paper);

- the assumptions necessary to compute the p-value are often
  unwarranted at best;

- significance levels are sometimes an excuse for social/behavioral
  scientists not to understand statistics!
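The exact test from the N=3 toy example can also be written out
directly. The following is a minimal Python sketch (the handout's own
code is in R); the `cor` function simply implements the COR formula
from "The math" section, applied to ranks:

```python
# Exact (non-parametric) significance test for Spearman's rho:
# fix the order of x, enumerate all N! rank orders of y, compute the
# correlation for each, and count how many are as large as the observed 0.5.
from itertools import permutations
from math import sqrt

def cor(xs, ys):
    # COR = COV / (SD_x * SD_y), with the handout's unnormalized sums
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mx) ** 2 for x in xs))
    sd_y = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

x = [1, 2, 3]  # ranks of x, held fixed
# Correlation on ranks for every possible order of y (rounded to remove
# floating-point noise, so that 0.4999... compares equal to 0.5).
coeffs = [round(cor(x, list(p)), 12) for p in permutations([1, 2, 3])]
print(sorted(coeffs))             # [-1.0, -0.5, -0.5, 0.5, 0.5, 1.0]
print(sum(coeffs) / len(coeffs))  # 0.0: average is 0 under the null hp

observed = 0.5  # rho for the toy data (y = 1, 3, 2)
p_one_tailed = sum(c >= observed for c in coeffs) / len(coeffs)
print(p_one_tailed)               # 3 of 6 coefficients >= 0.5 -> p = 0.5
```

The enumerated coefficients match the permutation table in the toy
example, and the same counting with abs(c) >= 0.5 gives the two-tailed
p-value of 1. For realistic N this brute-force enumeration explodes
(N! orders), which is exactly why exact tests are reserved for small
samples.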