R BASICS: FORMATS, PLOTS AND ZIPF'S LAW

Basically, R likes data in tab-delimited columns with a header (but other formats, e.g. comma-delimited text from Excel, should also work):

r       f
1       62642
2       35971
3       27831
4       25608
5       21883
...     ...

On the command line, start R from the directory where the data live; in the GUI, move to that directory via the menu commands.

R has many similarities with the command line, e.g.:

> ls()
> rm(list=ls())

plus some form of tab completion (at least for names of files in the current directory), and arrow keys to navigate the history. However, typing a command/function without brackets displays its source code instead of running it; e.g., try:

> ls

Documentation:

> help(cor)
> ?cor
> apropos("cor")
> help.start()

The most common way to read data in (tab completion should work when typing the file name):

> rf <- read.table("brown.rf", header=TRUE)

Simply typing an "object" shows its contents:

> rf

(Not necessarily a good idea ;-) )

Accessing data points inside the objects:

> rf$f[1:10]

Descriptive statistics:

> summary(rf)
       r               f
 Min.   :    1   Min.   :    1.00
 1st Qu.:13270   1st Qu.:    1.00
 Median :26538   Median :    2.00
 Mean   :26538   Mean   :   18.78
 3rd Qu.:39807   3rd Qu.:    5.00
 Max.   :53076   Max.   :62642.00

How can you obtain descriptive statistics for f only?

Measures of variation (not particularly interesting here):

> var(rf$f)
[1] 166116.6
> sd(rf$f)
[1] 407.5741

Bare-bones plotting:

> plot(rf$r, rf$f)

How can we plot the first 100 data points only?

A log transform:

> plot(rf$r, log(rf$f))

Let's do something cuter:

> plot(x=rf$r, y=log(rf$f), xlab="rank", ylab="log fq", main="Rank/Frequency Profile")

Two log transforms (the classic Zipf plot):

> plot(log(rf$r), log(rf$f))

This looks like a straight line, kind of...

A simple linear model:

  log(fq) = a + b * log(rank)

a: intercept (the y value where the line crosses the y-axis)
b: slope (how steep the line is)

The values of a and b can be found analytically with the least squares method.
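The steps above can also be tried without the Brown file: a minimal sketch, using a tiny made-up frequency list rather than the actual brown.rf data, that builds the same kind of rank/frequency data frame and runs the same commands (it also answers the "statistics for f only" question above):

```r
## Toy rank/frequency table (invented counts, not brown.rf)
f  <- sort(c(10, 7, 7, 5, 3, 2, 2, 1, 1, 1), decreasing = TRUE)
rf <- data.frame(r = seq_along(f), f = f)

summary(rf$f)                # descriptive statistics for f only
var(rf$f); sd(rf$f)          # measures of variation
plot(log(rf$r), log(rf$f))   # the classic Zipf plot, in miniature
```

With real data you would replace the hand-built data frame with the read.table() call shown above.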
In R:

> lm(log(rf$f) ~ log(rf$r))

Call:
lm(formula = log(rf$f) ~ log(rf$r))

Coefficients:
(Intercept)    log(rf$r)
     13.667       -1.281

> plot(log(rf$r), log(rf$f))
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))

What am I doing with the following steps?

> lm(log(rf$f[1:100]) ~ log(rf$r[1:100]))

Call:
lm(formula = log(rf$f[1:100]) ~ log(rf$r[1:100]))

Coefficients:
     (Intercept)  log(rf$r[1:100])
         11.3482           -0.9522

> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)

Feeding the regression line directly into the plot:

> plot(log(rf$r), log(rf$f))
> abline(lm(log(rf$f[1:100]) ~ log(rf$r[1:100])))

Doing things nicely:

> plot(x=log(rf$r), y=log(rf$f), xlab="log rank", ylab="log fq", main="Log Rank/Log Frequency with Zipf Curves")
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))
> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)

Saving to a file:

> png("zipfplot.png")
> plot(x=log(rf$r), y=log(rf$f), xlab="log rank", ylab="log fq", main="Log Rank/Log Frequency with Zipf Curves")
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))
> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)
> dev.off()

You don't know whether you did it right until you are done: always try it first in interactive mode!

Eventually, you should pick up a bit of par() magic, e.g.:

> par(mfrow=c(3,2), mar=c(4,4,2,2)+0.1, mex=0.8)

(See Dalgaard for an explanation.)

Here we used lm() for plotting, but the main purpose of this function is to compute (simple and multiple) linear regression models.

Multiple regression:

  y = a + b_1*x_1 + b_2*x_2 + b_3*x_3 + ...

Try:

> summary(lm(log(rf$f) ~ log(rf$r)))

Call:
lm(formula = log(rf$f) ~ log(rf$r))

Residuals:
     Min       1Q   Median       3Q      Max
-2.62142 -0.11787  0.06052  0.16528  0.27466

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.6666096  0.0090022    1518   <2e-16 ***
log(rf$r)   -1.2814279  0.0009066   -1413   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.2087 on 53074 degrees of freedom
Multiple R-Squared: 0.9741,     Adjusted R-squared: 0.9741
F-statistic: 1.998e+06 on 1 and 53074 DF,  p-value: < 2.2e-16

(See the comments in Dalgaard.)

Consider multiple regression if you have a continuous dependent variable (or a variable for which it is reasonable to pretend that it is continuous) and you want to study the combined effect on it of several independent variables (which can be both continuous and dichotomous). Logistic regression is a related technique that is more appropriate if the dependent variable is binary.
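As a self-contained variant of the Zipf fits above, the sketch below simulates roughly Zipfian counts (toy data, not the Brown counts; the exponent 1.2 is arbitrary) and shows how coef() and abline() let you reuse the fitted intercept and slope instead of retyping them by hand:

```r
## Simulated, roughly Zipfian counts (toy data, not brown.rf)
set.seed(1)
r <- 1:1000
f <- round(10000 / r^1.2) + rpois(1000, 1)   # power law plus some noise

m <- lm(log(f) ~ log(r))     # the log-log regression
coef(m)                      # fitted intercept and slope

plot(log(r), log(f))
abline(m, lty = 2)           # no need to retype the coefficients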
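A minimal sketch of the multiple-regression case described above, with one continuous and one dichotomous predictor; all variable names and numbers here are invented for illustration, not taken from the notes:

```r
## Toy data: one continuous predictor (x1), one dichotomous (x2)
set.seed(2)
x1 <- rnorm(50)
x2 <- rep(c(0, 1), each = 25)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(50, sd = 0.3)   # true a=1, b1=2, b2=-0.5

summary(lm(y ~ x1 + x2))     # multiple regression: y = a + b1*x1 + b2*x2

## If the dependent variable were binary (0/1), a logistic regression
## would be fit with glm() instead:
##   glm(ybinary ~ x1 + x2, family = binomial)
```

The summary() output has the same shape as for the simple regression above, with one coefficient row per predictor.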