R BASICS: FORMATS, PLOTS AND ZIPF'S LAW

Basically, R likes data in tab-delimited columns with a header (but other formats, e.g. comma-delimited text from Excel, should also work):

r       f
1       62642
2       35971
3       27831
4       25608
5       21883
...     ...

On the command line, start R from the directory where the data live; in the GUI, move to that directory via the menu commands.

R has many similarities with the command line, e.g.:

> ls()
> rm(list=ls())

plus some form of tab completion (at least for names of files in the current directory), and arrow keys to navigate the history. However, typing a command/function without brackets displays its source code instead of running it; e.g., try:

> ls

Documentation:

> help(cor)
> ?cor
> apropos("cor")
> help.start()

The most common way to read data in (tab completion should work when typing the file name):

> rf <- read.table("brown.rf", header=TRUE)

Simply typing an "object" shows its contents:

> rf

(Not necessarily a good idea ;-) )

Accessing data points inside the objects:

> rf$f[1:10]

Descriptive statistics:

> summary(rf)
       r               f
 Min.   :    1   Min.   :    1.00
 1st Qu.:13270   1st Qu.:    1.00
 Median :26538   Median :    2.00
 Mean   :26538   Mean   :   18.78
 3rd Qu.:39807   3rd Qu.:    5.00
 Max.   :53076   Max.   :62642.00

How can you obtain descriptive statistics for f only?

Measures of variation (not particularly interesting here):

> var(rf$f)
[1] 166116.6
> sd(rf$f)
[1] 407.5741

Bare-bones plotting:

> plot(rf$r, rf$f)

How can we plot the first 100 data points only?

A log transform:

> plot(rf$r, log(rf$f))

Let's do something cuter:

> plot(x=rf$r, y=log(rf$f), xlab="rank", ylab="log fq", main="Rank/Frequency Profile")

Two log transforms (the classic Zipf plot):

> plot(log(rf$r), log(rf$f))

This looks like a straight line, kind of...

A simple linear model:

  log(fq) = a + b * log(rank)

a: intercept (the y value where the line crosses the y-axis)
b: slope (how steep the line is)

The values of a and b can be found analytically with the least squares method.
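The steps above can also be tried without the Brown file: a minimal sketch, using a tiny made-up frequency list rather than the actual brown.rf data, that builds the same kind of rank/frequency data frame and runs the same commands (it also answers the "statistics for f only" question above):

```r
## Toy rank/frequency table (invented counts, not brown.rf)
f  <- sort(c(10, 7, 7, 5, 3, 2, 2, 1, 1, 1), decreasing = TRUE)
rf <- data.frame(r = seq_along(f), f = f)

summary(rf$f)                # descriptive statistics for f only
var(rf$f); sd(rf$f)          # measures of variation
plot(log(rf$r), log(rf$f))   # the classic Zipf plot, in miniature
```

With real data you would replace the hand-built data frame with the read.table() call shown above.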
In R:

> lm(log(rf$f) ~ log(rf$r))

Call:
lm(formula = log(rf$f) ~ log(rf$r))

Coefficients:
(Intercept)    log(rf$r)
     13.667       -1.281

> plot(log(rf$r), log(rf$f))
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))

What am I doing with the following steps?

> lm(log(rf$f[1:100]) ~ log(rf$r[1:100]))

Call:
lm(formula = log(rf$f[1:100]) ~ log(rf$r[1:100]))

Coefficients:
     (Intercept)  log(rf$r[1:100])
         11.3482           -0.9522

> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)

Feeding the regression line directly into the plot:

> plot(log(rf$r), log(rf$f))
> abline(lm(log(rf$f[1:100]) ~ log(rf$r[1:100])))

Doing things nicely:

> plot(x=log(rf$r), y=log(rf$f), xlab="log rank", ylab="log fq", main="Log Rank/Log Frequency with Zipf Curves")
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))
> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)

Saving to a file:

> png("zipfplot.png")
> plot(x=log(rf$r), y=log(rf$f), xlab="log rank", ylab="log fq", main="Log Rank/Log Frequency with Zipf Curves")
> lines(x=log(rf$r), y=13.667 - 1.281*log(rf$r))
> lines(x=log(rf$r), y=11.3482 - 0.9522*log(rf$r), lty=2)
> dev.off()

You don't know whether you did it right until you are done: always try it first in interactive mode!

Eventually, you should pick up a bit of par() magic, e.g.:

> par(mfrow=c(3,2), mar=c(4,4,2,2)+0.1, mex=0.8)

(See Dalgaard for an explanation.)

Here we used lm() for plotting, but the main purpose of this function is to compute (simple and multiple) linear regression models.

Multiple regression:

  y = a + b_1*x_1 + b_2*x_2 + b_3*x_3 + ...

Try:

> summary(lm(log(rf$f) ~ log(rf$r)))

Call:
lm(formula = log(rf$f) ~ log(rf$r))

Residuals:
     Min       1Q   Median       3Q      Max
-2.62142 -0.11787  0.06052  0.16528  0.27466

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.6666096  0.0090022    1518   <2e-16 ***
log(rf$r)   -1.2814279  0.0009066   -1413   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.2087 on 53074 degrees of freedom
Multiple R-Squared: 0.9741,     Adjusted R-squared: 0.9741
F-statistic: 1.998e+06 on 1 and 53074 DF,  p-value: < 2.2e-16

(See the comments in Dalgaard.)

Consider multiple regression if you have a continuous dependent variable (or a variable for which it is reasonable to pretend that it is continuous) and you want to study the combined effect on it of several independent variables (which can be both continuous and dichotomous). Logistic regression is a related technique that is more appropriate if the dependent variable is binary.
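As a self-contained variant of the Zipf fits above, the sketch below simulates roughly Zipfian counts (toy data, not the Brown counts; the exponent 1.2 is arbitrary) and shows how coef() and abline() let you reuse the fitted intercept and slope instead of retyping them by hand:

```r
## Simulated, roughly Zipfian counts (toy data, not brown.rf)
set.seed(1)
r <- 1:1000
f <- round(10000 / r^1.2) + rpois(1000, 1)   # power law plus some noise

m <- lm(log(f) ~ log(r))     # the log-log regression
coef(m)                      # fitted intercept and slope

plot(log(r), log(f))
abline(m, lty = 2)           # no need to retype the coefficients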
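A minimal sketch of the multiple-regression case described above, with one continuous and one dichotomous predictor; all variable names and numbers here are invented for illustration, not taken from the notes:

```r
## Toy data: one continuous predictor (x1), one dichotomous (x2)
set.seed(2)
x1 <- rnorm(50)
x2 <- rep(c(0, 1), each = 25)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(50, sd = 0.3)   # true a=1, b1=2, b2=-0.5

summary(lm(y ~ x1 + x2))     # multiple regression: y = a + b1*x1 + b2*x2

## If the dependent variable were binary (0/1), a logistic regression
## would be fit with glm() instead:
##   glm(ybinary ~ x1 + x2, family = binomial)
```

The summary() output has the same shape as for the simple regression above, with one coefficient row per predictor.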