R BASICS

The R project (downloads, documentation, etc.): http://www.r-project.org/

R as an "overgrown" calculator:

> 3+4
> avector <- c(1:10)
> avector
> anothervector <- avector*2
> anothervector

Many similarities with the command line, e.g.:

> ls()
> rm(list=ls())

plus some form of tab completion (although what exactly gets completed seems to change with the operating system) and arrows to navigate the history. Notice, however, that typing a command/function name without parentheses displays its code instead of running the command; e.g., try:

> ls

Documentation:

> ?cor.test
> help.start()

In some cases, you just need a few data points (e.g., the number of verbs and the sizes of sub-corpus A and sub-corpus B) and you can simply cut-and-paste the relevant quantities from another application, or type them into the R window. In many other cases, however, you want to import data from somewhere else (e.g., lists of frequencies and other information generated with cwb-scan-corpus and other tools). There are various data import options (R can even handle data exported from Excel), but most typically your data will be in tab- or space-delimited columns, as in:

13318772 NN1
539787 DTQ
1027535 VBZ
9695780 PUN
427985 PUL

In order to import data into R, you first have to change your working directory to the one where the data file resides. Under Windows, you can select "Change dir..." from the File menu. In order to read in the file, use a command like:

> mytable <- read.table("myfile",col.names=c("col1name","col2name",...))

For example:

> sp <- read.table("pos_dist.spoken.txt",col.names=c("f","p"))

(This file and the other files used in this practice session are available in the shared directory on gollum.)

R on different operating systems has different ways to help you find the data-set. For example, under Windows you can also do

> sp <- read.table(file.choose(),col.names=c("f","p"))

to choose the file through a "browse" window.

If your file already has a "header" (i.e., a first line with names for the columns), the correct importing syntax is:

> mytable <- read.table("myfile",header=TRUE)

E.g.:

> adj <- read.table("adj.table.txt",header=TRUE)

Finally, if the input contains words (not only numbers and simple labels) and it is tab-delimited, read.delim might be a better option:

> mytable <- read.delim("myfile",col.names=c("col1name","col2name"))

(The problem with words is that they might contain symbols, e.g., ', that are extremely confusing for R...)

At this point, you should have imported a table with the distribution of POS tags in the BNC spoken texts, and a table with data on the collocates of the adjectives "strong", "powerful", "blue" and "green". You can take a look at what's inside a table by typing its name (although when the table is big this might not be so informative...). If you want to play with it, you can import the written distribution as well, e.g., into a table named wr.

Simple descriptive statistics:

> summary(sp$f)
> summary(wr$f)

(Notice the use of the dollar sign to refer to a variable inside a data table.)

Other simple functions:

> mean(sp$f)
> var(sp$f)
> sd(sp$f)
> max(sp$f)
> min(sp$f)
> sum(sp$f)

Looking at subsets of the data:

> sp$f[1]
> sp$f[1:5]
> sp$f[sp$p=="NN1"]
> sp$p[sp$f>100000]

What does the following do?

> (sp$f[sp$p=="NN1"]/sum(sp$f))*100

Correlation between co-occurrence vectors:

> cor.test(adj$strong,adj$powerful)

Try the other combinations (variables are: strong, powerful, blue, green). Do the results make sense?
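If you want a quick overview of all the pairwise correlations at once (a shortcut, not part of the walk-through above, and assuming the four collocate columns are named as described and contain only numbers), you can apply the cor function to those columns and get a correlation matrix:

> cor(adj[,c("strong","powerful","blue","green")])

Unlike cor.test, this only gives you the correlation coefficients, without significance tests.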
Plotting:

> plot(adj$blue,adj$green)

Zooming in on the low(er) frequency elements:

> plot(adj$blue,adj$green,xlim=range(0,50),ylim=range(0,50))

Alternatively, "squeeze" the high frequency elements by applying a logarithmic transformation:

> plot(adj$blue,adj$green,log="xy")

You can save this and any other plot by making sure that the plot window is the active one and using the "Save as" option in the File menu (pdf is often a good output format choice). Compare blue/green, powerful/strong, blue/strong, etc...

Just for fun, a perfect correlation:

> cor.test(adj$blue,(4*adj$blue)+3)
> plot(adj$blue,(4*adj$blue)+3)

and a lack of correlation among artificial variables:

> cor.test(rnorm(100),rnorm(100))
> plot(rnorm(100),rnorm(100))

(Try repeating the previous commands multiple times...)

OK, let's go back to productivity, frequency spectra and vocabulary growth plots. I will compare nouns in the female and male portions of the BNC; you can later try other data-sets (for smaller data-sets, also consider the possibility of cleaning the frequency list(s) by hand before generating the spectra, and perhaps comparing "dirty" and "clean" spectra for the same phenomenon).

> fe <- read.table(file.choose(),header=TRUE)
> ma <- read.table(file.choose(),header=TRUE)

(With the previous commands, I am reading in the files nn.f.fq.spc and nn.m.fq.spc that are also available in the shared directory on gollum.)

Frequency spectra:

> plot(fe$m,fe$Vm)

What are we doing in the following plots?

> plot(fe$m,fe$Vm,log="x")
> plot(fe$m[1:15],fe$Vm[1:15])

Zipf's (second) law:

> plot(fe$m,fe$Vm,log="xy")

Adding niceties:

> plot(fe$m,fe$Vm,log="x",xlab="m",ylab="Vm",main="Women's Noun Distribution")

Compare this to the frequency spectrum of the men. How do you compute N from a frequency spectrum? Once you know N, how do you compute P?

Now, we need to import Stefan Evert's functions to generate binomially interpolated values for the vocabulary growth plots (you can download the file wflsplit.R from the shared directory on gollum). First, read wflsplit.R into your session by using the "Source R code" option from the File menu. We can now calculate binomially interpolated V and V1 at arbitrary sample sizes (up to the overall N of the analyzed process). In order to do this, we have to create a vector of sample sizes.

In the case of the women, the overall N is:

> sum(fe$m*fe$Vm)
[1] 2683614

so, let's say that we will compute 1000 sub-Ns at intervals of 2680 tokens each:

> f.sample.sizes<-c((1:1000)*2680)

For the men, the overall N is:

> sum(ma$m*ma$Vm)
[1] 6704193

So, in order to sample at intervals similar to the ones used for the women, we can select 2500 sample sizes at intervals of 2680 tokens:

> m.sample.sizes<-c((1:2500)*2680)

Now, we use the functions binomint.EV and binomint.EVm to obtain vectors with the "expected" values of V and V1 at the relevant sample sizes:

> feV<-binomint.EV(fe,f.sample.sizes)
> feV1<-binomint.EVm(fe,1,f.sample.sizes)
> maV<-binomint.EV(ma,m.sample.sizes)
> maV1<-binomint.EVm(ma,1,m.sample.sizes)

Plotting (if you are going to overlay lines, start with the longest vector!):

> plot(m.sample.sizes,maV,type="l")
> lines(f.sample.sizes,feV,lty=2)
> lines(m.sample.sizes,maV1,lty=3)
> lines(f.sample.sizes,feV1,lty=4)

(If instead of lines you wanted to overlay points, can you guess the name of the command you should have used?)
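With four lines in the same window it is easy to lose track of which is which. One possible (purely cosmetic) addition, sketched here with a placement and labels of my own choosing, is a legend matching the line types used above:

> legend("bottomright",legend=c("men V","women V","men V1","women V1"),lty=1:4)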
Just the VGCs, with a title and x/y labels:

> plot(m.sample.sizes,maV,type="l",xlab="N",ylab="V and V1",main="VGCs of nouns for men and women")
> lines(f.sample.sizes,feV,lty=2)

For other data, you might have to play around with parameters such as xlim, ylim, log...

Compute P for the men at the largest N we have available:

> maV1[2500]/m.sample.sizes[2500]
[1] 0.007949183

Same for the women:

> feV1[1000]/f.sample.sizes[1000]
[1] 0.009422495

P for the men at the largest N of the women:

> maV1[1000]/m.sample.sizes[1000]
[1] 0.01229008

P as a function of N:

> plot(m.sample.sizes,maV1/m.sample.sizes,type="l")
> plot(m.sample.sizes,maV1/m.sample.sizes,log="y",type="l")
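A natural follow-up (not part of the walk-through above, but using only the vectors already computed) is to overlay the women's P curve on the last plot, so the two groups can be compared directly; if the women's curve gets clipped, adjust ylim in the previous plot command first:

> lines(f.sample.sizes,feV1/f.sample.sizes,lty=2)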