R BASICS

The R project (downloads, documentation, etc.): http://www.r-project.org/

R as an "overgrown" calculator:

> 3+4
> avector <- c(1:10)
> avector
> anothervector <- avector*2
> anothervector

Many similarities with the command line, e.g.:

> ls()
> rm(list=ls())

plus some form of tab completion (although what exactly gets completed seems to change with the operating system) and arrows to navigate the history. Notice, however, that typing a command/function name without parentheses displays its code instead of running the command; e.g., try:

> ls

Documentation:

> ?cor.test
> help.start()

In some cases, you just need a few data points (e.g., the number of verbs and the sizes of sub-corpus A and sub-corpus B) and you can simply cut-and-paste the relevant quantities from another application, or type them into the R window. In many other cases, however, you want to import data from somewhere else (e.g., lists of frequencies and other information generated with cwb-scan-corpus and other tools). There are various data import options (R can even handle data exported from Excel), but most typically your data will be in tab- or space-delimited columns, as in:

13318772 NN1
539787 DTQ
1027535 VBZ
9695780 PUN
427985 PUL

In order to import data into R, you first have to change your working directory to the one where the data file resides. Under Windows, you can select "Change dir..." from the File menu. In order to read in the file, use a command like:

> mytable <- read.table("myfile",col.names=c("col1name","col2name",...))

For example:

> sp <- read.table("pos_dist.spoken.txt",col.names=c("f","p"))

(This file and the other files used in this practice session are available in the shared directory on gollum.)

R on different operating systems has different ways to help you find the data-set. For example, under Windows you can also do

> sp <- read.table(file.choose(),col.names=c("f","p"))

to choose the file through a "browse" window.

If your file already has a "header" (i.e., a first line with names for the columns), the correct importing syntax is:

> mytable <- read.table("myfile",header=TRUE)

E.g.:

> adj <- read.table("adj.table.txt",header=TRUE)

Finally, if the input contains words (not only numbers and simple labels) and it is tab-delimited, read.delim might be a better option:

> mytable <- read.delim("myfile",col.names=c("col1name","col2name"))

(The problem with words is that they might contain symbols, e.g., ', that are extremely confusing for R...)

At this point, you should have imported a table with the distribution of POS tags in the BNC spoken texts, and a table with data on the collocates of the adjectives "strong", "powerful", "blue" and "green". You can take a look at what's inside a table by typing its name (although when the table is big this might not be so informative...). If you want to play with it, you can import the written distribution as well, e.g., into a table named wr.

Simple descriptive statistics:

> summary(sp$f)
> summary(wr$f)

(Notice the use of the dollar sign to refer to a variable inside a data table.)

Other simple functions:

> mean(sp$f)
> var(sp$f)
> sd(sp$f)
> max(sp$f)
> min(sp$f)
> sum(sp$f)

Looking at subsets of the data:

> sp$f[1]
> sp$f[1:5]
> sp$f[sp$p=="NN1"]
> sp$p[sp$f>100000]

What does the following do?

> (sp$f[sp$p=="NN1"]/sum(sp$f))*100

Correlation between co-occurrence vectors:

> cor.test(adj$strong,adj$powerful)

Try the other combinations (variables are: strong, powerful, blue, green). Do the results make sense?
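If you want a quick overview of all the pairwise correlations at once (a shortcut, not part of the walk-through above, and assuming the four collocate columns are named as described and contain only numbers), you can apply the cor function to those columns and get a correlation matrix:

> cor(adj[,c("strong","powerful","blue","green")])

Unlike cor.test, this only gives you the correlation coefficients, without significance tests.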
Plotting:

> plot(adj$blue,adj$green)

Zooming in on the low(er) frequency elements:

> plot(adj$blue,adj$green,xlim=range(0,50),ylim=range(0,50))

Alternatively, "squeeze" the high frequency elements by applying a logarithmic transformation:

> plot(adj$blue,adj$green,log="xy")

You can save this and any other plot by making sure that the plot window is the active one and using the "Save as" option in the File menu (pdf is often a good output format choice). Compare blue/green, powerful/strong, blue/strong, etc...

Just for fun, a perfect correlation:

> cor.test(adj$blue,(4*adj$blue)+3)
> plot(adj$blue,(4*adj$blue)+3)

and a lack of correlation among artificial variables:

> cor.test(rnorm(100),rnorm(100))
> plot(rnorm(100),rnorm(100))

(Try repeating the previous commands multiple times...)

OK, let's go back to productivity, frequency spectra and vocabulary growth plots. I will compare nouns in the female and male portions of the BNC; you can later try other data-sets (for smaller data-sets, also consider the possibility of cleaning the frequency list(s) by hand before generating the spectra, and perhaps comparing "dirty" and "clean" spectra for the same phenomenon).

> fe <- read.table(file.choose(),header=TRUE)
> ma <- read.table(file.choose(),header=TRUE)

(With the previous commands, I am reading in the files nn.f.fq.spc and nn.m.fq.spc that are also available in the shared directory on gollum.)

Frequency spectra:

> plot(fe$m,fe$Vm)

What are we doing in the following plots?

> plot(fe$m,fe$Vm,log="x")
> plot(fe$m[1:15],fe$Vm[1:15])

Zipf's (second) law:

> plot(fe$m,fe$Vm,log="xy")

Adding niceties:

> plot(fe$m,fe$Vm,log="x",xlab="m",ylab="Vm",main="Women's Noun Distribution")

Compare this to the frequency spectrum of the men. How do you compute N from a frequency spectrum? Once you know N, how do you compute P?

Now, we need to import Stefan Evert's functions to generate binomially interpolated values for the vocabulary growth plots (you can download the file wflsplit.R from the shared directory on gollum). First, read wflsplit.R into your session by using the "Source R code" option from the File menu. We can now calculate binomially interpolated V and V1 at arbitrary sample sizes (up to the overall N of the analyzed process). In order to do this, we have to create a vector of sample sizes.

In the case of the women, the overall N is:

> sum(fe$m*fe$Vm)
[1] 2683614

so, let's say that we will compute 1000 sub-Ns at intervals of 2680 tokens each:

> f.sample.sizes<-c((1:1000)*2680)

For the men, the overall N is:

> sum(ma$m*ma$Vm)
[1] 6704193

So, in order to sample at intervals similar to the ones used for the women, we can select 2500 sample sizes at intervals of 2680 tokens:

> m.sample.sizes<-c((1:2500)*2680)

Now, we use the functions binomint.EV and binomint.EVm to obtain vectors with the "expected" values of V and V1 at the relevant sample sizes:

> feV<-binomint.EV(fe,f.sample.sizes)
> feV1<-binomint.EVm(fe,1,f.sample.sizes)
> maV<-binomint.EV(ma,m.sample.sizes)
> maV1<-binomint.EVm(ma,1,m.sample.sizes)

Plotting (if you are going to overlay lines, start with the longest vector!):

> plot(m.sample.sizes,maV,type="l")
> lines(f.sample.sizes,feV,lty=2)
> lines(m.sample.sizes,maV1,lty=3)
> lines(f.sample.sizes,feV1,lty=4)

(If instead of lines you wanted to overlay points, can you guess the name of the command you should have used?)
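With four lines in the same window it is easy to lose track of which is which. One possible (purely cosmetic) addition, sketched here with a placement and labels of my own choosing, is a legend matching the line types used above:

> legend("bottomright",legend=c("men V","women V","men V1","women V1"),lty=1:4)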
Just the VGCs, with a title and x/y labels:

> plot(m.sample.sizes,maV,type="l",xlab="N",ylab="V and V1",main="VGCs of nouns for men and women")
> lines(f.sample.sizes,feV,lty=2)

For other data, you might have to play around with parameters such as xlim, ylim, log...

Compute P for the men at the largest N we have available:

> maV1[2500]/m.sample.sizes[2500]
[1] 0.007949183

Same for the women:

> feV1[1000]/f.sample.sizes[1000]
[1] 0.009422495

P for the men at the largest N of the women:

> maV1[1000]/m.sample.sizes[1000]
[1] 0.01229008

P as a function of N:

> plot(m.sample.sizes,maV1/m.sample.sizes,type="l")
> plot(m.sample.sizes,maV1/m.sample.sizes,log="y",type="l")
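A natural follow-up (not part of the walk-through above, but using only the vectors already computed) is to overlay the women's P curve on the last plot, so the two groups can be compared directly; if the women's curve gets clipped, adjust ylim in the previous plot command first:

> lines(f.sample.sizes,feV1/f.sample.sizes,lty=2)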