COLLECTING FREQUENCY LISTS

There are various ways to collect frequency lists using the cqp tool,
such as the "count" function we have already used (see the cqp
tutorial for more options).

However, the most efficient way to extract frequency data from a
CWB-encoded corpus is via cwb-scan-corpus, which is a command line
program distinct from cqp (recall that CWB is a toolkit).

As with any command line utility, you can use tab completion when
typing the name of this command. Thus, if you type cwb-sc (or any
longer prefix of the name) and then press the tab key, the shell will
complete the command name for you.

Remember that you can always use the up arrow key to recall a
previous command.

As a simple example, you can collect unigram lemma frequencies as
follows:

cwb-scan-corpus -o unigram.fq.txt BNCV4 lemma+0

Notice that the output is not sorted by frequency. You can process the
output data with your favourite program (e.g., a spreadsheet tool), or
directly on the command line.

For example, to sort by decreasing frequency and view the result page
by page:

sort -nrk1 unigram.fq.txt | more

To save to file:

sort -nrk1 unigram.fq.txt > unigram.fq.sorted.txt
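
If only the top of the list is of interest, the sorted output can be
cut off with head. The following is a minimal sketch, assuming that the
frequency is in the first tab-separated column of the list. The file
sample.fq.txt and its counts are made up for illustration, so that the
commands can be tried without access to the corpus:

```shell
# Made-up stand-in for unigram.fq.txt (frequency, TAB, lemma); not real BNC counts.
printf '12\tcat\n250\tthe\n3\taardvark\n97\tbe\n' > sample.fq.txt

# Sort by decreasing frequency and keep only the two most frequent items.
sort -nrk1 sample.fq.txt | head -n 2 > sample.top.txt
cat sample.top.txt
```

With a real list, replace sample.fq.txt with unigram.fq.txt and pick a
larger number after -n.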

Bigram wordforms:

cwb-scan-corpus -o bigram.fq.txt BNCV4 word+0 word+1

Bigram adjective-noun lemma sequences:

cwb-scan-corpus -o adj_noun.fq.txt BNCV4 ?pos+0=/AJ.*/ lemma+0 \
    ?pos+1=/NN.*/ lemma+1
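
With two output keys, each line of the result should contain the
frequency followed by the two lemmas, separated by tabs (keys marked
with ? act only as filters and are not printed). Below is a minimal
sketch of keeping only the pairs that reach a frequency threshold. The
sample file, its counts, and the threshold of 5 are all invented for
illustration:

```shell
# Made-up stand-in for adj_noun.fq.txt (frequency, TAB, adjective, TAB, noun).
printf '7\tprime\tminister\n1\tgreen\tidea\n12\tlong\ttime\n2\told\tbarn\n' > adj_noun.sample.txt

# Keep pairs occurring at least 5 times, most frequent first.
awk -F'\t' '$1 >= 5' adj_noun.sample.txt | sort -nrk1 > adj_noun.top.txt
cat adj_noun.top.txt
```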

Frequency of parts of speech:

cwb-scan-corpus -o pos_dist.txt BNCV4 pos+0

Frequency of parts of speech in spoken and written English:

cwb-scan-corpus -o pos_dist.written.txt BNCV4 pos+0 ?text_mode=/W/

cwb-scan-corpus -o pos_dist.spoken.txt BNCV4 pos+0 ?text_mode=/S/
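
The written part of the BNC is much larger than the spoken part, so
the raw counts in the two files are not directly comparable. One
possible way to compare them is to convert each count into a percentage
of its file's total. The sketch below assumes the frequency is in the
first tab-separated column and the tag in the second. The sample files
and counts are invented, and relfreq is a hypothetical helper name:

```shell
# Made-up stand-ins for pos_dist.written.txt and pos_dist.spoken.txt
# (frequency, TAB, tag); the counts are invented, not real BNC figures.
printf '600\tNN1\n300\tAJ0\n100\tAV0\n' > written.sample.txt
printf '20\tNN1\n10\tAJ0\n20\tAV0\n' > spoken.sample.txt

# Turn raw counts into percentages of the file's total count.
# The file is read twice: the first pass sums the counts, the second prints.
relfreq() {
    awk -F'\t' 'NR == FNR { total += $1; next }
                { printf "%s\t%.1f\n", $2, 100 * $1 / total }' "$1" "$1"
}

relfreq written.sample.txt > written.rel.txt
relfreq spoken.sample.txt > spoken.rel.txt
cat written.rel.txt spoken.rel.txt
```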

Nominal lemmas ending in -ment:

cwb-scan-corpus -o ment.fq.txt BNCV4 lemma+0=/.*ment/ ?pos+0=/NN.*/

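Nominal lemmas ending in -ment, with at least two vowels occurring
before ment (note that this command writes to ment.fq.txt again and so
overwrites the file created by the previous command):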

cwb-scan-corpus -o ment.fq.txt BNCV4 lemma+0=/.*[aeiou].*[aeiou].*ment/ \
    ?pos+0=/NN.*/

The same, but restricting the query to women's texts:

cwb-scan-corpus -o ment.women.fq.txt BNCV4 lemma+0=/.*[aeiou].*[aeiou].*ment/ \
    ?pos+0=/NN.*/ ?text_author_sex=/Female/

Looking for candidate N+N compounds:

cwb-scan-corpus -o nn.txt BNCV4 ?pos+0=/[^N].*/ ?pos+1=/NN.*/ lemma+1 \
    ?pos+2=/NN.*/ lemma+2 ?pos+3=/[^N].*/

Practice: find the most common adverb-adjective sequences in texts by
women and in texts by men.

Finally, for our purposes we use another simple tool to create a
frequency spectrum from a frequency list:

build_frequency_spectrum.pl ment.fq.txt > ment.spc.txt
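
The output format of build_frequency_spectrum.pl is not shown here, so
the following is only a rough sketch of the underlying idea, not a
replacement for the script. A frequency spectrum records, for each
frequency m, how many distinct types occur exactly m times. Assuming a
frequency list with the count in the first tab-separated column, this
can be approximated with awk. The sample file and its counts are made
up for illustration:

```shell
# Made-up stand-in for ment.fq.txt (frequency, TAB, lemma); invented counts.
printf '1\tabridgment\n5\targument\n1\tallurement\n2\tbereavement\n1\tdefacement\n2\tattunement\n' > ment.sample.txt

# For each frequency m (column 1), count how many types have exactly that
# frequency, then sort the result by increasing m.
awk -F'\t' '{ types[$1]++ }
            END { for (m in types) printf "%d\t%d\n", m, types[m] }' \
    ment.sample.txt | sort -n > ment.sample.spc.txt
cat ment.sample.spc.txt
```

In the sample, three types occur once, two occur twice, and one occurs
five times, so the result has three lines.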