COLLECTING FREQUENCY LISTS

There are various ways to collect frequency lists using the cqp tool, such as the "count" function we have already used (see the cqp tutorial for more options). However, the most efficient way to extract frequency data from a CWB-encoded corpus is via cwb-scan-corpus, a command line program distinct from cqp (recall that CWB is a toolkit).

As with any command line utility, you can use tab completion when typing the name of this command: if you type cwb-sc (or any longer prefix of the name) and then press the tab key, the terminal will complete the command name for you. Remember that you can always press the up-arrow key to recall a previous command.

As a simple example, you can collect unigram lemma frequencies as follows:

    cwb-scan-corpus -o unigram.fq.txt BNCV4 lemma+0

Notice that the output is not ordered by frequency. You can manage the output data with your favourite program (e.g., a spreadsheet tool), or directly on the command line. E.g., to sort by decreasing frequency and view the result:

    sort -nrk1 unigram.fq.txt | more

To save the sorted list to a file:

    sort -nrk1 unigram.fq.txt > unigram.fq.sorted.txt

Bigram wordforms:

    cwb-scan-corpus -o bigram.fq.txt BNCV4 word+0 word+1

In the remaining commands, a key prefixed with ? acts only as a constraint on the match and is not included in the output. Keys that contain a pattern are enclosed in single quotes so that the shell does not try to interpret characters such as ?, * and [ as file name wildcards.

Bigram adjective-noun lemma sequences:

    cwb-scan-corpus -o adj_noun.fq.txt BNCV4 '?pos+0=/AJ.*/' lemma+0 '?pos+1=/NN.*/' lemma+1

Frequency of parts of speech:

    cwb-scan-corpus -o pos_dist.txt BNCV4 pos+0

Frequency of parts of speech in written and spoken English:

    cwb-scan-corpus -o pos_dist.written.txt BNCV4 pos+0 '?text_mode=/W/'
    cwb-scan-corpus -o pos_dist.spoken.txt BNCV4 pos+0 '?text_mode=/S/'

Nominal lemmas ending in -ment:

    cwb-scan-corpus -o ment.fq.txt BNCV4 'lemma+0=/.*ment/' '?pos+0=/NN.*/'

Nominal lemmas ending in -ment, with at least two vowels occurring before ment:

    cwb-scan-corpus -o ment.fq.txt BNCV4 'lemma+0=/.*[aeiou].*[aeiou].*ment/' '?pos+0=/NN.*/'

The same, but restricting the query to women's texts:

    cwb-scan-corpus -o ment.women.fq.txt BNCV4 'lemma+0=/.*[aeiou].*[aeiou].*ment/' '?pos+0=/NN.*/' '?text_author_sex=/Female/'

Looking for candidate N+N compounds:

    cwb-scan-corpus -o nn.txt BNCV4 '?pos+0=/[^N].*/' '?pos+1=/NN.*/' lemma+1 '?pos+2=/NN.*/' lemma+2 '?pos+3=/[^N].*/'

Practice: find the most common adverb-adjective sequences in the texts by women and in the texts by men.

Finally, for our purposes we use another simple tool to create a frequency spectrum from a frequency list:

    build_frequency_spectrum.pl ment.fq.txt > ment.spc.txt
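To make clear what a frequency spectrum is, here is a minimal sketch using only standard tools. It is not a replacement for build_frequency_spectrum.pl, whose exact output format is not shown here and may well differ. The sketch assumes a frequency list with the frequency in the first tab-separated column (the layout that sort -nrk1 above relies on); the file names sample.fq.txt and sample.spc.txt and the counts in them are invented for illustration and are not BNC figures. For each frequency m it prints the number of distinct types that occur exactly m times.

```shell
# Hypothetical sample frequency list: frequency, tab, lemma.
# (Invented counts for illustration only.)
printf '12\tgovernment\n3\tpayment\n1\tabutment\n1\tbewilderment\n3\ttreatment\n' > sample.fq.txt

# Count how many types share each frequency m (column 1),
# then sort the resulting (m, number of types) pairs by increasing m.
awk -F'\t' '{ types[$1]++ } END { for (m in types) print m "\t" types[m] }' sample.fq.txt \
  | sort -n > sample.spc.txt

# View the result: one line per frequency class.
cat sample.spc.txt
```

With this sample, two lemmas occur once, two occur three times, and one occurs twelve times, so the result has three lines. Replacing sample.fq.txt with a real list such as ment.fq.txt should work in the same way, provided the file has the assumed layout.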