                            *********************
                            *                   *
                            * A Quick CQP Guide *
                            *                   *
                            *********************

The CWB web-page:

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

Soon to be released on SourceForge:

http://cwb.sourceforge.net/


Stefan Evert's CQP tutorial:

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf


Some Web-based interfaces:

Serge Sharoff's Internet Corpora:
http://corpus.leeds.ac.uk/internet.html
CucWeb:
http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US
SSLMITDev:
http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php


Invoke cqp with command-line editing (-e), optionally with colour
highlighting as well (-C); press the enter/return key after this and
all other commands:

cqp -e
cqp -eC

Exit from cqp:

exit;

(If you just did that, please start cqp again!)

List the available corpora:

show corpora;

While in cqp, keep in mind that some things work as they do on a Unix
terminal -- in particular, you can recall previous commands with the
up-arrow key, and you navigate the kwic results with more/less-like
keys (space to move to the next page, q to quit, etc.).

Select corpus (remember the semi-colon at the end of each command),
e.g.:

BNCV4;
PSYCHIATRY-EN;
WEB-EN;

etc.

A quick way to find out how many tokens there are in a corpus:

info;

Simple kwic (%c makes the match case-insensitive):

"food";
"food" %c;
"good" "food";

If you have problems seeing accented characters (as in vowels with
umlaut in German or with accents in Italian and Spanish), try:

set Pager more;

You can move through the kwic results as in a standard Unix pager:
space to see the next page, b to go back one page, q to exit the kwic
display.

Whenever q does not work, use ctrl+C to interrupt any command.

To see how many matches your last query returned:

size Last;

Order results by left and right context (the syntax looks unusual
because these are "macros" written by Stefan Evert):

"food";

/sort_left[];
/sort_right[];

If you have too many results, it is a good idea to take a look at a
random sample...

First, "save" query into a variable:

A = "often";

Then, "reduce" A to the desired number of randomly selected contexts, e.g.:

reduce A to 20;

Finally, take a look at these contexts:

cat A;
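
To check that the sample really has the expected size, the steps above
can be combined with the size command shown earlier. This is only a
sketch; it relies on size accepting a named query such as A in the
same way it accepts Last:

```
A = "often";
size A;
reduce A to 20;
size A;
cat A;
```

Note that reduce changes A itself, so the second size should report 20
(or fewer, if the query had fewer than 20 matches to begin with).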

Change context size (a bare number counts characters; s stands for
sentences):

set Context 60;
set Context 5 words;
set Context s;
set Context 3 s;
set Context default;

Other display options (+ switches an attribute on, - switches it off):

show +pos;
show +lemma;
show -pos -lemma;
show -cpos;
set PrintStructures text_domain;
set PrintStructures "";

Doing queries using morphosyntactic annotation (if you've been
experimenting with show and set, now is a good moment to go back to
a normal-looking kwic display):

[word = "obsessive"] [pos = "NN.*"]; 
[word = "obsessive" %c] [pos = "NN.*"];

[word = "cause"];
[lemma = "cause"];
[lemma = "cause" & pos = "VV.*"];

The BNC tagset:
http://sslmit.unibo.it/~baroni/collocazioni/bnctagset.txt

The WEB-EN (i.e., standard TreeTagger) tagset:
http://sslmit.unibo.it/~baroni/collocazioni/english.tt.tagset

Adjective + noun sequences (using the BNC tags):

[pos = "AJ.*"] [pos = "NN.*"];

For a query like the last one, it is often more meaningful to look at
frequency lists:

[pos = "AJ.*"] [pos = "NN.*"];
count by word %c; 

Or, more cleanly:

A = [pos = "AJ.*"] [pos = "NN.*"];
count A by word %c; 

(What happens if you count by lemma?)
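
One way to find out is to rerun the count on the same named query with
lemma in place of word. A minimal sketch, assuming the corpus has a
lemma attribute (as the show +lemma example above suggests):

```
A = [pos = "AJ.*"] [pos = "NN.*"];
count A by lemma;
```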

A frequency list for a collocate extracted from a "flexible" context
(matchend is the last token of each match, here the noun):

[lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"];
count by lemma on matchend;

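If you want to avoid counting nouns that merely modify a following
noun, require a non-noun after the noun and count on the token just
before the end of the match (no need to try this now):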

[lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"]
[pos!="NN.*"];
count by lemma on matchend[-1];

Words ending in -dom, and their frequency:

A = [word = ".*dom" & pos = "NN.*"];
count A by lemma;
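
The same pattern works for other suffixes. As a sketch, here is the
equivalent query for nouns ending in -ness; the query name B is
arbitrary, and the NN.* tags are assumed to mark nouns as in the
examples above:

```
B = [word = ".*ness" & pos = "NN.*"];
count B by lemma;
```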

You can also save the results to an output file:

cat > "myconc.txt";
count by word %c > "myfqlist.txt";

Practice: look at the nominal collocates of "strong" and "powerful";
save them to two separate files, and compare.
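
A possible starting point for this exercise, using only commands
introduced above. The query names (A, B) and the file names are
placeholders, and the pattern only catches nouns that immediately
follow the adjective:

```
A = [lemma = "strong"] [pos = "NN.*"];
count A by lemma on matchend > "strong.txt";

B = [lemma = "powerful"] [pos = "NN.*"];
count B by lemma on matchend > "powerful.txt";
```

The two files can then be compared side by side outside cqp.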

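Spans and structural constraints ([]{1,3} matches one to three
arbitrary tokens; "within s" keeps the whole match inside a single
sentence):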

"as" []{1,3} "as" within s;

Some of the BNC structural attributes, with their values:

text_domain: S_Demog_AB, S_Demog_C1, S_Demog_C2, S_Demog_DE,
S_Demog_Unclassified, S_cg_business, S_cg_education, S_cg_leisure,
S_cg_public_instit, W_app_science, W_arts, W_belief_thought,
W_commerce, W_imaginative, W_leisure, W_nat_science, W_soc_science,
W_world_affairs

text_mode: S, W

text_author_sex: ---, Female, Male, Mixed, Unknown

text_interaction_type: ---, Dialogue, Monologue

The word "opportunist" used by women and men:

[lemma="opportunist"] :: match.text_author_sex="Female";
[lemma="opportunist"] :: match.text_author_sex="Male";
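
The other structural attributes listed above can be used in the same
way. A sketch, assuming the BNC attribute values given earlier, that
checks how often the verb "cause" occurs in natural science writing
and in spoken texts (the query names A and B are arbitrary):

```
A = [lemma = "cause" & pos = "VV.*"] :: match.text_domain = "W_nat_science";
size A;

B = [lemma = "cause" & pos = "VV.*"] :: match.text_mode = "S";
size B;
```

Keep in mind that these are raw counts: the two parts of the corpus
differ in size, so the numbers are not directly comparable.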

Practice: find the favorite nouns (lemmas) of men and women.
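
One possible way to start, combining the constraint shown above with a
frequency count. The query names F and M are arbitrary; on a corpus
the size of the BNC these queries match a very large number of tokens,
so they may take a while:

```
F = [pos = "NN.*"] :: match.text_author_sex = "Female";
count F by lemma;

M = [pos = "NN.*"] :: match.text_author_sex = "Male";
count M by lemma;
```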