*********************
*                   *
* A Quick CQP Guide *
*                   *
*********************

The CWB web-page:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

Soon to be released on SourceForge:
http://cwb.sourceforge.net/

Stefan Evert's CQP tutorial:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf

Some Web-based interfaces:

Serge Sharoff's Internet Corpora:
http://corpus.leeds.ac.uk/internet.html

CucWeb:
http://ramsesii.upf.es/cgi-bin/cucweb/search-form.pl?lang=en_US

SSLMITDev:
http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php


Invoke cqp (press the enter/return key after this and all other
commands):

    cqp -e

or, to get colour highlighting as well:

    cqp -eC

Exit from cqp:

    exit

(If you just did that, please enter again!)

List the available corpora:

    show corpora;

While in cqp, keep in mind that some things work as on the Unix
terminal. In particular, you can recall previous commands with the
upwards-pointing arrow, and you navigate the kwic results with
more/less-like syntax (space to move to the next page, q to quit,
etc.)

Select a corpus (remember the semi-colon at the end of each command),
e.g.:

    BNCV4;
    PSYCHIATRY-EN;
    WEB-EN;

etc.

A quick way to find out how many tokens there are in a corpus:

    info;

Simple kwic:

    "food";
    "food" %c;
    "good" "food";

If you have problems seeing accented characters (as in vowels with
umlaut in German or with accents in Italian and Spanish), try:

    set Pager more;

You can move through the kwic results as in a standard Unix pager:
space to see the next page, b to go back one page, q to exit the kwic
display.

Whenever q does not work, use ctrl+C to interrupt any command.

To see the frequency of occurrence of your last query:

    size Last;

Order results by left and right context (funny syntax because these
are "macros" written by Stefan Evert):

    "food";
    /sort_left[];
    /sort_right[];

If you have too many results, it is a good idea to take a look at a
random sample...
First, "save" the query into a variable:

    A = "often";

Then, "reduce" A to the desired number of randomly selected contexts,
e.g.:

    reduce A to 20;

Finally, take a look at these contexts:

    cat A;

Change context size:

    set Context 60;
    set Context 5 words;
    set Context s;
    set Context 3 s;
    set Context default;

Other visualization options:

    show +pos;
    show +lemma;
    show -pos -lemma;
    show -cpos;
    set PrintStructures text_domain;
    set PrintStructures "";

Doing queries using morphosyntactic annotation (if you've been
experimenting with show and set, now is a good moment to go back to a
normal-looking kwic display):

    [word = "obsessive"] [pos = "NN.*"];
    [word = "obsessive" %c] [pos = "NN.*"];
    [word = "cause"];
    [lemma = "cause"];
    [lemma = "cause" & pos = "VV.*"];

The BNC tagset:
http://sslmit.unibo.it/~baroni/collocazioni/bnctagset.txt

The WEB-EN (i.e., standard TreeTagger) tagset:
http://sslmit.unibo.it/~baroni/collocazioni/english.tt.tagset

    [pos = "AJ.*"] [pos = "NN.*"];

For a query like the latter, it is often more meaningful to look at
frequency lists:

    [pos = "AJ.*"] [pos = "NN.*"];
    count by word %c;

Or, more cleanly:

    A = [pos = "AJ.*"] [pos = "NN.*"];
    count A by word %c;

(What happens if you count by lemma?)

A frequency list for a collocate extracted from a "flexible" context:

    [lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"];
    count by lemma on matchend;

If you want to avoid compound modifiers (no need to try this now):

    [lemma = "cause" & pos = "VV.*"][pos="AT0"]?[pos="AJ.*"]*[pos="NN.*"] [pos!="NN.*"];
    count by lemma on matchend[-1];

Words ending in -dom, and their frequency:

    A = [word = ".*dom" & pos = "NN.*"];
    count A by lemma;

You can also save the results to an output file:

    cat > "myconc.txt";
    count by word %c > "myfqlist.txt";

Practice: look at the nominal collocates of "strong" and "powerful";
save them to two separate files, and compare.
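
For the "compare" step of the practice above, a small script outside
cqp can help. What follows is only a minimal Python sketch, not part of
CQP itself. It assumes that each line of a saved frequency list starts
with the frequency, followed by the counted item and possibly a
trailing [#start-#end] range; check your own output files and adjust
the pattern if they look different. The function names and the file
names mentioned below are hypothetical.

```python
import re

# Assumed line shape: "<frequency> <item>  [#start-#end]" (range optional).
LINE = re.compile(r"\s*(\d+)\s+(.*?)\s*(\[#\d+-#\d+\])?\s*$")


def parse_freq_list(text):
    """Turn the text of a saved frequency list into {item: frequency}."""
    freqs = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(2):  # skip blank or unparseable lines
            item = m.group(2)
            freqs[item] = freqs.get(item, 0) + int(m.group(1))
    return freqs


def compare(freqs_a, freqs_b):
    """Return (items only in A, shared items, items only in B).

    Each list is sorted by descending frequency, then alphabetically.
    """
    only_a = sorted(set(freqs_a) - set(freqs_b),
                    key=lambda w: (-freqs_a[w], w))
    only_b = sorted(set(freqs_b) - set(freqs_a),
                    key=lambda w: (-freqs_b[w], w))
    shared = sorted(set(freqs_a) & set(freqs_b),
                    key=lambda w: (-(freqs_a[w] + freqs_b[w]), w))
    return only_a, shared, only_b
```

To use it, read the two files you saved from cqp (say, "strong.txt"
and "powerful.txt") with open(...).read(), pass each text to
parse_freq_list, and hand the two resulting dictionaries to compare.
The nouns that turn up with only one of the two adjectives are usually
the most interesting ones to look at.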
Spans and structural constraints:

    "as" []{1,3} "as" within s;

Some of the BNC structural attributes, with their values:

text_domain:
    S_Demog_AB, S_Demog_C1, S_Demog_C2, S_Demog_DE,
    S_Demog_Unclassified, S_cg_business, S_cg_education,
    S_cg_leisure, S_cg_public_instit, W_app_science, W_arts,
    W_belief_thought, W_commerce, W_imaginative, W_leisure,
    W_nat_science, W_soc_science, W_world_affairs

text_mode:
    S, W

text_author_sex:
    ---, Female, Male, Mixed, Unknown

text_interaction_type:
    ---, Dialogue, Monologue

The word "opportunist" used by women and men:

    [lemma="opportunist"] :: match.text_author_sex="Female";
    [lemma="opportunist"] :: match.text_author_sex="Male";

Practice: find the favorite nouns (lemmas) of men and women.