*********************
*                   *
* A Quick CQP Guide *
*                   *
*********************

The CWB web-page (soon to be on SourceForge!):
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

Stefan Evert's CQP tutorial:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/cqp-tutorial.pdf

Invoke cqp:

  cqp -e
  cqp -eC

Exit from cqp:

  exit

(If you just did that, please enter again!)

List the available corpora:

  show corpora;

While in cqp, keep in mind that some things work like on the Unix terminal
-- in particular, you can recall previous commands with the upwards-pointing
arrow, and you navigate the kwic results with more/less-like syntax (space to
move to the next page, q to quit, etc.)

Select a corpus (remember the semi-colon at the end of each command), e.g.:

  PSYCHIATRY-EN;
  WEB-EN;

etc.

A quick way to find out how many tokens are in a corpus:

  info;

Simple kwic:

  "compulsive";
  "compulsive" %c;
  "obsessive" "compulsive" "disorder";

If you have problems seeing accented characters (as in vowels with umlaut in
German or with accents in Italian), try:

  set Pager more;

To see the frequency of occurrence of your last query:

  size Last;

Order results by left and right context (funny syntax because these are
"macros" written by Stefan Evert):

  "compulsive";
  /sort_left[];
  /sort_right[];

If you have too many results, it is a good idea to take a look at a random
sample...
First, "save" the query into a variable:

  A = "often";

Then, "reduce" A to the desired number of randomly selected contexts, e.g.:

  reduce A to 20;

Finally, take a look at these contexts:

  cat A;

Change context size:

  set Context 60;
  set Context 5 words;
  set Context s;
  set Context 3 s;
  set Context default;

Other visualization options:

  show +pos;
  show +lem;
  show -pos -lem;
  show -cpos;
  set PrintStructures "text_id";
  set PrintStructures "";

Doing queries using morphosyntactic annotation (if you've been experimenting
with show and set, now is a good moment to go back to a normal-looking
kwic display...):

  [word = "obsessive"] [pos = "NN.*"];
  [word = "obsessive" %c] [pos = "NN.*"];
  [word = "cause"];
  [lem = "cause"];
  [lem = "cause" & pos = "V.*"];
  [pos = "JJ"] [pos = "NN.*"];

For a query like the last one, it is often more meaningful to look at
frequency lists:

  [pos = "JJ"] [pos = "NN.*"];
  count by word %c;

A frequency list for a collocate extracted from a "flexible" context:

  [lem = "cause" & pos = "V.*"] [pos = "DT"]? [pos = "JJ"]* [pos = "NN.*"];
  count by lem %c on matchend;

You can also save the results to an output file:

  cat > "myconc.txt";
  count by word %c > "myfqlist.txt";

Rather advanced, but very useful: construct a frequency list of collocations
from a "flexible" context, e.g., all noun/verb pairs with optionally one
article/determiner and zero or more adjectives in the middle (it is not
strictly necessary to save the query to variable A, but it is handy since we
don't care about seeing the interim kwics):

  A = [pos = "VV.*"] [pos = "DT"]? [pos = "JJ"]* [pos = "NN.*"];
  tabulate A match lem, matchend lem > "pairs.txt";

Now the external file pairs.txt contains all tab-delimited pairs of shape V-N
extracted by the previous query, without the elements in the middle (so both
"meeting deadlines" and "meet a difficult deadline" become "meet deadline"),
ready to be used as input for UCS.
Alternatively, you can collect a frequency list like this:

  tabulate A match lem, matchend lem > "| sort | uniq -c | sort -nrk1 > vn.f.txt";

More fun with cqp:

  set MatchingStrategy longest;

  [lem = "cause" & pos = "V.*"] [pos = "NN.*"]+;

  [lem = "cause" & pos = "V.*"] [pos = "DT"]? [pos = "JJ"]* [pos = "NN.*"]+
    ([word = "of"] | [word = "and"])? [pos = "DT"]? [pos = "JJ"]* [pos = "NN.*"]+;

  "as" []{1,3} "as" within s;
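The sort | uniq -c | sort -nrk1 pipeline that cqp pipes its tabulate output
into can also be run on its own at the Unix shell, e.g. on a pairs.txt file
produced earlier. A minimal sketch (the sample pairs below are made up for
illustration; real pairs come from the tabulate command above):

```shell
# Build a toy tab-delimited pairs.txt, shaped like tabulate's V-N output.
printf 'meet\tdeadline\nmeet\tdeadline\ncause\tproblem\n' > pairs.txt

# Group identical pairs, count them, and sort by frequency, highest first --
# the same pipeline embedded in the tabulate command above.
sort pairs.txt | uniq -c | sort -nrk1 > vn.f.txt

# The "meet deadline" pair (count 2) now comes before "cause problem" (count 1).
cat vn.f.txt
```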