The BootCaT Toolkit

Simple Utilities for Bootstrapping Corpora and Terms from the Web
=================================================================

                           version 0.1.2

                        Copyright (c) 2003
                Marco Baroni (baroni@sslmit.unibo.it)
              Silvia Bernardini (silvia@sslmit.unibo.it)
  	     	    
	http://sslmit.unibo.it/~baroni/tools_and_resources.html


Despite certain obvious drawbacks (e.g. lack of control, sampling,
documentation etc.), there is no doubt that the WWW is a mine of
language data of unprecedented richness and ease of access.

It is also the only viable source of ``disposable'' corpora (Varantola
2003) built ad hoc for a specific purpose (e.g. a translation or
interpreting task, the compilation of a terminological database,
domain-specific machine learning tasks). These corpora are essential
resources for language professionals who routinely work with
specialized languages, often in areas where neologisms and new terms
are introduced at a fast pace and where standard reference corpora
have to be complemented by easy-to-construct, focused, up-to-date text
collections.

While it is possible to construct a web-based corpus through manual
queries and downloads, this process is extremely time-consuming. The
time investment is particularly unjustified if the final result is
meant to be a single-use corpus.

The perl scripts included in the BootCaT toolkit implement an
iterative procedure to bootstrap specialized corpora and terms from
the web, requiring only a list of ``seeds'' (terms that are expected
to be typical of the domain of interest) as input.

In implementing the algorithm, we followed the old UNIX adage that
each program should do only one thing, but do it well. Thus, we
developed a small, independent tool for each separate subtask of the
algorithm.

As a result, BootCaT is extremely modular: One can easily run a subset
of the programs, look at intermediate output files, add new tools to
the suite, or change one program without having to worry about the
others.

Once you install the scripts, documentation about each of them
(including usage and explanation of command line options) is available
by typing:

name_of_the_program.pl -h

Sometimes, perldoc documentation is also available by typing:

perldoc name_of_the_program.pl

This document describes installation, it provides a high level
description of the procedure to create corpora and extract terms, it
presents an example of how the scripts can be used with detailed
comments and it provides licensing information.


INSTALLATION
============

We have used the scripts on computers running the Linux and Mac OS X
operating systems with perl v.5.8.0. We have no reason to believe that
they would not run on other OSs of the UNIX family (including cygwin),
whereas there could be problems on Windows, since some of the scripts
call UNIX text utilities such as cat (if you try, e.g. under cygwin,
and you have problems, please let us know). Here, we assume that you
are using the scripts on the UNIX command line.

After you download the archive, decompress it by typing the following
command at the prompt:

tar xvzf BootCaT-0.1.tar.gz

This will generate a BootCaT-0.1 directory, containing this readme
file, the examples directory and the following scripts:

add1_smoothing.pl
basic_tokenizer.pl
build_random_tuples.pl
collect_mw_terms.pl
collect_urls_from_google.pl
connect_bi_connectors.pl
doc_delimited_uniq.pl
filter_unigrams.pl
get_connector_grams.pl
get_top_percentage.pl
log_odds_ratio.pl
print_good_ngrams.pl
print_pages_from_url_list.pl
print_rank.pl
simple_filter.pl

The examples directory contains the output of the experiment we
describe below.

Move BootCaT-0.1 wherever you want and add its path to the PATH
variable.

If you use tcsh, add something like the following line to the .tcshrc
file:

setenv PATH "${PATH}:/home/marco/sw/BootCaT-0.1"

If you use the bash shell, add something like the following line to
.bashrc:

PATH=$PATH:$HOME/sw/BootCaT-0.1

That's it. You should now be able to run the scripts and to look at
their documentation from wherever you are, without specifying the
path to BootCaT.

Some of the modules required by the scripts are not part of the
standard perl distribution. If, when you run a script, perl complains
about not being able to locate a certain module in @INC, you will have
to install it.

If you are lucky, the following will be sufficient to install the
missing module:

sudo perl -MCPAN -e shell
cpan> install Name_Of_Module
...
cpan> quit

In order to use the script collect_urls_from_google.pl, you will need
to obtain a Google API key from the Google Web API page:

http://www.google.com/apis

This entitles you to 1000 automated searches per day, for a maximum of
10000 results. If this is not enough for your purposes, you may
consider engaging in some ``Google-scraping''. Until now, we never
needed to retrieve more than 10000 pages per day, and we prefer to use
the Google API since it should provide a more robust and stable
interface to the Google engine than the web scraping approach.


THE BOOTCAT PROCEDURE
=====================

The basic idea is very simple: Build a corpus by searching Google
(http://www.google.com/) for a small set of seed terms, extract new
(single-word) terms from this corpus, use the latter to build a new
corpus via a new set of Google queries, extract new terms/seeds from
this corpus and so forth. The final corpus and unigram term list are
then used to extract a list of multi-word terms. These are sequences
of words that must respect a set of constraints on their structure,
frequency and distribution.


Initial seed selection
----------------------

The bootstrapping process starts with a small list of seeds that are
expected to be representative of the domain under investigation. The
initial seed selection can be manual or automated (below, we describe
an experiment in which we extracted the initial seeds automatically).

We found that, for well-defined specialized domains, a small list of
seeds (in the 5-to-15 range) is typically sufficient, and we obtained
interesting results by starting with as little as two seeds.


Bootstrap of corpora and single terms
-------------------------------------

This is the core of the algorithm. The seed terms are randomly
combined and each combination is used as a Google query string.  The
top n pages returned for each query are retrieved and formatted as
text.

New single-word seeds are extracted from the corpus of retrieved pages
by comparing the frequency of occurrence of each word in this set with
its frequency of occurrence in a reference corpus. In particular, we
compare frequencies using the log odds ratio measure (see e.g. Everitt
1992, 2.8).

We compute the log odds ratio by assuming the following contingency
table:

                 curr_w=w     curr_w!=w
 corpus=spec        a            b
 corpus=ref         c            d

where a is the frequency of a word in the specialized corpus, b is the
size of the specialized corpus minus a, c is the frequency of the word
in the general corpus and d is the size of the general corpus minus c.

The odds ratio is given by ad/cb, and the log odds ratio is:

log_odds_ratio = ln(ad/cb) = ln(a) + ln(d) - ln(c) - ln(b)

Since it is almost sure that not all words in the specialized corpus
will occur in the general corpus, we provide a script to perform add-1
smoothing (Jurafsky and Martin 2001, 6.3) of the reference
corpus frequencies.

In future versions of the toolkit, we would like to support other
statistics, beyond the log odds ratio, and more sophisticated smoothing
schemes.

We provide frequency lists for English and Italian reference corpora
from the same site from which BootCaT can be downloaded.

At this point, random combinations of the newly extracted seed terms
are used for a new round of Google queries and a new corpus is created
by retrieving and formatting the top n pages found by the search
engine.

The iterative term extraction / corpus downloading procedure is
repeated as many times as desired (e.g., until the corpus reaches a
certain size, or the quality of seeds starts decreasing). In our
experiments, we rarely found the need to repeat the process more than
two or three times.

Several important search parameters have to be controlled by the user,
such as the number of queries to be issued for each iteration, the
number of seeds combined to build a query, the number of pages to be
retrieved for each query, etc.

After a corpus has been downloaded, the user must decide whether to
base the frequency comparison for seed extraction on token
frequencies, document frequencies or a combination of both.

Once the words are ranked by odds ratio, the user has to decide the
proportion of words at the top of the list that are to be picked as
seeds for the next round.

Depending, probably, on the quality of the initial seeds, the user
will also have to decide whether to keep the corpora and term lists
extracted during each round, or just the ones constructed during the
last round.

Until now, we have made this kind of choices on a trial and error
basis. The modular nature of the toolkit allows one to inspect the
intermediate results and to experiment with various parameters at each
stage.


Extraction of multi-word terms
------------------------------

If emphasis is on term extraction, we can use the list of single-word
terms and the corpus constructed in the previous stage to look for
multi-word terms.

The first step of this phase is to extract a list of single- and
two-word connectors from the corpus, by looking for words and bigrams
that frequently occur between two single-word terms. We also extract a
list of stop words, which are, simply, words with a very high document
frequency that are not connectors.

At this point, we can look for multi-word terms, i.e. sequences of
words that meet the following constraints:

- they have frequency above a certain threshold (dependent on length);

- they contain at least one single-word term;

- they do not contain stop words;

- they may contain connectors, but these cannot occur at the edges nor
  be adjacent to each other;

- a candidate multi-word term cannot be part of a longer multi-word
  term with frequency above k*fq, where k is a constant between 0 and
  1 (but typically much closer to the upper end of the range) and fq
  is the frequency of the current term; in other words, a multi-word
  term cannot be part of a longer term with frequency close to its
  own;

- conversely, a multi-word term cannot contain a shorter multi-word
  term with frequency above (1/k)*fq.

The multi-word terms are searched recursively. Starting with bigrams,
we look left and right for a n+1gram term containing the current ngram
and meeting the constraints we just listed, except the one banning
edge connectors (otherwise, we would not find longer terms with inner
connectors). For each seed bigram, the longest well-formed term
containing it and without edge connectors is returned (this, of
course, can be the bigram itself).

Again, the user must set various parameters, such as the minimum
frequency for bigram terms and the value for the constant k (the
minimum frequency threshold for longer terms will follow from these
two parameters).

Although we never tried, it would be interesting (and easy) to add a
filter to keep only bigrams with a high mutual information and/or
log-likelihood ratio as possible starting points for the recursive
multi-word term search procedure.

If the relevant resources are available, it could also be useful to
filter out multi-word terms that do not match certain POS-patterns.


Efficiency issues
-----------------

The whole procedure is implemented in perl, and we did not worry about
efficiency in the implementation (although we did call efficient
external command-line tools such as sort wherever possible). This
means that, if you are using the scripts with huge corpora, you will
probably run into trouble. We experimented with corpora of about 10M
words, and we had no efficiency problem (on a G4 running OS X 10.2
with 768 MB of RAM and a dual 450MHz processor no script took more
than a few minutes to run).


Related work
-----------

Our corpus construction method is very similar to the one proposed by
Ghani et al. (2001) for minority language corpora.

The idea of comparing frequencies in specialized and reference corpora
to look for terms typical of the former is fairly common. See, for
example, Rayson and Garside (2000).

The multi-word term extraction method has some similarities with the
ones proposed by Enguehard and Pantera (1995) (in particular, the
connector idea) and Pantel and Lin (2001) (in particular, the
recursive multi-word term search).

However, as far as we know, we are the first to propose and implement
a procedure to construct specialized corpora from the web while at the
same time extracting terms from the same domain.


AN EXAMPLE EXPERIMENT
=====================

We need to come up with a corpus and a term list to aid a technical
translator who needs to translate the following psychiatric article:

Fleisher W., D. Staley, P. Krawetz, N. Pillay, J. Arnett,
J. Maher. 2002. A Comparative Study of Trauma-Related Phenomena in
Subjects With Pseudoseizures and Subjects With Epilepsy. American
Journal of Psychiatry 159: 660-663.


Initial seed selection
----------------------

We use as initial seeds all the words in the abstract of the study
that do not occur in the Brown corpus. In this way, we automatically
extract the following six seeds:

$ cat seeds 
dissociative
epilepsy
interventions
posttraumatic
pseudoseizures
ptsd


Corpus/unigram term bootstrap
-----------------------------

We form 15 random triplets from the seeds:

$ build_random_tuples.pl -l15 -n3 seeds > triplets

We use the triplets to perform a set of Google searches, asking Google
to return a maximum of 20 URLs per query, and to look for English
pages only:

$ collect_urls_from_google.pl -k GOOGLE_API_KEY -l English \
-c 20 triplets > first_url_list
$ wc first_url_list
 315 360 17988 first_url_list

Many URLS found by the different queries, of course, are duplicates:

$ grep -v "CURRENT_SEED" first_url_list | sort | uniq | wc
 181 181 10499

We retrieve a first corpus:

$ grep -v "CURRENT_SEED" first_url_list | sort | uniq |\
print_pages_from_url_list.pl > first_corpus &

$ wc first_corpus 
  55542  396308 2748410 first_corpus

We extract token frequencies (basic_tokenizer.pl with the specified
options removes strings containing characters outside the [a-zA-Z\']
set, converts the remaining strings to lower case and prints the
output one word per line):

$ grep -v "CURRENT URL" first_corpus | basic_tokenizer.pl -aei - |\
sort | uniq -c |gawk '(length($2)>2)&&($1>2){print $2,$1}' > first_fqs

Now, we will compare these frequencies to those in the Brown corpus,
which we previously collected in the brown_ci_tok_fqs file. This a
frequency list from the Brown tokenized with all words converted to
lower case (ci stands for case insensitive).

The script add1_smoothing.pl will produce a list of all the words in
our specialized frequency list with their frequency in the Brown
corpus + 1, i.e. if a word has a frequency of 3 in the Brown, its
output frequency will be 4, if a word did not appear in the Brown list
(i.e., it had a frequency of 0 there), its output frequency will be 1.

Moreover, with the option -t set the script also prints to STDERR an 
estimate of the number of words in the *smoothed* Brown, given by the
sum of all the frequencies in the Brown frequency list + 1 per word +
the number of words that are in the specialized list but were not in
the original Brown list.

$ add1_smoothing.pl -t brown_ci_tok_fqs first_fqs
getting the types in general corpus
adding one
estimate of size of smoothed corpus
1054850
creating smoothed general corpus fq list for words in specific corpus

The list created by add1_smoothing.pl is called brown_ci_tok_fqs.add1

Sanity check:

$ wc first_fqs brown_ci_tok_fqs.add1    
   8991   17982  102323 first_fqs
   8991   17982  103335 brown_ci_tok_fqs.add1
  17982   35964  205658 total

We align the specialized and Brown corpus frequencies and we compute
the log odds ratios (notice that the script to compute log odds ratios
requires the size of the specialized corpus and that of the smoothed
Brown as input arguments):

$ paste first_fqs brown_ci_tok_fqs.add1 | gawk '{print $1,$2,$4}' |\
 log_odds_ratio.pl 396308 1054850 - | sort -nrk2 > first_odds

After visual inspection, we decide to take the top 40 words from this
list as the seeds for the second run:

$ head -40 first_odds | gawk '{print $1}' | sort > second_seeds

Since we have more seeds, we generate more triples:

$ build_random_tuples.pl -l30 -n3 second_seeds > second_triples

And we follow the usual corpus creation steps (since we will also keep
the first corpus, this time we need to remove the URLs we already
retrieved from the new URL list):

$ collect_urls_from_google.pl -k GOOGLE_API_KEY -l English \
-c 20 second_triples > second_url_list
$ wc second_url_list 
    594     684   34883 second_url_list
$ grep -v "CURRENT_SEED" second_url_list | sort | uniq | wc
    540     540   32367
$ simple_filter.pl -s first_url_list second_url_list |\
 grep -v "CURRENT_SEED" | sort | uniq | wc
    516     516   31017
$ simple_filter.pl -s first_url_list second_url_list |\
 grep -v "CURRENT_SEED" | sort | uniq |\
 print_pages_from_url_list.pl > second_corpus &
$ wc second_corpus
 153804 1126218 7846802 second_corpus

At this point, we feel that we have enough data for our current
purposes. We put the two corpora together: this is our ``final''
corpus:

$ cat first_corpus second_corpus > final_corpus
$ wc final_corpus 
 209346 1521736 10595212 final_corpus


Extraction of final unigram term list
-------------------------------------

We will now proceed to select a final set of unigram terms from this
corpus.

First, we tokenize the final_corpus, printing one word per line. This
time, since we are also interested in looking for multi-word terms,
and thus at the context of unigrams, we are more careful with
tokenization.

In future versions of the toolkit, we would like to add a better
tokenizer. For the moment, we tokenize the corpus with command-line
scripts, generating the following output format: one word per line,
plus lines with CURR_DOC to delimit one original web-page from the
other (there is one delimiter at the very beginning, but no delimiter
at the end). Moreover, words or sequences of words that contain
non-alphabetic characters (except the dash and the apostrophe) are
replaced by the string _STOP_. Notice also that this time we preserve
capitalization.

$ perl -ne 'if(/CURRENT URL/){print "CURR_DOC\n";next} \
 s/\x92/\x27/g; s/^(.)/ _ $1/; s/\s\s/ _ /g; \
 s/\-\-/\- \-/g; s/[^\s+]+\.[^\s]+/ _ /g; \
 s/[^a-zA-Z\x27\s0-9\-]+/ _STOP_ /g; \
 s/([a-z])([A-Z])/$1 $2/g; s/\s+/\n/g; \
 print' final_corpus | egrep "[A-Za-z]" |\
 perl -ne 'BEGIN{$seen=0} \
 if ( /^CURR_DOC/ ) {print; next} if ( /^_STOP_/ ) \
 {if ($seen==0){$seen=1; print}next;} $seen=0; \
 s/^[\x27\-]+//g; \
 s/[\x27\-]+$//g; if (/[A-Za-z]/){print}' > final_tok_corpus

$ wc final_tok_corpus 
1818804 1818804 12140736 final_tok_corpus

We collect token and document frequencies (to obtain the latter, we
run the script doc_delimited_uniq.pl, which keeps only one instance of
a word for each document, so that then we can apply the same procedure
we use for token frequencies):

$ grep -v _ final_tok_corpus | sort | uniq -c |\
 gawk '{print $2,$1}' > final_tok_fqs
$ wc final_tok_fqs
  71967  143934  814595 final_tok_fqs
$ grep -v _STOP_ final_tok_corpus |\
 doc_delimited_uniq.pl "CURR_DOC" - | grep -v "CURR_DOC" |\
 sort | uniq -c | gawk '{print $2,$1}' > final_doc_fqs
$ wc final_doc_fqs
  71967  143934  809084 final_doc_fqs

We compute the log odds ratios:

$ add1_smoothing.pl -t brown_tok_fqs final_tok_fqs
getting the types in general corpus
adding one
estimate of size of smoothed corpus
1110821
creating smoothed general corpus fq list for words in specific corpus
$ wc final_tok_fqs brown_tok_fqs.add1
  71967  143934  814595 final_tok_fqs
  71967  143934  809823 brown_tok_fqs.add1
 143934  287868 1624418 total

$ paste final_tok_fqs brown_tok_fqs.add1 | gawk '{print $1,$2,$4}' |\
 log_odds_ratio.pl 1819103 1110821 - | sort -nrk2 > final_tok_odds

Now we do doc fqs:

$ add1_smoothing.pl brown_doc_fqs final_doc_fqs
getting the types in general corpus
adding one
creating smoothed general corpus fq list for words in specific corpus

$ wc brown_doc_fqs.add1 final_doc_fqs 
  71967  143934  807761 brown_doc_fqs.add1
  71967  143934  809084 final_doc_fqs
 143934  287868 1616845 total

In this case, estimate of smoothed corpus ``size'' (which should be
the maximum value that the frequency of a word could in principle
reach) will be given by the number of documents in the original corpus
plus 1 (if a word occurred in all documents of the original corpus,
after add-1-smoothing its frequency will be equal to the number of
documents in the original corpus plus 1). The Brown corpus contains
500 documents, thus its smoothed size is 501.

For the specialized corpus, the number of documents is:

$ grep CURR_DOC final_tok_corpus | wc
    533     533    4797

We need to add another 1 to both totals, otherwise words occurring in
all documents would have 0 counts in the b or d cells of the
contingency table (see above). This would be a problem since the log
odds ratio script takes logarithms of these counts.

$ paste final_doc_fqs brown_doc_fqs.add1 | gawk '{print $1,$2,$4}' |\
 log_odds_ratio.pl 534 502 - | sort -nrk2 > final_doc_odds

This time, we will combine token and document frequencies on the basis
of their rank.

We combine them in the following simple way: For each word, we compute
its rank in the list of words ordered by token-frequency-based odds
ratio and its rank in the list of words ordered by
document-frequency-based odds ratio. Then, we simply sum the two
ranks.

For example, if a certain word has the highest odds ratio in the list
based on token frequency, and the third highest odds ratio in the list
based on document frequency, its sum-of-ranks will be 1+3=4.

The highest the ranks of a word in the odds ratio lists, the lower the
value of its sum-of-ranks will be. Thus, the sum-of-ranks list is
ordered by increasing sum-of-ranks value (a word with a sum-of-ranks
of 4 is a more likely term than a word with a sum-of-ranks of 400).

Of course, one could experiment with different ways to combine the
measures, or simply pick one of the two.

$ print_rank.pl -f2 final_tok_odds | gawk '{print $2,$1}' |\
 sort > final_tok_odd_ranks
$ print_rank.pl -f2 final_doc_odds | gawk '{print $2,$1}' |\
 sort > final_doc_odd_ranks

$ paste final_tok_odd_ranks final_doc_odd_ranks |\
 gawk '{c=$2+$4;print $1,c}' | sort -nk2 > final_combined_odd_ranks

We will pick the top 2.5% from the combination as the final set of our
unigram terms:

$ get_top_percentage.pl 2.5 final_combined_odd_ranks | gawk '{print $1}' |\
 sort > candidate_uniterms
$ wc candidate_uniterms 
   1800    1800   15619 candidate_uniterms


Multi-word term extraction
--------------------------

Now we start looking for multi-word terms.

First, we need to collect connectors. These are simply one- and
two-word sequences that often occur between two unigram terms.

Here, we look at type frequency: We count the number of distinct *term
frames* in which each potential connector occurs. Token frequency is
only taken into account in that if a term+connector+term sequence did
not occur at least three times in the corpus, it is not counted.

For unigrams, we keep the top 5% as potential connectors, for bigrams
we keep the top 2.5%.

$ get_connector_grams.pl 3 candidate_uniterms final_tok_corpus |\
 grep -v _ | sort | uniq -c |\
 perl -ane 'if ($F[0]>2){print $F[2];print "\n";}' | sort | uniq -c |\
 gawk '{print $2,$1}' |  sort -nrk2 | get_top_percentage.pl 5 - |\
 gawk '{print $1}' | sort > uni_connectors
$ get_connector_grams.pl 4 candidate_uniterms final_tok_corpus |\
 grep -v _ | sort | uniq -c |\
 perl -ane 'if ($F[0]>2){print join " ",@F[2...($#F-1)];print "\n";}' |\
 sort | uniq -c | gawk '{print $2,$3,$1}' | sort -nrk3 |\
 get_top_percentage.pl 2.5 - | gawk '{print $1,$2}' |\
 sort > bi_connectors

$ wc *connectors
     11      22     103 bi_connectors
     13      13      61 uni_connectors
     24      35     164 total

These are the two lists:

[einstein ~/web_and_terms/data] baroni$ cat uni_connectors
J
and
for
from
in
of
on
or
personality
status
stress
to
with

[einstein ~/web_and_terms/data] baroni$ cat bi_connectors
In Your
Professor of
and Other
and other
characterized by
in the
is a
of a
of the
to the
veterans with

As you can see, they are rather noisy. One could manually intervene to
remove bad connectors such as *personality* or *veterans with*. However,
inspection of the final output in various experiments suggests that
bad connectors tend to disappear from the final multi-word term
lists anyway.

The simplest way to deal with bigram connectors is to treat them as
unigrams. We re-tokenize the corpus replacing each bigram connector
with a single token created by joining the two parts with an
underscore (of the -> of_the):

$ connect_bi_connectors.pl bi_connectors final_tok_corpus \
 > final_bitok_corpus  
$ wc final_*tok_corpus
1801315 1801315 12140736 final_bitok_corpus
1818804 1818804 12140736 final_tok_corpus
3620119 3620119 24281472 total

We build a stop word list with words that have a high document
frequency in the Brown corpus and are not unigram terms nor connectors
(we could probably also use the specialized corpus for the same
purpose):

$ sort candidate_uniterms uni_connectors > keep_list
$ sort -nrk2 brown_doc_fqs | filter_unigrams.pl -s keep_list - |\
 get_top_percentage.pl 1 - | gawk '{print $1}' | sort > stop_words

We add the token _STOP_ to this list (recall that in the tokenized
specialized corpus strings that contain non-alphabetic characters
except the dash and the apostrophe were replaced by _STOP_):

$ echo "_STOP_" >> stop_words 
$ wc stop_words 
    556     556    3274 stop_words

OK, now we have the basic ingredients to collect ngrams.

Let us get all the up_to_5_grams that contain at least one candidate
term and no stop word (one could of course consider longer ngrams as
well):

$  gawk \
 '/CURR_DOC/{print "_STOP_"} $0 !~/CURR_DOC/{print}' \
 final_bitok_corpus |\
 print_good_ngrams.pl 5 stop_words candidate_uniterms - | sort |\
 uniq -c | perl -ane '$fq = shift @F; print join(" ",@F); \
 print " $fq\n";' > all_grams
$ wc all_grams 
 227361  985245 6250546 all_grams

Notice that, when the relevant resources are available, one can use
POS-pattern matching to filter the lists of connectors, stop words and
ngrams.

Next, we extract all the bigrams from the previous list, we sort them
and we preserve the top 5%:

$ gawk '(NF==3){print}' all_grams | sort -nrk3 |\
 get_top_percentage.pl 5 - > top_bigrams
$ wc top_bigrams 
   3158    9474   56989 top_bigrams

An interesting alternative to absolute frequency would be to pick
bigrams using mutual information or log-likelihood.

Now, we will use these bigrams as the seeds in the recursive procedure
to look for multi-word terms.

We start with the shortest ngrams (bigrams) and we expand each of them
towards the left and the right until we find the longest ngram that
has no connectors at the edge and which has a frequency equal to or
above k*component_fq, where component_fq is the frequency of the most
frequent n_minus_1_gram composing it.

Notice that the lowest frequency items in the bigram list have a
frequency of 10:

$ tail -1 top_bigrams 
AC splits 10

As we just said, for each ngram (from 3grams up), the minimum
acceptable frequency must be equal to k times the frequency of any of
the n-1grams (read: n minus 1 grams) composing it. Thus, the minimum
frequency of a 2+ngram is equal to 2gram_fq*k^n (e.g., the minimum
frequency for a 3gram containing a certain 2gram is 2gram_fq*k^1; for
a 4gram it is 2gram_fq*k^2, and so on).

The longest ngram we collected is a 5gram, i.e., a 2+3gram. Thus, its
minimum frequency is min_2gram_fq*k^3. Given that the minimum
frequency for the 2grams is 10 and that it is unlikely that we will
consider k's below .75, the minimum frequency for a 5gram will be:
10*.75^3, i.e. 4.22, i.e., since we work with integers, 5.

Thus, we can trim all ngrams with frequency below 5 (we could even
trim on the basis of min_bigram_fq*k^n for each 2+ngram category, but
since we do not have efficiency problems with the current
setting we'll save ourselves the effort of doing this).

$  gawk '(NF>3){print}' all_grams |\
 perl -ane 'if ($F[$#F]>4){print}' > top_morethanbi_grams
$ wc top_morethanbi_grams 
   4382   19282  119946 top_morethanbi_grams

Finally, we can get the multi-word terms!

$ perl -ne 's/ /_/;print' uni_connectors bi_connectors |\
 collect_mw_terms.pl .75 - top_morethanbi_grams top_bigrams |\
 sort  > mw_terms 
$ wc mw_terms
   1507    3271   26554 mw_terms

The output term lists (candidate_uniterms and mw_terms), and the list
of all the URLs retrieved to build the final corpus (final_urls) are
available in the BootCaT archive, inside the examples directory.


Evaluation
----------

Evaluating the performance of an unsupervised algorithm is always
hard, since we cannot use part of the training data for testing
purposes. In this case, the situation is further complicated by the
inherent difficulty of estimating precision (is a certain web-page or
term a hit or a miss?), and the impossibility of estimating recall
(have we obtained an exhaustive list of all the terms/web-pages
pertaining to a given domain?)

A few simple attempts at quantitative evaluation have however proved
encouraging:

a) We manually extracted all the terms from the article on
pseudoseizures quoted above and checked the number of these terms also
present in the automatically-generated unigram term list created by
BootCaT. 38 out of 43 terms were indeed present in the BootCaT list
(88.3%).

b) The same procedure applied to multi-word terms showed that 11 out
of 29 terms were present (37.9%). A search through the corpus
retrieved 9 more terms (31%). The remaining 9 terms were not present
in either the corpus or the term list. However, a search for their
component single terms showed that at least one word from each
multi-term was present in the corpus, and in general this was enough
to understand the term as a whole. For instance, the term ``sleep
dysfunction'' was not attested in the corpus, which however included
122 occurrences of ``dysfunction'', with, among its immediate
left-hand collocates, ``behavioral'', ``anatomic'', ``family'',
``sexual'', ``lobe'' etc.

c) Lastly, 100 randomly selected multi-word terms were extracted from
the BootCaT list and classified according to their well-formedness and
relevance. Of these, 10 were incomplete or badly-formed (e.g. ``Can
epilepsy'', ``Psychiatry October''), 4 belonged to internet jargon
(e.g. ``Synergy cookie'', ``browser settings''), 13 were proper names
(e.g. ``Harden CL''), 32 were general medical terms (e.g. ``clinical
symptoms'', ``Environmental health'') and 41 were technical medical
terms in the field of psychiatry (e.g. ``abuse survivors'',
``clinically significant distress''), for a total of 73 ``bona fide''
terms over 100.

However, we also believe that the ultimate criterion to evaluate these
tools is the extent to which the intended users find them useful, and
prefer them to manual procedures.

In this perspective, we are currently collecting reports from trainee
translators and terminologists about their experience with the BootCaT
tools. While these reports do not provide quantitative assessments, we
believe they can be extremely instructive to assess the performance of
the tools in realistic settings.

For example, the trainee-translator who worked with us on extracting
English and Italian corpora of psychiatric articles about the
phenomenon of pseudoseizures found that the automatically generated
corpora contained nearly all the pages that she had previously
identified via manual searches, and about 50% more pages that seemed
to be, in the overwhelming majority of cases, relevant to the topic.


LICENSING INFORMATION
=====================

The BootCaT scripts are free software. You can redistribute them
and/or modify them under the same terms as Perl itself.

If you publish work based on the BootCaT tools, please quote Baroni
and Bernardini (2004).


REFERENCES
==========

Baroni, M. and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and
terms from the web. Proceedings of LREC 2004.

Enguehard, C. and L. Pantera. 1995. Automatic natural acquisition of
bilingual terminology. Journal of Quantitative Linguistics 2: 27-32

B. Everitt. 1992. The analysis of contingency tables. Second
edition. London: Chapman and Hall.

Ghani, R., R. Jones and D. Mladenic. 2001. Mining the Web to Create
Minority Language Corpora. CIKM 2001: 279-286

Jurafsky, D. and J. Martin. 2000. Speech and language processing: An
Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. Upper Saddle River:
Prentice-Hall.

Pantel, P. and D. Lin. 2001. A Statistical Corpus-Based Term
Extractor. Proceedings of AI 2001.

P. Rayson and Garside, R. 2000. Comparing corpora using frequency
profiling. Proceedings of Workshop on Comparing Corpora of ACL 2000:
1-6.

Varantola, K. 2003. Translators and disposable corpora. In
F. Zanettin, S. Bernardini and D. Stewart (eds.) Corpora in translator
education. Manchester: StJerome: 55-70