Read the readme file (newer documentation provided with the updates below!)
IMPORTANT UPDATE (February 2007): Please also download the following archive. It includes new versions of some scripts, scripts for other useful tasks such as boilerplate stripping and basic de-duplication, and, most importantly, a script kindly provided by Cyrus Shaoul that uses Yahoo! instead of Google as the search engine back-end. The Yahoo! script has become vitally important because Google recently stopped issuing new API keys, and it is not clear whether it will continue to support its service:
If you use the BootCaT toolkit, we would be very glad to receive your feedback. Please send email to baroni AT unitn it and/or silvia AT sslmit unibo it.
If you publish work based on the BootCaT tools, please cite:
M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004.
Other papers reporting on experiments with the BootCaT tools:
M. Baroni and M. Ueyama. 2004. Retrieving Japanese specialized terms and corpora from the World Wide Web. Proceedings of KONVENS 2004.
S. Sharoff. 2006. Creating general-purpose corpora using automated search engine queries. In Baroni and Bernardini (eds.) Wacky! Working papers on the Web as Corpus. Bologna: GEDIT.
BootCaT is a suite of Perl scripts, and you may copy or redistribute it under the same terms as Perl itself.