BootCaT: Simple Utilities to Bootstrap Corpora and Terms from the Web

Read the readme file (newer documentation provided with the updates below!)

Download the BootCaT Tools.

IMPORTANT UPDATE (February 2007): Please also download the following archive, which includes new versions of some scripts, additional scripts for other useful tasks such as boilerplate stripping and basic de-duplication, and, most importantly, a script kindly provided by Cyrus Shaoul that uses Yahoo! instead of Google as the search engine back-end (this has become vitally important because Google recently stopped providing new API keys, and it is not clear whether it will continue to support its service):

Download the Updates.

If you use the BootCaT toolkit, we would be very glad to hear your feedback. Please send email to baroni AT unitn it and/or silvia AT sslmit unibo it.

If you publish work based on the BootCaT tools, please cite:

M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004.

Other papers reporting on experiments with the BootCaT tools:

M. Baroni and M. Ueyama. 2004. Retrieving Japanese specialized terms and corpora from the World Wide Web. Proceedings of KONVENS 2004.

S. Sharoff. 2006. Creating general-purpose corpora using automated search engine queries. In Baroni and Bernardini (eds.) Wacky! Working papers on the Web as Corpus. Bologna: GEDIT.

BootCaT is a suite of Perl scripts, and you may copy or redistribute it under the same terms as Perl itself.
