Web as Corpus Workshop at Corpus Linguistics 2005

Post-Workshop Update

The workshop is now over. We (the organizers) greatly enjoyed it, and we hope that it was a good experience for the people who attended as well.

In the program section, you can find some of the materials used by the speakers.

For information on some Web-as-Corpus activities, you can visit the WaCky project page.

Motivation

The World Wide Web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette, 2003). A growing body of studies has shown that simple algorithms using Web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (e.g., Turney 2001), despite the many peculiarities of data that might be used in this way.

Current Internet-based linguistic studies differ in the strategies they use to access Web data. For example, some researchers collect frequency data directly from commercial search engines (e.g., Turney 2001). Others use a search engine to find relevant pages, and then retrieve the pages to build a corpus (e.g., Ghani et al. 2001). Still others build a corpus by spidering the web and manage the data with an ad hoc search engine (e.g., Terra and Clarke 2003).

Different approaches to sharing web-derived data have also been proposed. For example, some researchers make web-mining tools available (e.g., Baroni and Bernardini 2004) while others have proposed prototypes of Internet search engines for the linguists' community (Kehoe and Renouf 2002, Fletcher 2002, Kilgarriff 2003, Resnik and Elkiss 2003).

Many fundamental issues about the viability and exploitation of the web as a linguistic corpus must still be explored, or are just starting to be tackled. These issues range from word frequency distributions on the web to efficient handling of massive data sets, to the legal standing of web indexing.

Thus, we believe that research on the web as corpus is currently at a very exciting stage: increasing evidence points to the enormous potential of the Internet as a source of linguistic data, but we are still far removed from anything like a working, fully-fledged linguist's search engine.

This full-day workshop will provide a general introduction to current research on the web as corpus as well as practical advice, concrete examples and moments of discussion among participants working on similar web-corpus-related tasks (e.g., in ontological engineering, terminology/specialized domains/language teaching, "classic" NLP tasks, general linguistics or extraction of world knowledge).

Some of the topics that are going to be covered in the tutorials are:

General overview of web-as-corpus work
Building large/general and small/special-purpose web corpora
Web crawling for linguistic purposes
(Near-)duplicate detection, boilerplate removal, language identification
Linguistic annotation
Working with non-Latin-1 languages
Indexing and retrieval from large document collections
Prospective interfaces
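
As a rough illustration of one topic in the list above, near-duplicate detection is often approached by comparing the sets of word n-grams ("shingles") that two pages contain. The Python sketch below is not taken from the workshop materials; the function names, the shingle size of 3 and the 0.8 similarity threshold are all assumptions chosen for illustration.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose shingle overlap meets the threshold."""
    sets = [shingles(d, n) for d in docs]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical example documents, for illustration only.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "completely different text about corpus linguistics",
]
pairs = near_duplicates(docs)
```

Comparing every pair of documents is quadratic in the number of pages, so this sketch only suits small collections; a large crawl would need a hashing-based approximation instead.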

Program

Web as Corpus Workshop

In conjunction with Corpus Linguistics 2005

Birmingham University, UK
14th July 2005

Co-chairs: Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff

9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of the workshop, overview of program. PDF version

10:00-10:30 Marco Baroni (University of Bologna) - Large crawls of the web for linguistic purposes

10:30-11:00 coffee break

11:00-12:00 Marco Baroni (University of Bologna) and Serge Sharoff (University of Leeds) - Creating specialized and general corpora using automated search engine queries

12:00-13:00 Small groups arranged around the participants' research purposes

13:00-14:30 lunch break

14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing web-derived text (or: Working with very messy data). Handout

15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff (Lexicography MasterClass) - Indexing and interfaces. Stefan's slides

16:00-16:30 coffee break

16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) - Representing genre-specific websites

17:00-17:30 Small groups on "what are critical next steps for Web-as-Corpus activity?"

17:30-18:10 Plenary: where next?
