The workshop is now over. We (the organizers) greatly enjoyed it, and we hope that it was a good experience for the people who attended as well.
In the program section, you can find some of the materials used by the speakers.
For information on some Web-as-Corpus activities, you can visit the WaCky project page.
The World Wide Web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette, 2003). A growing body of studies has shown that simple algorithms using Web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (e.g., Turney 2001), despite the many peculiarities of data that might be used in this way.
Current Internet-based linguistic studies differ in the strategies they use to access Web data. For example, some researchers collect frequency data directly from commercial search engines (e.g., Turney 2001). Others use a search engine to find relevant pages, and then retrieve the pages to build a corpus (e.g., Ghani et al. 2001). Still others build a corpus by spidering the web and manage the data with an ad-hoc search engine (e.g., Terra and Clarke 2003).
Different approaches to sharing web-derived data have also been proposed. For example, some researchers make web-mining tools available (e.g., Baroni and Bernardini 2004), while others have proposed prototypes of Internet search engines for the linguists' community (Kehoe and Renouf 2002, Fletcher 2002, Kilgarriff 2003, Resnik and Elkiss 2003).
Many fundamental issues about the viability and exploitation of the web as a linguistic corpus must still be explored, or are just starting to be tackled. These issues range from word frequency distributions on the web to efficient handling of massive data sets, to the legal standing of web indexing.
Thus, we believe that research on the web as corpus is currently at a very exciting stage: increasing evidence points to the enormous potential of the Internet as a source of linguistic data, but we are still far from anything like a working, fully-fledged linguist's search engine.
This full-day workshop will provide a general introduction to current research on the web as corpus, as well as practical advice, concrete examples and opportunities for discussion among participants working on similar web-corpus-related tasks (e.g., in ontological engineering, terminology/specialized domains/language teaching, "classic" NLP tasks, general linguistics or extraction of world knowledge).
Some of the topics that are going to be covered in the tutorials are: