2nd Web as Corpus Workshop

In conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics

Trento, Italy

April 3, 2006

Co-chairs: Adam Kilgarriff and Marco Baroni

Although a growing body of work has shown that the World Wide Web is a mine of language data of unprecedented richness and ease of access (see, e.g., the papers collected in Kilgarriff and Grefenstette, 2003), many fundamental issues about the viability and exploitation of the Web as a linguistic corpus are only starting to be tackled. These range from Web frequency distributions and registers, to efficient handling of massive data sets, to copyright. Research on the Web as corpus is currently at a very exciting stage: increasing evidence points to the enormous potential of the Internet as a source of linguistic data, but we are still far from a working, fully-fledged linguists' search engine.

We invite submissions which:

Preference will be given to projects where Web data are downloaded and processed directly, rather than via search engine interfaces.

Preliminary Program

9:00-9:30 Adam Kilgarriff and Marco Baroni - Introduction

9:30-10:00 Arno Scharl and Albert Weichselbraun - Web coverage of the 2004 US presidential election

10:00-10:30 Rüdiger Gleim, Alexander Mehler and Matthias Dehmer - Web corpus mining by instance of Wikipedia

10:30-11:00 break

11:00-11:30 Masatsugu Tonoike, Mitsuhiro Kida, Toshihiro Takagi, Yasuhiro Sasaki, Takehito Utsuro and Satoshi Sato - A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the web

11:30-12:00 Gemma Boleda, Stefan Bott, Rodrigo Meza, Carlos Castillo, Toni Badia and Vicente López - CUCWeb: a Catalan corpus built from the web

12:00-12:30 Paul Rayson, James Walkerdine, William H. Fletcher and Adam Kilgarriff - Annotated web as corpus

12:30-2:30 lunch

2:30-3:00 András Kornai, Péter Halácsy, Viktor Nagy, Csaba Oravecz, Viktor Trón and Dániel Varga - Web-based frequency dictionaries for medium density languages

3:00-3:30 Cédrick Fairon - Corporator: A tool for creating RSS-based specialized corpora

3:30-4:00 Demos, part 1

4:00-4:30 break

4:30-4:50 Demos, part 2

4:50-5:20 Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel Cruz, Huiyong Xiao and Rajen Subba - The problem of ontology alignment on the web: a first report

5:20-5:50 Kie Zuraw - Using the web as a phonological corpus: a case study from Tagalog

5:50-6:00 Organization, next meeting, closing

Program Committee

Toni Badia
Marco Baroni (co-chair)
Silvia Bernardini
Massimiliano Ciaramita
Barbara Di Eugenio
Roger Evans
Stefan Evert
William Fletcher
Rüdiger Gleim
Gregory Grefenstette
Péter Halácsy
Frank Keller
Adam Kilgarriff (co-chair)
Rob Koeling
Mirella Lapata
Anke Lüdeling
Alexander Mehler
Drago Radev
Philip Resnik
German Rigau
Serge Sharoff
David Weir

