Italian ACOPOST Models Readme
		   =============================
	http://sslmit.unibo.it/~baroni/tools_and_resources.html


ACOPOST (http://sourceforge.net/projects/acopost) is an open source
Part-of-Speech tagging toolkit, containing a transformation-based
tagger, an example-based tagger, a HMM tagger and a maximum entropy
tagger.

This archive contains pre-trained models to tag Italian text using
three of the ACOPOST taggers: the transformation-based tagger, the
example-based tagger and the HMM tagger (for reasons we are trying to
understand, the maximum entropy tagger causes a segmentation fault
when it is run with our trained model).

This is version 1.1 of our pre-trained models (small changes in the
input files and in the tagset).

CONTENTS OF ARCHIVE
===================

The archive contains the following files:

- Readme.ItaACOPOSTModels: this readme file;
- ita_exclude_tags: list of tags that were excluded in training the et
unknown word model;
- ita_known.etf: feature file used to train et known word model;
- ita_known_tree: et known word model;
- ita_lex: lexicon file;
- ita_tagset.txt: the tagset we used to tag the data;
- ita_t3_ngrams: t3 model;
- ita_tbt.log: STDERR output of tbt training;
- ita_tbt_rules: tbt model;
- ita_tbt_templates: template file used to train tbt;
- ita_unknown.etf: feature file used to train et unknown word model;
- ita_unknown_tree: et unknown word model;
- simple_majority_tagger.pl: simple perl script to do majority
tagging;


TRAINING
========

We trained the models on a corpus of 180 manually tagged articles
randomly selected from the years 1985-88 of the Italian national daily
_la Repubblica_, for a total of about 107,500 tokens.

Manual tagging was performed by Federica Comastri, who worked on texts
that had been pre-tagged with the TreeTagger
(http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html).

Training and testing were performed by Marco Baroni.

Here are the commands we used for training. Before reading this,
please take a look at the documentation that comes with the ACOPOST
toolkit. We used version 1.8.4 of ACOPOST.

Our manually tagged corpus is in the file ita_trainset.cooked (which
we cannot release for copyright reasons):

$ wc ita_trainset.cooked 
   3719  216759 1207825 ita_trainset.cooked

We created the lexicon file in this way:

$ cooked2lex.pl < ita_trainset.cooked > ita_lex
3719 sentences
51 tags 17405 types 109009 tokens
  1     16421  94.346%     77822  71.390%
  2       912   5.240%     21962  20.147%
  3        71   0.408%      9179   8.420%
  4         1   0.006%        46   0.042%
Mean ambiguity A=1.371144

Entropy H(p)=4.073393

This is how we trained the t3 (HMM) tagger:

$ cooked2ngram.pl < ita_trainset.cooked > ita_t3_ngrams

We trained the tbt (transofrmation-based) tagger in the following
way (the files ita_tbt_templates and ita_tbt.log are included in the
ItaACOPOSTModels archive):

$ tbt -l ita_lex -m 4 -n 1 -o 2 -t ita_tbt_templates ita_tbt_rules \
< ita_trainset.cooked 2> ita_tbt.log &

We trained the et (example-based) tagger like this (the files
ita_known.etf, ita_exclude_tags and ita_unknown.etf are also in the
archive):

$ cooked2wtree.pl -a 3 ita_known.etf < ita_trainset.cooked \
> ita_known_tree
No. of features: 5 (from "../aco_input_files/ita_known.etf")
No. of sentences: 3719
No of words: 109009
Most frequent words:
            6076        ","
            3869        "di"
            3563        "."
            2215        "e"
            1941        "che"
Word at rank 100: "poi" (102 occurances)
Frequent word threshold: 102
Entropy: 4.073393
Features:
  0              TAG[-2] H==3.322 IG==0.751 S==3.711 GR==0.202
  1              TAG[-1] H==2.694 IG==1.379 S==3.665 GR==0.376
  2              WORD[0] H==1.364 IG==2.710 S==3.706 GR==0.731
  3             CLASS[0] H==0.142 IG==3.932 S==4.629 GR==0.849
  4             CLASS[1] H==2.441 IG==1.633 S==4.544 GR==0.359
Permutation: 3 2 1 4 0

$ cooked2wtree.pl -b 2 -e ita_exclude_tags ita_unknown.etf \
< ita_trainset.cooked > ita_unknown_tree
No. of features: 10 (from "../aco_input_files/ita_unknown.etf")
No. of sentences: 3719
No of words: 109009
Most frequent words:
            6076        ","
            3869        "di"
            3563        "."
            2215        "e"
            1941        "che"
Word at rank 100: "poi" (102 occurances)
Frequent word threshold: 102
Entropy: 4.073393
Features:
  0              TAG[-1] H==0.255 IG==3.819 S==0.965 GR==3.957
  1               CAP[0] H==0.316 IG==3.757 S==0.498 GR==7.545
  2            NUMBER[0] H==0.372 IG==3.702 S==0.424 GR==8.723
  3            HYPHEN[0] H==0.392 IG==3.681 S==0.409 GR==8.992
  4          LETTER[0,1] H==0.284 IG==3.789 S==1.070 GR==3.540
  5         LETTER[0,-4] H==0.340 IG==3.734 S==0.984 GR==3.794
  6         LETTER[0,-3] H==0.316 IG==3.758 S==0.923 GR==4.073
  7         LETTER[0,-2] H==0.288 IG==3.785 S==0.923 GR==4.102
  8         LETTER[0,-1] H==0.321 IG==3.752 S==0.809 GR==4.636
  9             CLASS[1] H==0.320 IG==3.754 S==1.030 GR==3.646
Permutation: 3 2 1 8 7 6 0 5 9 4


TESTING
=======

The taggers and various tagger combinations were tested on the
ita_trainset.cooked dataset in a series of 10-fold cross-validation
experiments (in each fold we trained the models with nine tenths of
the whole data, whereas the trained models we are providing were
trained on the full dataset).

The following table reports statistics about the percentage word-level
accuracy achieved by the three taggers and by the best performing
combinations in these experiments.

           HMM    TBT    ET     COMB   STCOMB
Min        93.64  94.08  92.09  94.41  94.47
Median     95.31  95.46  94.20  95.82  95.88
Mean       95.04  95.20  93.95  95.62  95.61
Max        96.02  96.11  95.47  96.47  96.46

COMB is a majority voter that, in case of ties, picks the HMM
tag.

STCOMB (for STacked COMBination) is a majority voter that uses a
HMM-TBT stack instead of TBT (i.e., the TBT tagger takes a corpus
pre-tagged by the HMM tagger as input). STCOMB also picks the HMM tag
in cases of ties.


TAGGING
=======

The corpus to be tagged should be in one-line per sentence format, and
the tokens (including punctuation marks) should be separated by space.

The corpus used for training is latin1-encoded, thus the corpus to be
tagged should also be in this encoding.

In order to tag a corpus using our models, follow the instructions in
the ACOPOST documentation.

For example:

$ t3 ita_t3_ngrams ita_lex < corpus > t3_tagged_corpus

$ tbt -l ita_lex -r ita_tbt_rules < corpus > tbt_tagged_corpus

$ et ita_known ita_unknown ita_lex < corpus > et_tagged_corpus

To tag with a t3/tbt stack:

$ tbt -l ita_lex ita_tbt_rules < t3_tagged_corpus \
> t3_tbt_tagged_corpus

We provide a simple script to perform majority tagging:

$ simple_majority_tagger t3_tagged_corpus tbt_tagged_corpus \
et_tagged_corpus > majority_tagger_corpus

The script only works with three corpora as argument. In case of ties,
it chooses the tag assigned by the first of the three corpora.


COPYRIGHT AND LICENSE INFORMATION
=================================

Copyright 2004, SSLMIT/SITLEC

The models are free resources. You may copy, edit or redistribute them
under the same terms as the ACOPOST toolkit.


FEEDBACK
========

baroni@sslmit.unibo.it