Trained models for tagging Italian texts with the ACOPOST taggers

ACOPOST is an open-source part-of-speech tagging toolkit containing a transformation-based tagger, an example-based tagger, an HMM tagger and a maximum entropy tagger.

From this page, you can download pre-trained models to tag Italian text using three of the ACOPOST taggers: the transformation-based tagger, the example-based tagger and the HMM tagger. The maximum entropy tagger is not included because, for reasons we are still trying to understand, it causes a segmentation fault when it is run with our trained model.

The models were trained using ACOPOST version 1.8.4.

We trained the models on a corpus of 180 manually tagged articles randomly selected from the years 1985-88 of the Italian national daily la Repubblica, for a total of about 107,500 tokens.

The following table reports statistics on the word-level accuracy (in percent) achieved by the three taggers and by their combinations in a series of 10-fold cross-validation experiments.

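|        | HMM   | TBT   | ET    | COMB  | STCOMB |
|--------|-------|-------|-------|-------|--------|
| Min    | 93.64 | 94.08 | 92.09 | 94.41 | 94.47  |
| Median | 95.31 | 95.46 | 94.20 | 95.82 | 95.88  |
| Mean   | 95.04 | 95.20 | 93.95 | 95.62 | 95.61  |
| Max    | 96.02 | 96.11 | 95.47 | 96.47 | 96.46  |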

HMM is the Hidden Markov Model tagger, TBT is the Transformation-Based Tagger and ET is the Example-based Tagger.

COMB is a majority voter that, in case of ties, picks the HMM tag. This was the best performing simple combination.

STCOMB (for STacked COMBination) is a majority voter that uses an HMM-TBT stack instead of plain TBT (i.e., the TBT tagger takes a corpus pre-tagged by the HMM tagger as input). Like COMB, STCOMB picks the HMM tag in case of ties. This was the best performing stacked combination.
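The voting rule shared by COMB and STCOMB can be sketched in a few lines. This is only a minimal illustration in Python, not the Perl script shipped in the archive: the function names `majority_vote` and `combine_tags` are hypothetical, and the sketch assumes that the output of the three taggers has already been read into aligned lists holding one tag per token.

```python
from collections import Counter


def majority_vote(hmm_tag, tbt_tag, et_tag):
    """Return the tag chosen by at least two taggers, or the HMM tag on a tie."""
    tag, votes = Counter([hmm_tag, tbt_tag, et_tag]).most_common(1)[0]
    # With three voters, a majority needs at least two votes.
    # If all three taggers disagree, fall back to the HMM tag.
    return tag if votes >= 2 else hmm_tag


def combine_tags(hmm_tags, tbt_tags, et_tags):
    """Apply the voting rule token by token to three aligned tag sequences."""
    if not (len(hmm_tags) == len(tbt_tags) == len(et_tags)):
        raise ValueError("tag sequences must have the same length")
    return [majority_vote(h, t, e) for h, t, e in zip(hmm_tags, tbt_tags, et_tags)]
```

For COMB, `tbt_tags` would hold the output of TBT run on its own; for STCOMB, it would hold the output of TBT run on text pre-tagged by the HMM tagger.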

The HMM tagger and TBT achieve very similar accuracy (TBT is slightly better, with a mean of 95.20% against 95.04%); both combinations outperform every individual tagger on all four statistics.
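For readers who want to run a comparable evaluation on their own data, the following sketch shows one way to compute word-level accuracy for a fold and then summarize a set of folds. It rests on an assumption: that the reported minimum, median, mean and maximum are taken over per-fold accuracies. The function names `word_accuracy` and `summarize` are hypothetical, and tags are assumed to be available as aligned lists.

```python
import statistics


def word_accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


def summarize(fold_accuracies):
    """Min, median, mean and max over a list of per-fold accuracies."""
    return {
        "min": min(fold_accuracies),
        "median": statistics.median(fold_accuracies),
        "mean": statistics.mean(fold_accuracies),
        "max": max(fold_accuracies),
    }
```

Calling `word_accuracy` once per held-out fold and passing the resulting values to `summarize` yields one column of a table like the one above.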

Since the models are trained on newspaper text, we expect accuracy to be lower on other types of text.

An alternative freely available pre-trained tagger for Italian texts is the TreeTagger.

Compared to our models, the TreeTagger has the great advantage of coming with a script that also performs tokenization and lemmatization.

On the other hand, ACOPOST used with our models allows you to experiment with different taggers and tagger combinations (the archive contains a simple Perl script to do majority tagging). Moreover, since our models have been trained on a larger corpus, they might achieve higher accuracy (this is hard to test, since our tagset is different from the one used by the TreeTagger).

More info on the la Repubblica corpus.

Read the readme file.

Take a look at the tagset.

Download the Italian ACOPOST Models archive (Readme and tagset files are included).
