Italian ACOPOST Models Readme ============================= http://sslmit.unibo.it/~baroni/tools_and_resources.html ACOPOST (http://sourceforge.net/projects/acopost) is an open source Part-of-Speech tagging toolkit, containing a transformation-based tagger, an example-based tagger, a HMM tagger and a maximum entropy tagger. This archive contains pre-trained models to tag Italian text using three of the ACOPOST taggers: the transformation-based tagger, the example-based tagger and the HMM tagger (for reasons we are trying to understand, the maximum entropy tagger causes a segmentation fault when it is run with our trained model). This is version 1.1 of our pre-trained models (small changes in the input files and in the tagset). CONTENTS OF ARCHIVE =================== The archive contains the following files: - Readme.ItaACOPOSTModels: this readme file; - ita_exclude_tags: list of tags that were excluded in training the et unknown word model; - ita_known.etf: feature file used to train et known word model; - ita_known_tree: et known word model; - ita_lex: lexicon file; - ita_tagset.txt: the tagset we used to tag the data; - ita_t3_ngrams: t3 model; - ita_tbt.log: STDERR output of tbt training; - ita_tbt_rules: tbt model; - ita_tbt_templates: template file used to train tbt; - ita_unknown.etf: feature file used to train et unknown word model; - ita_unknown_tree: et unknown word model; - simple_majority_tagger.pl: simple perl script to do majority tagging; TRAINING ======== We trained the models on a corpus of 180 manually tagged articles randomly selected from the years 1985-88 of the Italian national daily _la Repubblica_, for a total of about 107,500 tokens. Manual tagging was performed by Federica Comastri, who worked on texts that had been pre-tagged with the TreeTagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html). Training and testing were performed by Marco Baroni. Here are the commands we used for training. Before reading this, please take a look at the documentation that comes with the ACOPOST toolkit. We used version 1.8.4 of ACOPOST. Our manually tagged corpus is in the file ita_trainset.cooked (which we cannot release for copyright reasons): $ wc ita_trainset.cooked 3719 216759 1207825 ita_trainset.cooked We created the lexicon file in this way: $ cooked2lex.pl < ita_trainset.cooked > ita_lex 3719 sentences 51 tags 17405 types 109009 tokens 1 16421 94.346% 77822 71.390% 2 912 5.240% 21962 20.147% 3 71 0.408% 9179 8.420% 4 1 0.006% 46 0.042% Mean ambiguity A=1.371144 Entropy H(p)=4.073393 This is how we trained the t3 (HMM) tagger: $ cooked2ngram.pl < ita_trainset.cooked > ita_t3_ngrams We trained the tbt (transofrmation-based) tagger in the following way (the files ita_tbt_templates and ita_tbt.log are included in the ItaACOPOSTModels archive): $ tbt -l ita_lex -m 4 -n 1 -o 2 -t ita_tbt_templates ita_tbt_rules \ < ita_trainset.cooked 2> ita_tbt.log & We trained the et (example-based) tagger like this (the files ita_known.etf, ita_exclude_tags and ita_unknown.etf are also in the archive): $ cooked2wtree.pl -a 3 ita_known.etf < ita_trainset.cooked \ > ita_known_tree No. of features: 5 (from "../aco_input_files/ita_known.etf") No. of sentences: 3719 No of words: 109009 Most frequent words: 6076 "," 3869 "di" 3563 "." 2215 "e" 1941 "che" Word at rank 100: "poi" (102 occurances) Frequent word threshold: 102 Entropy: 4.073393 Features: 0 TAG[-2] H==3.322 IG==0.751 S==3.711 GR==0.202 1 TAG[-1] H==2.694 IG==1.379 S==3.665 GR==0.376 2 WORD[0] H==1.364 IG==2.710 S==3.706 GR==0.731 3 CLASS[0] H==0.142 IG==3.932 S==4.629 GR==0.849 4 CLASS[1] H==2.441 IG==1.633 S==4.544 GR==0.359 Permutation: 3 2 1 4 0 $ cooked2wtree.pl -b 2 -e ita_exclude_tags ita_unknown.etf \ < ita_trainset.cooked > ita_unknown_tree No. of features: 10 (from "../aco_input_files/ita_unknown.etf") No. of sentences: 3719 No of words: 109009 Most frequent words: 6076 "," 3869 "di" 3563 "." 2215 "e" 1941 "che" Word at rank 100: "poi" (102 occurances) Frequent word threshold: 102 Entropy: 4.073393 Features: 0 TAG[-1] H==0.255 IG==3.819 S==0.965 GR==3.957 1 CAP[0] H==0.316 IG==3.757 S==0.498 GR==7.545 2 NUMBER[0] H==0.372 IG==3.702 S==0.424 GR==8.723 3 HYPHEN[0] H==0.392 IG==3.681 S==0.409 GR==8.992 4 LETTER[0,1] H==0.284 IG==3.789 S==1.070 GR==3.540 5 LETTER[0,-4] H==0.340 IG==3.734 S==0.984 GR==3.794 6 LETTER[0,-3] H==0.316 IG==3.758 S==0.923 GR==4.073 7 LETTER[0,-2] H==0.288 IG==3.785 S==0.923 GR==4.102 8 LETTER[0,-1] H==0.321 IG==3.752 S==0.809 GR==4.636 9 CLASS[1] H==0.320 IG==3.754 S==1.030 GR==3.646 Permutation: 3 2 1 8 7 6 0 5 9 4 TESTING ======= The taggers and various tagger combinations were tested on the ita_trainset.cooked dataset in a series of 10-fold cross-validation experiments (in each fold we trained the models with nine tenths of the whole data, whereas the trained models we are providing were trained on the full dataset). The following table reports statistics about the percentage word-level accuracy achieved by the three taggers and by the best performing combinations in these experiments. HMM TBT ET COMB STCOMB Min 93.64 94.08 92.09 94.41 94.47 Median 95.31 95.46 94.20 95.82 95.88 Mean 95.04 95.20 93.95 95.62 95.61 Max 96.02 96.11 95.47 96.47 96.46 COMB is a majority voter that, in case of ties, picks the HMM tag. STCOMB (for STacked COMBination) is a majority voter that uses a HMM-TBT stack instead of TBT (i.e., the TBT tagger takes a corpus pre-tagged by the HMM tagger as input). STCOMB also picks the HMM tag in cases of ties. TAGGING ======= The corpus to be tagged should be in one-line per sentence format, and the tokens (including punctuation marks) should be separated by space. The corpus used for training is latin1-encoded, thus the corpus to be tagged should also be in this encoding. In order to tag a corpus using our models, follow the instructions in the ACOPOST documentation. For example: $ t3 ita_t3_ngrams ita_lex < corpus > t3_tagged_corpus $ tbt -l ita_lex -r ita_tbt_rules < corpus > tbt_tagged_corpus $ et ita_known ita_unknown ita_lex < corpus > et_tagged_corpus To tag with a t3/tbt stack: $ tbt -l ita_lex ita_tbt_rules < t3_tagged_corpus \ > t3_tbt_tagged_corpus We provide a simple script to perform majority tagging: $ simple_majority_tagger t3_tagged_corpus tbt_tagged_corpus \ et_tagged_corpus > majority_tagger_corpus The script only works with three corpora as argument. In case of ties, it chooses the tag assigned by the first of the three corpora. COPYRIGHT AND LICENSE INFORMATION ================================= Copyright 2004, SSLMIT/SITLEC The models are free resources. You may copy, edit or redistribute them under the same terms as the ACOPOST toolkit. FEEDBACK ======== baroni@sslmit.unibo.it