PART-OF-SPEECH TAGGING IN PRACTICE

La lepre ha fatto un salto. ("The hare made a jump.")

Tokenization:

    La lepre ha fatto un salto .

POS tagging:

    La      ART
    lepre   NOM
    ha      AUX
    fatto   VER
    un      ART
    salto   NOM
    .       PUN

Lemmatization:

    La      ART   il
    lepre   NOM   lepre
    ha      AUX   avere
    fatto   VER   fare
    un      ART   un
    salto   NOM   salto
    .       PUN   .

**********
tagwrapper.pl Documentation

This script performs tagging of English, German, Italian, French and Spanish by invoking the appropriate taggers and producing output in the format expected by CWB:

    The     DET   the
    dogs    N     dog
    ...     ...   ...

where, independently of the tagger output, the positional attributes are always arranged in the order: word pos lemma (tab delimited). The text elements are going to be present only if the -d option is used (see below).

The relevant taggers must be in the path of the user. They are:

    tree-tagger-english
    tree-tagger-german
    ita_tree_tagger_wrapper.pl
    analyzer

The tagsets are those used by these taggers, unless replacement tags are provided in the parameter files bundled in the __DATA__ section of the script.

Usage:

    tagwrapper.pl -l langcode [-d delimiter] inputfile > taggedoutput
    tagwrapper.pl -h | more

    -l langcode: one of en de it fr es
    -d delimiter: if a line begins with delimiter, the first string following delimiter is used as an id and a corresponding text element is introduced in the output
    -h: prints this information and quits

The script is controlled by various parameter files that are bundled at the bottom of the script in the __DATA__ section.

Copyright 2005, Marco Baroni and Sara Piccioni

This program is free software. You may copy or redistribute it under the same terms as Perl itself.
**********

For corpora built with our method, the delimiter string will be CURRENT URL.
For example:

    tagwrapper.pl -l it -d "CURRENT URL" corpus.txt > corpus.tgd

Taggers used via tagwrapper.pl:

    TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
    FreeLing: http://garraf.epsevg.upc.es/freeling/

Tagsets:

    Italian, German, English, French: see the TreeTagger page
    Spanish: http://sslmit.unibo.it/~baroni/termsett/05_1/spanishtags.txt
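
The tab-delimited word / pos / lemma output described in the tagwrapper.pl documentation can be read back with a few lines of Python. The following is only a minimal sketch, not part of tagwrapper.pl: the names Token and read_vertical are hypothetical, and it assumes that any line without exactly three tab-separated fields (for instance a text element introduced by -d, whose exact form is not shown here) can simply be skipped.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One line of tagged output: word, POS tag, lemma (hypothetical helper type)."""
    word: str
    pos: str
    lemma: str


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line made of three tab-separated fields."""
    for raw in lines:
        fields = raw.rstrip("\r\n").split("\t")
        # Assumption: structural or blank lines never have exactly three
        # tab-separated fields, so they are skipped here.
        if len(fields) != 3:
            continue
        yield Token(*fields)


# The lemmatized example sentence from the slide, as tab-delimited lines.
# The first entry is a hypothetical placeholder for a non-token line.
sample = [
    "placeholder for a structural line\n",
    "La\tART\til\n",
    "lepre\tNOM\tlepre\n",
    "ha\tAUX\tavere\n",
    "fatto\tVER\tfare\n",
    "un\tART\tun\n",
    "salto\tNOM\tsalto\n",
    ".\tPUN\t.\n",
]

tokens = list(read_vertical(sample))
lemmas = [t.lemma for t in tokens]
verbs = [t.word for t in tokens if t.pos in ("VER", "AUX")]
print(lemmas)
print(verbs)
```

To run the same function on a real file such as corpus.tgd, pass it an open file handle instead of the sample list; the correct encoding depends on how the input corpus was saved.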