Knorpora Software
I added the following extra programs and Perl modules to the basic Knoppix 3.3
distribution:
- WordNet: The famous lexical database for the English language.
- Natural Language Toolkit: A suite of Python libraries and programs for natural language processing, with an impressive collection of sample data that can also be used with the other programs.
- ACOPOST: A collection of Part-of-Speech Taggers, with pre-trained models to tag Italian text.
- TreeTagger: POS tagger and lemmatizer with pre-trained models to tag English, German, Italian and French.
- fnTBL Toolkit: Transformation-Based Learning toolkit, pre-trained to perform English POS tagging and NP and text chunking.
- FreeLing: A library to perform tokenization, sentence splitting, morphological analysis, NE detection and POS tagging, which comes with a simple command line interface and pre-trained models for English, Spanish and Catalan.
- ChaSen (and Ipadic): Tool and dictionary to perform tokenization and morphological analysis of Japanese text.
- BootCaT: Perl programs to extract specialized corpora and terms from the web.
- K-vec++: Implementation of the K-vec algorithm to extract candidate translations from parallel corpora.
- Ngram Statistics Package (NSP): Perl programs to extract n-grams from corpora and evaluate their association strength.
- UCS (Utilities for Cooccurrence Statistics): A toolkit of Perl and R programs for the analysis of cooccurrence statistics.
- SenseClusters: A complete unsupervised word sense discrimination system.
- The Bow Toolkit: Toolkit to perform document classification, retrieval and clustering, and other statistical text analysis tasks.
- Finite State Utilities: Jan Daciuk's tools to build and use finite state automata and transducers.
- kwic: Pete Whitelock's simple Perl concordancer.
- regexp_tokenizer.pl: A simple tokenizer.
- R: A very powerful statistical analysis environment.
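To give a rough idea of what "extracting n-grams and evaluating their association strength" means, here is a minimal Python sketch. It is not taken from NSP, UCS or any other package listed above: the function names `tokenize` and `bigram_pmi` are made up for this illustration, the tokenizer is deliberately crude, and pointwise mutual information is just one of many association measures one could choose.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every pair of adjacent tokens.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated by relative frequency. This is an illustrative choice,
    not the scoring used by any particular toolkit.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # avoid division by zero on tiny inputs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    ranked = sorted(bigram_pmi(tokenize(sample)).items(), key=lambda kv: -kv[1])
    for pair, score in ranked[:3]:
        print(pair, round(score, 2))
```

On a corpus of realistic size, the highest-scoring pairs tend to be collocations and fixed expressions; on a toy sentence like the one above the scores are dominated by rare words, which is a known weakness of PMI.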
Here is a list of all the Perl modules installed in Knorpora: the ones listed above, the ones that are already part of the standard Knoppix distribution, and the ones that I installed to satisfy dependencies.
I had originally planned to include more corpora, but I then
realized that it makes more sense for users to download corpora and
other resources in the languages that interest them than for me to
pick an arbitrary set of languages. There should be enough (English)
data in the NLTK directory to get started. For pointers to more
freely available data, please visit my NLP data link list.