Knorpora Software

I added the following extra programs and Perl modules to the basic Knoppix 3.3 distribution:

Programs

WordNet
The famous lexical database for the English language.
Natural Language Toolkit
A suite of Python libraries and programs for natural language processing, with an impressive collection of sample data that can also be used with the other programs.
ACOPOST
A collection of Part-of-Speech Taggers, with pre-trained models to tag Italian text.
TreeTagger
POS tagger and lemmatizer with pre-trained models to tag English, German, Italian and French.
fnTBL Toolkit
Transformation-Based Learning toolkit, pre-trained to perform English POS tagging and NP and text chunking.
FreeLing
A library for tokenization, sentence splitting, morphological analysis, named entity (NE) detection and POS tagging, which comes with a simple command line interface and pre-trained models for English, Spanish and Catalan.
ChaSen (and Ipadic)
Tool and dictionary to perform tokenization and morphological analysis of Japanese text.
BootCaT
Perl programs to extract specialized corpora and terms from the web.
K-vec++
Implementation of the K-vec algorithm to extract candidate translations from parallel corpora.
Ngram Statistics Package (NSP)
Perl programs to extract n-grams from corpora and measure their association strength.
UCS (Utilities for Cooccurrence Statistics)
A toolkit of Perl and R programs for the analysis of cooccurrence statistics.
SenseClusters
A complete unsupervised word sense discrimination system.
The Bow Toolkit
Toolkit to perform document classification, retrieval and clustering, and other statistical text analysis tasks.
Finite State Utilities
Jan Daciuk's tools to build and use finite state automata and transducers.
kwic
Pete Whitelock's simple Perl concordancer.
regexp_tokenizer.pl
A simple tokenizer.
R
A very powerful statistical analysis environment.
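
To give a rough idea of what "extracting n-grams and evaluating their association strength" means in practice, here is a minimal Python sketch. It is not how NSP, UCS or any other program listed above is implemented or invoked; the function names `tokenize` and `bigram_pmi` are made up for this illustration, and pointwise mutual information is just one of many possible association measures.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bigram_pmi(tokens):
    """Map each adjacent token pair to its pointwise mutual information (base 2).

    Illustrative only: probabilities are plain relative frequencies,
    with no smoothing and no frequency cutoff.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    ranked = sorted(bigram_pmi(tokenize(sample)).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 3))
```

On real corpora, rare pairs get inflated PMI scores, so a minimum frequency threshold or a different measure (for example log-likelihood) is usually preferable.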

Perl Modules

HTML::FormatText
A module to format HTML as plain text.
Net::Google
Simple OO-ish interface to the Google SOAP API.
XML::Twig
A cool way to process XML documents.
WordNet::QueryData
A Perl interface to WordNet.
WordNet::Similarity
Perl modules and programs to compute WordNet-based similarity measures.
PDL
Number crunching capabilities for Perl.

Here is a list of all the Perl modules installed in Knorpora: the ones listed above, the ones that are already part of the standard Knoppix distribution, and the ones that I installed to satisfy dependencies.

I had originally planned to include more corpora, but I then realized that it makes more sense for users to download corpora and other resources in the languages that interest them than for me to pick an arbitrary set of languages. There should be enough (English) data in the NLTK directory to get started. For pointers to more freely available data, please visit my NLP data link list.
