NAME

regexp_tokenizer.pl: a simple tokenizer based on regular expressions specified in a parameter file.


SYNOPSIS

    regexp_tokenizer.pl -h
    regexp_tokenizer.pl -v
    regexp_tokenizer.pl -t token_regexps input > tokenized_output
    regexp_tokenizer.pl -t token_regexps -r replacement_patterns input > tokenized_output
    regexp_tokenizer.pl -i -t token_regexps input > tokenized_output
    regexp_tokenizer.pl -t token_regexps -s "^[\.\?\!]+$" input > tokenized_output
    regexp_tokenizer.pl -i -t token_regexps -r replacement_patterns -s "^[\.\?\!]+$" input > tokenized_output


INSTALLATION

Installation is trivial.

Go to http://sslmit.unibo.it/~baroni/regexp_tokenizer.html and download the regexp_tokenizer-VERSION_NUMBER.tar.gz archive.

Unpack it:

    $ tar xvzf regexp_tokenizer-VERSION_NUMBER.tar.gz

Make the script executable:

    $ chmod +x regexp_tokenizer-VERSION_NUMBER/regexp_tokenizer.pl

If you want, add the relevant directory to your PATH variable, so that you can call the tokenizer from anywhere without having to specify the path to the script.

If you use tcsh, add something like the following line to the .tcshrc file:

    setenv PATH "${PATH}:/home/marco/sw/regexp_tokenizer-VERSION_NUMBER"

If you use the bash shell, add something like the following line to .bashrc:

    PATH=$PATH:/home/marco/sw/regexp_tokenizer-VERSION_NUMBER

That's it!


DESCRIPTION

This is a tokenizer that splits a text into tokens on the basis of a set of regular expressions that are specified by the user in a parameter file.

In this way, the tokenizer can be customized for different languages and/or tokenization purposes.

Moreover, the user can provide a list of regular expression + replacement pattern pairs, specifying strings that must be modified before applying tokenization.

Also, all upper case characters in the ASCII/Latin-1 range can be converted to lower case before tokenization.

Finally, the user can specify a regular expression describing end-of-sentence tokens, in which case the tokenizer is going to split the input into sentences, and the sentences into tokens.

When no end-of-sentence pattern is specified, output is in one-token-per-line format; when there is an end-of-sentence pattern, output is in one-sentence-per-line format, with tokens delimited by single spaces.

The main weakness of this tokenizer (besides the fact that it is virtually untested...) lies in the fact that it is not possible to do context-sensitive tokenization. For example, there is no way to formulate rules such as: treat this period as part of the previous token if the word that follows begins with a lower-case letter. I hope that in future versions I will be able to support this feature, at least in some limited, hacky way.

The basic algorithm I implement is the same implemented in the count.pl script of the NSP toolkit (http://www.d.umn.edu/~tpederse/nsp.html) written by Ted Pedersen, Satanjeev Banerjee and Amruta Purandare: many thanks to them for the idea!

I decided to write a different program because I needed to do tokenization for tasks other than ngram counting. Moreover, I wanted to be able to transform strings before applying tokenization, and to spot end-of-sentence markers.

Two important caveats:

1) This is version 0.01 of the script, and I mean it! This is very, very preliminary and testing has been minimal: do not be surprised if even some of the basic functionalities described in the documentation do not work.

2) The script was not designed with efficiency in mind and it could easily turn out to be too slow for your task. All the expected factors will affect efficiency: more data, longer parameter files, less computing power...

The Tokenization Algorithm

This is the basic tokenization algorithm (the only difference with respect to the count.pl algorithm is that I take a single line at a time, rather than the whole input at once, as my processing buffer):

    - For each line of input:
       - Copy line to current_string;
       - While current_string is not empty:
          - For each regexp in regexp_list:
             - If regexp is matched by a substring at the left edge
               of current_string:
                - Treat the substring that matched the regexp as a token
                  (printing it on its own line);
                - Remove the matching substring from current_string;
                - Quit this for loop;
          - If no regexp matched:
             - Remove the first character from current_string;

There are two important things to notice, regarding this algorithm.

First, the inner loop goes through the regexp list and stops as soon as it finds a matching regexp. Thus, the order in which the regexps are listed in the token regexp file matters: if two or more regexps match at the left edge of the input, only the first one will be applied.

Second, when no regexp matches, the first character is dropped and the search re-starts. This means that characters that do not fit into some regular expression will simply be ignored. For example, if no regular expression matches whitespace, whitespace will be entirely discarded. If no regular expression matches non-alphabetic characters, then only alphabetic characters will be kept.
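
In Perl terms, the core loop might look roughly like this (just a sketch, assuming the token regexps are already available in a list @regexps; this is not the script's actual code):

    # minimal sketch of the matching loop, not the script's actual code;
    # assumes @regexps holds the token regexps in file order
    while (my $current_string = <STDIN>) {
        chomp $current_string;
        while (length($current_string) > 0) {
            my $matched = 0;
            foreach my $re (@regexps) {
                # try the regexp at the left edge of current_string
                if ($current_string =~ s/^($re)//) {
                    print "$1\n";   # the matching substring is a token
                    $matched = 1;
                    last;           # stop at the first matching regexp
                }
            }
            # no regexp matched: drop the first character and try again
            $current_string = substr($current_string, 1) unless $matched;
        }
    }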

The Token Regexp File

The file listing the regular expressions to be used for tokenization must be in one-regular-expression-per-line format. Empty lines and lines beginning with # are ignored (so that one can add comments).

Most valid Perl regular expressions should work (unfortunately, backreferences such as \1 will not).
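
For concreteness, such a file could be read with something like this (a sketch, with $token_regexp_file as an assumed name; not the script's actual code):

    # read the token regexps, one per line, skipping comments and empty lines
    open(my $fh, "<", $token_regexp_file) or die "cannot open file: $!";
    my @regexps;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq "";        # skip empty lines
        next if $line =~ /^#/;      # skip comment lines
        push @regexps, qr/$line/;   # compile the regular expression
    }
    close($fh);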

See the file ital_reg_exps.txt (which is part of the tar.gz archive) for a realistic example.

A simple English regexp file could contain these two lines:

    [a-zA-Z']+
    [\.,;:\?!]

Given this regexp file and the following input:

    John's friends are: Frank, Donna and me.

output will be:

    John's
    friends
    are
    :
    Frank
    ,
    Donna
    and
    me
    .

Order matters. Consider this input:

    John's friends are: Frank, Donna and Mr. Magoo.

Suppose that we add a Mr\. regexp at the beginning of the regexp file:

    Mr\.
    [a-zA-Z']+
    [\.,;:\?!]

Then, output will be:

    John's
    friends
    are
    :
    Frank
    ,
    Donna
    and
    Mr.
    Magoo
    .

The regular expression Mr\. matches at the left edge of the string Mr. Magoo, so the token Mr. is constructed, and the other two regular expressions are not applied to this substring.

If, instead, we list Mr\. AFTER the other two regexps, or even between the two, the output will be:

    John's
    friends
    are
    :
    Frank
    ,
    Donna
    and
    Mr
    .
    Magoo
    .

This happens because Mr matches the [a-zA-Z']+ regular expression, so a token Mr is constructed and this substring is removed from the input. At this point, the leftover substring is ``. Magoo.'' and the Mr\. regexp no longer matches anything.

Notice that tokens specified by regexp patterns can contain whitespace. If our token regexp file contained the regexp Mr\. Magoo at the beginning of the list, the output would be:

    John's
    friends
    are
    :
    Frank
    ,
    Donna
    and
    Mr. Magoo
    .

See below on how this interacts with one-sentence-per-line output format.

Replacements

A file containing replacement patterns can be passed to the tokenizer via the -r option.

As in the token regexp file, comment lines beginning with # and empty lines are ignored.

All the other lines of the file must contain two tab-delimited fields: a regular expression target and a replacement string.

Each pair is interpreted as the left and right side of a perl global replacement command. These global replacements are applied, in order, to each input line before the basic tokenization algorithm.

For example, suppose that we use a replacement file containing the following lines:

    [0-9]+      NUM
    [A-Z][a-z]+ CAP

Then, for each line, the script will perform the following global replacements before tokenizing:

    s/[0-9]+/NUM/g;
    s/[A-Z][a-z]+/CAP/g;
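
A sketch of how such a file could be read and applied ($replacement_file, @replacements and $current_string are assumed names; this is not the script's actual code):

    # read tab-delimited target/replacement pairs, in file order
    open(my $fh, "<", $replacement_file) or die "cannot open file: $!";
    my @replacements;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq "" or $line =~ /^#/;
        # the -1 limit keeps an empty replacement field (deletion rules)
        my ($target, $replacement) = split /\t/, $line, -1;
        push @replacements, [$target, $replacement];
    }
    close($fh);

    # later, applied in order to each input line, before tokenization
    foreach my $pair (@replacements) {
        $current_string =~ s/$pair->[0]/$pair->[1]/g;
    }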

Some things to keep in mind:

1) The replacements apply BEFORE tokenization, so tokenization patterns have to be designed with the output of the replacement phase in mind (e.g., given the replacement patterns above, it would make no sense to have regexps referring to digits in the token regexp file).

2) Order matters: If the first replacement pattern in the file gets rid of all digits, then a later replacement pattern targeting digits will never match anything.

3) All replacements are applied, one after the other, to each input string. Thus, the output of a replacement will constitute the input of the next replacement. If a replacement pattern with target NUM followed the first rule in the example above, then this replacement would also be applied to all instances of NUM created by the first rule (notice the difference from the application of regexps during tokenization, where substrings matching a regexp are immediately removed from the input buffer, and thus are not compared to the following regexps).

4) You can also use the replacement file to specify target strings to be deleted. Simply use the empty string as the replacement string. In other words, a line containing a regexp followed by a tab (and then nothing) is interpreted as an instruction to remove all strings matching the regexp from the input. E.g., use ``<[^>]+>'' followed by a tab to remove all XML/HTML tags from the input before applying tokenization. It is important to remember to add the tab, even if it is not followed by anything.

5) Unfortunately, at the moment the replacement string cannot contain ``matched variables'' ($1, $2, ...). This would be a very powerful feature, and I hope to find out how to implement it in the future.

Sentences

In my experience, tokenization tasks fall into two broad classes: those where we only care about identifying tokens (e.g., various unigram frequency collection tasks), and those where we also want to identify sentence boundaries (e.g., preparing data for POS tagging).

Thus, I provide two output formats for the tokenizer: one-token-per-line, which is typically the handiest format when sentence boundaries do not matter, and one-sentence-per-line-with-space-delimited-tokens, for cases where sentence boundaries do matter.

If you are interested in sentence boundaries, you will have to use the -s option followed by a regular expression identifying end-of-sentence markers.

This regexp will be applied to tokens once they are identified.

Thus, -s "[\.\?\!]" will treat any token containing a period, a question mark or an exclamation mark as an end-of-sentence mark. On the other hand, -s "^[\.\?\!]+$" will match only tokens that are entirely made of punctuation marks (a better choice, typically).
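
In code terms, each identified token could be checked against this regexp along the following lines (just a sketch; $sent_regexp and @sentence are assumed names, not the script's actual variables):

    # group tokens into sentences in one-sentence-per-line mode
    push @sentence, $token;
    if ($token =~ /$sent_regexp/) {
        print join(" ", @sentence), "\n";   # one sentence per line
        @sentence = ();
    }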

Sentence marker detection takes place after a token is identified. Thus, good sentence boundary detection will depend on good integration between what you put in the token regexp file and the sentence marker regexp.

For example, suppose that we have this input:

    Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah.

Let us consider a few alternative token regexp files.

We start with tok1:

    [a-zA-Z]+

With this token regexp file, periods will be discarded, and thus the following reference to the period as a sentence marker is useless:

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
    regexp_tokenizer.pl -t tok1 -s "\." -
    Mr Magoo went to U C L A for his Ph D degree Blah

Let's try tok2:

    [a-zA-Z]+
    \.

This time, the periods are preserved, but any period is identified as an end-of-sentence marker:

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
    regexp_tokenizer.pl -t tok2 -s "\." -
    Mr .
    Magoo went to U .
    C .
    L .
    A .
    for his Ph .
    D .
    degree .
    Blah .

Now we add Mr., U.C.L.A. and Ph.D. as tokens to tok3:

    Mr\.
    U\.C\.L\.A\.
    Ph\.D\.
    [a-zA-Z]+
    \.
    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
    regexp_tokenizer.pl -t tok3 -s "\." -
    Mr.
    Magoo went to U.C.L.A.
    for his Ph.D.
    degree .
    Blah .

Better. However, since our sentence marker regexp specified that it is sufficient for a token to contain a period in order to be considered a sentence marker, Mr., U.C.L.A. and Ph.D. were treated as sentence markers.

One more try, this time specifying that the sentence marker is a period (as opposed to: contains a period):

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree. Blah." |\
    regexp_tokenizer.pl -t tok3 -s "^\.$" -
    Mr. Magoo went to U.C.L.A. for his Ph.D. degree .
    Blah .

Good, that's what we wanted!

Notice that if you have token regexps containing whitespace and you are in sentence detection mode, tokens containing whitespace become indistinguishable from sequences of regular tokens. For example, suppose that we use the regexp file tok4:

    Mr\. Magoo
    U\.C\.L\.A\.
    Ph\.D\.
    [a-zA-Z]+
    \.

Without the -s option we get:

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
    regexp_tokenizer.pl -t tok4 -
    Mr. Magoo
    went
    to
    U.C.L.A.
    for
    his
    Ph.D.
    degree
    .

However, in one-sentence-per-line format we get:

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
    regexp_tokenizer.pl -t tok4 -s "^\.$" -
    Mr. Magoo went to U.C.L.A. for his Ph.D. degree .

Here, the fact that Mr. Magoo is a single token, whereas, say, went to is not, is no longer visible in the output.

One way out of this problem is to identify multi-word tokens in the replacement stage and to replace the inner spaces with a special symbol, like in the following example.

Replacement file rep1:

    Mr\. Magoo  Mr._Magoo

Token regexp file tok5:

    Mr\._Magoo
    U\.C\.L\.A\.
    Ph\.D\.
    [a-zA-Z_]+
    \.

Now, we get:

    $ echo "Mr. Magoo went to U.C.L.A. for his Ph.D. degree." |\
    regexp_tokenizer.pl -t tok5 -r rep1 -s "^\.$" -
    Mr._Magoo went to U.C.L.A. for his Ph.D. degree .

where the fact that Mr. Magoo is a single token is signaled by the underscore connecting its two elements.

In general, replacing the inner whitespace of multi-word tokens with another symbol is probably a good idea anyway.

This last example also shows that sentence marker detection takes place after the replacements from the replacement file are applied. One must keep this in mind when designing both the replacements and the sentence marker expression.

Finally, notice that the tokenizer assumes that sentences cannot cross newlines -- in other words, end-of-line is always treated as end-of-sentence, even if the last token on the line was not a sentence marker.

Lower-Casing

Sometimes, it is a good idea to turn all words to lower case -- for example, if you are collecting frequencies you probably want to treat The and the as two instances of the same token.

Thus, I provide the -i (for case Insensitive) option. If you use this option, all alphabetic characters in the ASCII/Latin-1 range will be turned to lower case before anything else is done.
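
In Perl terms, this step amounts to something like the following (a sketch under the Latin-1 assumption, not necessarily the script's actual code):

    # lower-case ASCII and Latin-1 upper-case letters; the ranges skip
    # 0xD7 (the multiplication sign), which has no lower-case counterpart
    $current_string =~ tr/A-ZÀ-ÖØ-Þ/a-zà-öø-þ/;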

Thus, for example, using regexp file tok6:

    [a-z]+

and the -i option we get:

    $ echo "I Used To Make HEAVY Use of CAPITALIZATION" |\
    regexp_tokenizer.pl -i -t tok6 -
    i
    used
    to
    make
    heavy
    use
    of
    capitalization

Lower-casing happens before anything else -- keep this in mind when preparing the replacement and token regexp files. For example, a Mr\. Magoo token regexp will be of no use if lower-casing transformed the relevant string into mr. magoo.

This also means that, by using replacements, one can re-insert upper case words after lower-casing. For example, one could turn all letters to lower case via the -i option, but specify that all digit sequences are to be replaced with NUM in the replacement file. Since replacements are applied after lower-casing, the NUM string would not be affected by lower-casing.

As I said, the -i switch assumes that the input is in Latin-1. If you are working with a different encoding, you will have to change the relevant part of the code, or do lower-casing via replacement patterns.

Summary

In short, the processes take place in the following order:

    - lower-casing (optional, use -i)
    - replacements (optional, use -r replacement_file)
    - tokenization (mandatory, use -t token_regexp_file)
    - sentence boundary detection (optional, use -s "SENT_MARK_EXP")
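
Putting the sketches from the previous sections together, the per-line flow looks roughly like this ($lower_case, @replacements and the other names are assumed, not the script's actual variables):

    # a sketch of the per-line processing order; not the script's actual code
    while (my $current_string = <STDIN>) {
        chomp $current_string;
        # 1) lower-casing (-i)
        $current_string =~ tr/A-ZÀ-ÖØ-Þ/a-zà-öø-þ/ if $lower_case;
        # 2) replacements (-r), applied in file order
        foreach my $pair (@replacements) {
            $current_string =~ s/$pair->[0]/$pair->[1]/g;
        }
        # 3) tokenization (-t) and 4) sentence boundary detection (-s),
        #    as sketched in the earlier sections
    }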


AUTHOR

Marco Baroni, marco baroni AT unitn it


BUGS

Probably many. If you find one, please let me know: marco baroni AT unitn it


COPYRIGHT

Copyright 2004, Marco Baroni

This program is free software. You may copy or redistribute it under the same terms as Perl itself.


SEE ALSO

NSP Toolkit: http://www.d.umn.edu/~tpederse/nsp.html