Tutorial Machine Learning and Language Acquisition: Code

General Comment

All code examples are written in Python 2.3 (and should be compatible with newer versions of Python). Mac OS X should already have a version of Python installed. If not, there is a developer CD that contains it. Linux machines usually include some version of Python as part of the system. If you do not have Python on your machine, which mainly applies to Windows users, follow these instructions to obtain a version for your system:

Windows specific
Python for everybody else

Code Examples

All code examples are published under the GNU General Public License, even where this is not stated explicitly in the code itself.

Use the examples via command line:

python example.py [textfile]
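As a rough illustration of the calling convention above, a script of this kind might read its text file as sketched below. This is not one of the tutorial scripts: the names read_tokens and main are hypothetical, whitespace tokenisation is an assumption, and the sketch targets Python 3 rather than the Python 2.3 of the original code.

```python
import sys

def read_tokens(path):
    """Return the whitespace-separated tokens of a text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()

def main(args):
    """args holds the command-line arguments without the script name."""
    if len(args) != 1:
        print("usage: python example.py [textfile]")
        return 1
    tokens = read_tokens(args[0])
    print(len(tokens), "tokens read from", args[0])
    return 0

if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the file reading in its own function makes it easy to reuse the same tokens for several of the analyses listed below.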

You will need: General code fragments used by some of the scripts. Copy this file to the same directory as the other files.

N-gram Extraction

Frequency profile of uni-grams 1
Frequency profile of uni-grams 2
Frequency profile of uni-grams without function words for English
Bi-grams of characters
N-grams (you specify the N)
Another N-grams tool (you specify the N)
Another N-grams tool for characters (you specify the N)
Uni-grams of characters
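The idea behind the tools above can be sketched in a few lines: an n-gram is a contiguous run of n items (words or characters), and a frequency profile is a count of those runs sorted by frequency. The following is a minimal sketch in Python 3, not the code of any of the listed scripts. The function names ngrams and frequency_profile are made up for this illustration.

```python
from collections import Counter

def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def frequency_profile(items, n):
    """Count n-grams and return (n-gram, count) pairs, most frequent first."""
    return Counter(ngrams(items, n)).most_common()

text = "the dog saw the cat and the dog ran"

# Word bi-grams: split the text on whitespace first.
word_bigrams = frequency_profile(text.split(), 2)

# Character uni-grams: treat the string (without spaces) as a sequence.
char_unigrams = frequency_profile(list(text.replace(" ", "")), 1)

print(word_bigrams[0])
```

The same two functions cover words and characters because both are handled as plain sequences. Only the tokenisation step differs.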

Mutual Information and Relative Entropy

Mutual Information of bi-grams
Mutual Information and Relative Entropy
Average Left and Right Point-wise Mutual Information for tokens (words)
Average Left and Right Point-wise Mutual Information for tokens and types using the Brown corpus (or any other with slash "dog/N" tag annotation)
Average Left and Right Point-wise Mutual Information for tokens and types 2 using the Brown corpus (or any other with slash "dog/N" tag annotation)
Average Left and Right Point-wise Mutual Information for type information only using the Brown corpus (or any other with slash "dog/N" tag annotation)
Relative Entropy over tokens
Relative Entropy over token and type pairs
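For orientation, the two measures used in this section can be written down directly. Point-wise mutual information of a bi-gram (x, y) is log2(P(x, y) / (P(x) P(y))), and the relative entropy of a distribution p with respect to q is the sum over x of p(x) log2(p(x) / q(x)). The sketch below assumes Python 3 and plain maximum-likelihood estimates without smoothing. The listed scripts may estimate the probabilities differently, and the function names are hypothetical.

```python
import math
from collections import Counter

def pointwise_mi(tokens):
    """Return {bi-gram: PMI in bits} from maximum-likelihood estimates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[left] / n_uni
        p_y = unigrams[right] / n_uni
        scores[(left, right)] = math.log2(p_xy / (p_x * p_y))
    return scores

def relative_entropy(p, q):
    """D(p || q) in bits; q must be nonzero wherever p is nonzero."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

scores = pointwise_mi("the dog saw the cat".split())
print(scores[("the", "dog")])
```

A positive PMI means the two words occur together more often than their individual frequencies would predict. Relative entropy is zero only when the two distributions agree.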

Language Identification

More information and the code with some language models can be found at:

LID: language identification over 3-grams
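One simple way to identify a language from character 3-grams is to count the 3-grams of a training text per language and then pick the language whose counts make the input most probable. The sketch below illustrates that idea in Python 3 with add-one smoothing. It is not the LID code itself, whose scoring method may differ, and the function names and toy training sentences are invented for this example.

```python
import math
from collections import Counter

def trigram_counts(text):
    """Count overlapping character 3-grams in a string."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(samples):
    """samples: {language: training text} -> {language: 3-gram counts}."""
    return {lang: trigram_counts(text) for lang, text in samples.items()}

def identify(text, models):
    """Return the language with the highest smoothed log-probability."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing, with one extra slot for unseen 3-grams.
        score = sum(
            n * math.log((counts[gram] + 1) / (total + len(vocab) + 1))
            for gram, n in trigram_counts(text).items()
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy training data, far too small for real use.
models = train({
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der hund und die katze sind im haus und der fuchs ist im wald",
})
print(identify("the dog", models))
```

Real language models need much more training text per language than the two toy sentences used here.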

Clustering Algorithms

K-means example by Jay Askren (Indiana University)
Documentation
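For readers who want the idea before opening the example, k-means alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. The following is a minimal Python 3 sketch and not Jay Askren's code. Starting from the first k points and running a fixed number of iterations are simplifying assumptions made here.

```python
def kmeans(points, k, iterations=20):
    """Cluster tuples of floats into k groups; return (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]  # naive start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, labels = kmeans(points, 2)
print(centroids, labels)
```

The result depends on the starting centroids, so practical implementations usually pick them at random and run the algorithm several times.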

MI- and RE-based Parsing

This code requires some additional files and a corpus in which the tokens are slash-annotated (e.g. dog/N), for example the Brown corpus. The tags must be used consistently across the corpus. Which tag set is chosen is otherwise irrelevant.

MI-RE Parser
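To make the slash-annotated input format concrete, the sketch below splits tokens such as dog/N into a word and a tag. It only illustrates reading the format in Python 3 and is not part of the MI-RE parser. The function names are hypothetical, and splitting at the last slash is an assumption that keeps words containing a slash intact.

```python
def split_tagged(token):
    """Split 'dog/N' into ('dog', 'N') at the last slash."""
    word, _, tag = token.rpartition("/")
    return word, tag

def read_tagged(line):
    """Return (word, tag) pairs for every slash-annotated token in a line."""
    return [split_tagged(tok) for tok in line.split() if "/" in tok]

pairs = read_tagged("the/DET dog/N barks/V")
print(pairs)
```

From such pairs, the token sequence and the type (tag) sequence can be extracted separately, which is the distinction the token and type measures above rely on.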