Language Identification (LID) Examples
Damir Cavar
Last change: June 2011
The following algorithm and the code example are used for teaching and research purposes. I do not consider them appropriate for any kind of production environment. If you need help with such tools for production environments, contact me.
The code on this page is released under the GNU Lesser General Public License 3 (or newer). Feel free to use it for any purpose you like. However, keep in mind that I take no responsibility for any kind of data loss, or damage on your systems or computers whatsoever.
See the Wikipedia entry on Language Identification for more details on, and references to various types of LID approaches and algorithms.
CHANGES:
Thanks to Marco Tidona and Patrick Hall for pointing out some minor bugs in the prior online version. Their suggestions for corrections are now included in the files.
Description
The language ID tool presented here is simple but effective. It needs just a sentence or two to guess the language correctly, provided you have trained a good language model.
The principle is very simple. The tool has two parts. One is a generator for a language model (lidtrainer.py). The other is a recognizer that uses the language model (lid.py).
Language Model Generation
The language model generator reads in text files and extracts all tri-grams from the text, i.e. all sequences of three bytes (which correspond to characters in plain ASCII text). So, for example, from a text like:
John kissed Mary. John kissed Jane.
the language model trainer would extract the following tri-grams with the corresponding counts (= how often they were found in the text), by moving a window of three characters along the text:
Joh 2
ohn 2
hn 2
n k 2
ki 2
kis 2
iss 2
sse 2
sed 2
ed 2
d M 1
Ma 1
Mar 1
ary 1
ry. 1
y. 1
. J 1
Jo 1
d J 1
Ja 1
Jan 1
ane 1
ne. 1
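To make the extraction step concrete, here is a minimal sketch in Python (not the original lidtrainer.py code, just an illustration of the sliding three-character window and the counting):

from collections import Counter

def extract_trigrams(text):
    # slide a window of three characters over the text and count every tri-gram
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

counts = extract_trigrams("John kissed Mary. John kissed Jane.")
for trigram, count in counts.most_common():
    print(repr(trigram), count)   # 'Joh' 2, 'ohn' 2, ... 33 tri-grams in total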
It keeps track of the frequencies of the individual tri-grams over all documents. The frequencies are relativized by dividing each tri-gram count by the total number of tri-grams in the training corpus. The list of tri-grams is sorted by relative frequency (the probability of the tri-gram in the given corpus). The final list of tri-grams is printed out, and the output can be piped to a file. This file represents the language model for the Language ID tool.
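As a sketch, the relativization and output step could look like this (the tab-separated output format is an assumption here; the actual lidtrainer.py output format may differ):

from collections import Counter

def train_model(texts):
    # count all tri-grams over all training documents
    counts = Counter()
    for text in texts:
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    # relative frequency = tri-gram count / total number of tri-grams in the corpus
    return sorted(((t, c / total) for t, c in counts.items()),
                  key=lambda item: item[1], reverse=True)

# one "tri-gram <TAB> relative frequency" line per entry; piping this
# output to a file gives the language model
for trigram, freq in train_model(["John kissed Mary. John kissed Jane."]):
    print(trigram + "\t" + str(freq))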
Assuming that you have 99 German text documents, you use the language model trainer from the command line like this:
python lidtrainer.py german1.txt german2.txt ... german99.txt > German.dat
The language model will be stored as German.dat and the Language ID tool will use this model for recognition of text you provide to it.
Language Identification
Assuming that you generated several language models with the generator, for example German.dat, English.dat, Japanese.dat, and that you have stored them in the same folder as the Language ID tool (lid.py), you can now use the Language ID tool to identify some text stored in the file some.txt:
python lid.py some.txt
How does the Language ID algorithm work? It uses the language models you generated. All the files ending in .dat are loaded into memory. The text to be analyzed is then processed in the same way the language model generator processes the training text: all tri-grams are extracted and their relative frequencies are calculated, as described above. The set of tri-grams for the text file some.txt is then compared with the tri-grams in the language models.
Intuitively, the frequencies of tri-grams extracted from two different texts of the same language should be very similar, i.e. they should be close to each other. There are several ways to calculate this distance, i.e. the distance between one set of frequencies and another.
In the Language ID tool here we use a very simple one. For every tri-gram of the unknown text we look up its relative frequency in the language model and take the difference between the two relative frequencies as the distance for that tri-gram. These per-tri-gram distances are then summed up. If we compare one text with several language models, for example 10, we get 10 different values for the sum of distances. The sum of distances is our clustering criterion: the smallest value represents the best match for our text.
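A minimal sketch of this comparison step, assuming the language models are plain-text files with one tri-gram and its relative frequency per line, separated by a tab (the actual lid.py file format may differ):

import glob
import sys
from collections import Counter

def relative_frequencies(text):
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def load_model(path):
    # assumed format: one "tri-gram <TAB> relative frequency" entry per line
    model = {}
    with open(path, encoding="utf-8") as modelfile:
        for line in modelfile:
            trigram, freq = line.rstrip("\n").rsplit("\t", 1)
            model[trigram] = float(freq)
    return model

def identify(text, model_paths):
    freqs = relative_frequencies(text)
    best_language, best_distance = None, float("inf")
    for path in model_paths:
        model = load_model(path)
        # sum of absolute differences over the tri-grams of the unknown text
        distance = sum(abs(f - model.get(t, 0.0)) for t, f in freqs.items())
        if distance < best_distance:
            best_language, best_distance = path, distance
    return best_language, best_distance

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as textfile:
        print(identify(textfile.read(), glob.glob("*.dat")))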
To sum up, the algorithm calculates for every language model the sum of distances for all tri-grams that occur in the unknown text. Consider the example given above. If the following language model is one of the given models, with the relative frequencies of the tri-grams (= absolute frequency divided by the total number of tri-grams, here 33):
Joh 2/33
ohn 2/33
hn 2/33
n k 2/33
ki 2/33
kis 2/33
iss 2/33
sse 2/33
sed 2/33
ed 2/33
d M 1/33
Ma 1/33
Mar 1/33
ary 1/33
ry. 1/33
y. 1/33
. J 1/33
Jo 1/33
d J 1/33
Ja 1/33
Jan 1/33
ane 1/33
ne. 1/33
And, if the unknown text is:
Mary saw John.
With the following set of tri-grams:
Mar 1/12
ary 1/12
ry 1/12
y s 1/12
sa 1/12
saw 1/12
aw 1/12
w J 1/12
Jo 1/12
Joh 1/12
ohn 1/12
hn. 1/12
The per-tri-gram distances (differences of relative frequencies) to the model above would be:
Mar 1/12 - 1/33
ary 1/12 - 1/33
ry 1/12 - 0
y s 1/12 - 0
sa 1/12 - 0
saw 1/12 - 0
aw 1/12 - 0
w J 1/12 - 0
Jo 1/12 - 1/33
Joh 1/12 - 2/33
ohn 1/12 - 2/33
hn. 1/12 - 0
If we make sure that none of the single tri-gram distances is negative (for example by squaring each difference, or simply by taking its absolute value), we can sum up the single tri-gram distances and obtain a total distance to a language model. If we do this for every language model, we select the language model with the smallest distance value as the model that best represents the unknown text we are analyzing.
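For the toy example above, summing the absolute per-tri-gram differences gives a concrete total; a few lines of Python to check the arithmetic (the seven unmatched tri-grams each contribute their full 1/12):

# relative frequencies of the language model and of the unknown text from above
model = {"Joh": 2/33, "ohn": 2/33, "hn ": 2/33, "n k": 2/33, " ki": 2/33,
         "kis": 2/33, "iss": 2/33, "sse": 2/33, "sed": 2/33, "ed ": 2/33,
         "d M": 1/33, " Ma": 1/33, "Mar": 1/33, "ary": 1/33, "ry.": 1/33,
         "y. ": 1/33, ". J": 1/33, " Jo": 1/33, "d J": 1/33, " Ja": 1/33,
         "Jan": 1/33, "ane": 1/33, "ne.": 1/33}
unknown = {"Mar": 1/12, "ary": 1/12, "ry ": 1/12, "y s": 1/12, " sa": 1/12,
           "saw": 1/12, "aw ": 1/12, "w J": 1/12, " Jo": 1/12, "Joh": 1/12,
           "ohn": 1/12, "hn.": 1/12}

distance = sum(abs(freq - model.get(trigram, 0.0))
               for trigram, freq in unknown.items())
print(distance)   # about 0.79 for this toy example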
Generating models for further languages is just a matter of collecting some text samples and running lidtrainer.py on the files. Within a day one could generate hundreds of language models. The code is also code-page independent. You can prepare the training text in various encodings (Latin 1 and 2, or Unicode such as UTF-8) and train on all of them. This way LID can also be used as a code-page recognizer or classifier, in addition to language classification.
An implementation of this code in Scheme (Racket compatible) can be found here... Go to the section “Language Identification with Tri-gram models” and select the files “trigraph” and “LID”.
A much simpler and faster solution is to create such a character tri-gram model and simply sum the language-model probabilities of the tri-grams in the unknown text, or, to be mathematically cleaner, sum the logs of the probabilities under each trained model, and select the highest score as the winner (the target language). An implementation can be downloaded here...
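As a sketch of that scoring variant, assuming each model is simply a dictionary mapping tri-grams to probabilities (the downloadable implementation may differ in its details):

import math

def log_prob_score(text, model, unseen=1e-10):
    # sum the log probabilities of the text's tri-grams under one language model;
    # tri-grams missing from the model get a small floor probability (the value
    # of unseen is an arbitrary smoothing choice for this sketch)
    return sum(math.log(model.get(text[i:i + 3], unseen))
               for i in range(len(text) - 2))

def best_language(text, models):
    # models maps a language name to such a tri-gram probability dictionary;
    # the highest (least negative) score wins
    return max(models, key=lambda lang: log_prob_score(text, models[lang]))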
Support
If you have suggestions or comments, let me know. If you download this code and find it useful for teaching or your own code development, please take some time to let me know in a short mail how you use it. Your comments and support are important and very much appreciated! Thank you!
Download
See the Files section...