lidtrainer (version 0.2)
lidtrainer.py
(C) 2005 by Damir Cavar <dcavar@indiana.edu>
License:
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
Functionality:
Lidtrainer processes all the files given as parameters to the script in the
following way:
1. It extracts all tri-grams from all files.
2. It keeps track of the frequency of each tri-gram across all documents.
3. It prints the list of tri-grams, sorted by frequency/probability, to the
   screen. The output can be piped to a file. This file represents the
   language model for Lid.
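The three steps above can be sketched roughly as follows. This is an illustration only, not the code of lidtrainer.py, which is not shown on this page. It assumes that tri-grams are overlapping three-character windows, that no case or whitespace normalization is applied, and that the model is written as one tab-separated "tri-gram, probability" pair per line. The function names and the output format are hypothetical.

```python
from collections import Counter


def extract_trigrams(text):
    """Return every overlapping three-character window of text."""
    # Assumption: tri-grams are character tri-grams taken from the raw text.
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(texts):
    """Count tri-grams over all texts and return (trigram, probability) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(extract_trigrams(text))
    total = sum(counts.values())
    if total == 0:
        return []
    # Most frequent first; ties are broken alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(trigram, count / total) for trigram, count in ranked]


def format_model(model):
    """Render the model as 'trigram<TAB>probability' lines (hypothetical format)."""
    return "\n".join(f"{trigram}\t{prob:.6f}" for trigram, prob in model)
```

To mirror the command-line behaviour described above, one could read each file named in sys.argv[1:], pass the contents to train(), and print the result of format_model(), redirecting the output to a file to keep it as a model.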
Read about Lid to understand how this algorithm works.
Please send your comments and suggestions!
Data
__author__ = 'Damir Cavar'
__version__ = 0.20000000000000001
ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
digits = '0123456789'
hexdigits = '0123456789abcdefABCDEF'
letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
lowercase = 'abcdefghijklmnopqrstuvwxyz'
octdigits = '01234567'
printable = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
whitespace = '\t\n\x0b\x0c\r '
Author
Damir Cavar