Intro to Symbolic and Statistical NLP in Scheme: Code: N-Grams

Creating word frequency lists (Uni-gram model)

countwords1 (source)
Simple tokenization of text files and generation of word frequency lists, sorted by increasing and by decreasing frequency.
Derivational steps: countwords1a (source), countwords1b (source), countwords1c (source), countwords1d (source), countwords1e (source), countwords1f (source), countwords1g (source), countwords1h (source), countwords1i (source), countwords1j (source), countwords1k (source), countwords1l (source)
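As an illustration of the idea, here is a minimal sketch of whitespace tokenization and word counting. It is not the code in countwords1: the names tokenize, increment, count-words and sort-counts are made up for this example, the input is a string rather than a text file, and counts are kept in a plain association list so that only standard procedures are needed.

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.
;; Under R7RS, import (scheme base) and (scheme char) first.

;; Split a string into tokens at whitespace characters.
(define (tokenize str)
  (let loop ((chars (string->list str)) (current '()) (tokens '()))
    (cond ((null? chars)
           (reverse (if (null? current)
                        tokens
                        (cons (list->string (reverse current)) tokens))))
          ((char-whitespace? (car chars))
           (loop (cdr chars)
                 '()
                 (if (null? current)
                     tokens
                     (cons (list->string (reverse current)) tokens))))
          (else
           (loop (cdr chars) (cons (car chars) current) tokens)))))

;; Return a new association list with the count for key raised by one.
(define (increment key counts)
  (cond ((null? counts) (list (cons key 1)))
        ((equal? (caar counts) key)
         (cons (cons key (+ 1 (cdar counts))) (cdr counts)))
        (else (cons (car counts) (increment key (cdr counts))))))

;; Count tokens into an association list of (word . count) pairs.
(define (count-words tokens)
  (let loop ((tokens tokens) (counts '()))
    (if (null? tokens)
        counts
        (loop (cdr tokens) (increment (car tokens) counts)))))

;; Insertion sort of (word . count) pairs; before? compares two counts.
(define (sort-counts counts before?)
  (define (insert pair sorted)
    (cond ((null? sorted) (list pair))
          ((before? (cdr pair) (cdr (car sorted))) (cons pair sorted))
          (else (cons (car sorted) (insert pair (cdr sorted))))))
  (let loop ((rest counts) (sorted '()))
    (if (null? rest)
        sorted
        (loop (cdr rest) (insert (car rest) sorted)))))

;; Decreasing frequency list:
(sort-counts (count-words (tokenize "the cat saw the dog")) >)
;; => (("the" . 2) ("cat" . 1) ("saw" . 1) ("dog" . 1))
```

Passing < instead of > gives the increasing list. An association list keeps the sketch portable, but lookup is linear in the vocabulary size; for real corpora a hash table would be the more usual choice.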

countwords2 (source)
Like countwords1, but additionally all orthographic symbols are removed from the text and the tokens are lower-cased.
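A minimal sketch of such a normalization step follows. The name normalize is made up for this example, and the choice of what to keep (letters, digits and whitespace) is an assumption; countwords2 may draw the line differently.

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.
;; Under R7RS, import (scheme base) and (scheme char) first.

;; Drop every character that is not a letter, digit or whitespace,
;; and lower-case what remains.
(define (normalize str)
  (let loop ((chars (string->list str)) (kept '()))
    (cond ((null? chars) (list->string (reverse kept)))
          ((or (char-alphabetic? (car chars))
               (char-numeric? (car chars))
               (char-whitespace? (car chars)))
           (loop (cdr chars) (cons (char-downcase (car chars)) kept)))
          (else (loop (cdr chars) kept)))))

(normalize "The cat, the DOG!")
;; => "the cat the dog"
```

Note that this simple rule also deletes apostrophes and hyphens inside words, so "don't" becomes "dont".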

countwords3 (source)
Like countwords2, but stop-words (function words) are also removed from the frequency profile.
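Stop-word removal can be sketched as a simple filter over the token list. The procedure name remove-stop-words and the tiny list sample-stop-words are hypothetical and only serve the example; the real lists are the language files listed below.

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.

;; Hypothetical miniature stop-word list, for illustration only.
(define sample-stop-words '("the" "a" "of" "and"))

;; Return the tokens that do not occur in the stop-word list.
(define (remove-stop-words tokens stop-words)
  (cond ((null? tokens) '())
        ((member (car tokens) stop-words)
         (remove-stop-words (cdr tokens) stop-words))
        (else (cons (car tokens)
                    (remove-stop-words (cdr tokens) stop-words)))))

(remove-stop-words '("the" "cat" "and" "the" "dog") sample-stop-words)
;; => ("cat" "dog")
```

Since member compares with equal?, the tokens have to be lower-cased beforehand if the stop-word list is in lower case.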

mkstpwdlst (source)
Converts a text file of words into a Scheme list data structure.

english.scm (source)
English stop-words as a Scheme data structure.

danish.scm (source)
Danish stop-words as a Scheme data structure. Coded in UTF-8!

dutch.scm (source)
Dutch stop-words as a Scheme data structure. Coded in UTF-8!

french.scm (source)
French stop-words as a Scheme data structure. Coded in UTF-8!

german.scm (source)
German stop-words as a Scheme data structure. Coded in UTF-8!

italian.scm (source)
Italian stop-words as a Scheme data structure. Coded in UTF-8!

norwegian.scm (source)
Norwegian stop-words as a Scheme data structure. Coded in UTF-8!

portugese.scm (source)
Portuguese stop-words as a Scheme data structure. Coded in UTF-8!

spanish.scm (source)
Spanish stop-words as a Scheme data structure. Coded in UTF-8!

swedish.scm (source)
Swedish stop-words as a Scheme data structure. Coded in UTF-8!

Creating Bi-gram models

countbigrams1 (source)
Simple tokenization of text files and generation of decreasing bigram frequency profiles.
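The core of bigram counting can be sketched as follows, assuming the text has already been tokenized into a list of strings. The names bigrams, increment and count-items are made up for this example and need not match countbigrams1; sorting by decreasing count is left out to keep the sketch short.

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.

;; Pair every token with its right neighbour.
(define (bigrams tokens)
  (if (or (null? tokens) (null? (cdr tokens)))
      '()
      (cons (list (car tokens) (cadr tokens))
            (bigrams (cdr tokens)))))

;; Return a new association list with the count for key raised by one.
(define (increment key counts)
  (cond ((null? counts) (list (cons key 1)))
        ((equal? (caar counts) key)
         (cons (cons key (+ 1 (cdar counts))) (cdr counts)))
        (else (cons (car counts) (increment key (cdr counts))))))

;; Count items into an association list of (item . count) pairs.
(define (count-items items)
  (let loop ((items items) (counts '()))
    (if (null? items)
        counts
        (loop (cdr items) (increment (car items) counts)))))

(count-items (bigrams '("the" "cat" "saw" "the" "cat")))
;; => ((("the" "cat") . 2) (("cat" "saw") . 1) (("saw" "the") . 1))
```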

countbigrams2 (source)
Like countbigrams1, but additionally all orthographic symbols are removed from the text and the tokens are lower-cased.

countbigrams3 (source), countbigrams4 (source), countbigrams5 (source)

Language Identification with Tri-gram models

chartrigrams (source)
Creating trigram models on the character level.
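On the character level a trigram is simply every window of three consecutive characters, spaces included. A minimal sketch, with the hypothetical name char-trigrams (not necessarily what chartrigrams uses):

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.

;; Return all substrings of length three, from left to right.
(define (char-trigrams str)
  (let loop ((i 0) (result '()))
    (if (> (+ i 3) (string-length str))
        (reverse result)
        (loop (+ i 1) (cons (substring str i (+ i 3)) result)))))

(char-trigrams "hello")
;; => ("hel" "ell" "llo")
```

Strings shorter than three characters yield the empty list.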

trigraph (source)
Creating a trigraph language model for Language Identification (LID).

LID (source)
Language Identifier. Requires models generated with trigraph.scm and text files containing text samples for testing.
Detailed instructions
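To give a rough idea of how trigram-based identification can work, here is a toy sketch. It is an assumption, not a description of LID: each model is reduced to the list of character trigrams seen in a training sample, and a text is assigned to the language whose model contains the most of its trigrams. The names char-trigrams, score, identify and toy-models, as well as the two sample sentences, are made up for this example.

```scheme
;; Sketch only: uses standard R5RS/R7RS procedures.

;; Return all substrings of length three, from left to right.
(define (char-trigrams str)
  (let loop ((i 0) (result '()))
    (if (> (+ i 3) (string-length str))
        (reverse result)
        (loop (+ i 1) (cons (substring str i (+ i 3)) result)))))

;; Count how many trigrams of the text occur in the model,
;; where a model is a list of trigram strings.
(define (score text model)
  (let loop ((grams (char-trigrams text)) (hits 0))
    (cond ((null? grams) hits)
          ((member (car grams) model) (loop (cdr grams) (+ hits 1)))
          (else (loop (cdr grams) hits)))))

;; models is an association list of (language-name . trigram-list).
;; Return the name of the best-scoring language, or #f if there are no models.
(define (identify text models)
  (let loop ((models models) (best #f) (best-score -1))
    (if (null? models)
        best
        (let ((s (score text (cdr (car models)))))
          (if (> s best-score)
              (loop (cdr models) (car (car models)) s)
              (loop (cdr models) best best-score))))))

;; Hypothetical toy models built from one sentence each.
(define toy-models
  (list (cons "english" (char-trigrams "the cat and the dog"))
        (cons "german" (char-trigrams "der hund und die katze"))))

(identify "the dog" toy-models)    ; => "english"
(identify "die katze" toy-models)  ; => "german"
```

A practical identifier would weight trigrams by their frequency or probability in the training data instead of counting mere membership, and it would need far larger training samples.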

Various N-Gram models

average-mi (source)
Calculates the left and right average mutual information over lexical items.

average-re (source)
Calculates the overall left and right average mutual information over lexical items.

average-mi-brown (source)
Analyzing the Brown Corpus (tokens and their tags are separated by /) for token, tag, and other relations.
