Python
My PyPI modules:
More:
Some of the material is available on GitHub or Bitbucket.
I developed some Python, NLP, CL, and ML teaching material as IPython notebooks for
Jupyter. They will all be linked here eventually; here are some examples:
- Intro to Part-of-Speech Tagging (zip, jupyter nbviewer, Anaconda Cloud Notebook, GitHub repo)
- Intro to Hidden Markov Models (zip, jupyter nbviewer, Anaconda Cloud Notebook, GitHub repo)
- Intro to WordNet and NLTK (zip, jupyter nbviewer, GitHub repo)
- Topic Modeling with MALLET (zip, jupyter nbviewer, GitHub repo)
- Intro to the Forward Algorithm (zip, jupyter nbviewer)
- Intro to the Backward Algorithm (zip, jupyter nbviewer)
- …
I was porting some finite-state algorithms to Python 3 for a
more or less functional FST library for weighted finite-state transducers
in native Python, including, for example, code generation to C. I will place the
code on GitHub: Project PyFST
Here is some of the material from my Python classes and development work.
Some of it dates from the late 1990s, so it might be outdated and may not
work in Python 3.x.
Some of the Python examples and tutorials (slides and instruction
handouts) for corpus, data, and language processing have been adapted to Python 3.
- course material for JSSECL 2006
- course material for the DGfS/CL Fall School 2005
- Corpus processing tools (TEI XML from HTML, XML filtering, quantitative analysis)
- Language identification (LID) with n-gram models
- Orthography to IPA conversion for Croatian (with Malgorzata E. Cavar): see phonemic
- Parsing algorithms (Charty, Earley algorithm in Python, Scheme and JavaScript, and other computational syntax tools)
- TextStat.py: lightweight module with functions for creating and using n-gram models
for statistical analyses, various statistical functions, a chi-squared test,
vector space conversion of n-gram models, entropy and other information-theoretic
measures, etc. There are examples for document classification,
measures of text or model similarity, and various other useful functions.
- Finite State Automata (FSA) scripts: FSA class, automaton from word list, DOT (Graphviz) from automaton, etc.
- Mutual Information and Relative Entropy syntactic parsing (Python code base)
- Text 2 TEI XML with linguistic annotation
- Lithuanian, Croatian, ... finite state morphology (transducer, lemmatizer, feature annotation) (mostly in C++ now, see the FLE Project)
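
To give a rough idea of the kind of n-gram functionality listed for TextStat.py above, here is a minimal sketch of a character n-gram model with a Shannon entropy measure. It is not the TextStat.py code: the function names `char_ngrams`, `relative_frequencies`, and `entropy` are hypothetical and chosen only for this illustration, and it uses nothing beyond the Python 3 standard library.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def entropy(probabilities):
    """Shannon entropy in bits of a probability distribution given as a dict."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


# Build a bigram model from a toy string and measure its entropy.
model = relative_frequencies(char_ngrams("abracadabra", 2))
print(round(entropy(model), 3))
```

The same frequency dictionaries could serve as a starting point for the other uses mentioned above, for instance comparing two models for language identification, but that is left out here to keep the sketch short.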