C++
The code for the ongoing projects can be found in my GitHub and Bitbucket repos.
Some tools coded in C++:
- ELAN2split
splits ELAN annotation files into the time sequences annotated in a
specific tier. The tool generates a corpus of file pairs, i.e.
audio-file chunks from a time-aligned speech corpus, each with its
corresponding transcription. The pairs are intended for Hidden Markov
Model Toolkit (HTK) based speech tools, either to build forced aligners
or to train other types of speech recognition models. The C++11 code is
available at the Bitbucket Git repo.
- TreeBankParserSA
is a tool written in C++11 that extracts Context-free Grammar (CFG)
rules from treebanks in the Penn Treebank format. It can generate
Probabilistic Context-free Grammar (PCFG) formats for the Free
Linguistic Environment (FLE) with absolute counts and relative
frequencies. The frequencies can refer to the left-hand-side symbol or
to the particular extracted rule. One output format will also be
compact, using Finite State representations or the FLE-based Weighted
Finite State Transducer (WFST) representation.
- Free Linguistic Environment (FLE)
is a parser environment implemented in C++11/C++14. It focuses mainly on
compatibility with XLE and XFST for parsing based on the LFG formalism
(using the existing XFST morphologies and XLE grammars). It can also
parse with CFGs, PCFGs, etc. The implementation provides an environment
for working with Probabilistic LFG in the backbone (using PCFGs or
higher-level probabilistic grammars), and it allows modeling of
probabilistic relations between inputs, parse trees, and
f-representations. The morphological analyzer uses Foma and OpenFST. I
run a list for the development group, a closed Bitbucket Git repo (by
invitation), and a free and open repo with the final released code.
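
The rule extraction that TreeBankParserSA performs can be illustrated with a small sketch. The code below is not taken from TreeBankParserSA. It is a minimal C++11 example, written under the assumption that trees arrive as Penn-Treebank-style bracketed strings, and it shows one plausible way to collect absolute rule counts and frequencies relative to the left-hand-side symbol. All names (Rule, RuleCounts, extractRules, relativeFrequencies) are hypothetical, and real treebank details such as traces, function tags, and escaped brackets are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical types: a rule is (left-hand side, right-hand side),
// e.g. ("S", "NP VP"); RuleCounts holds absolute counts per rule.
using Rule = std::pair<std::string, std::string>;
using RuleCounts = std::map<Rule, int>;

// Advance pos past any whitespace.
static void skipSpace(const std::string& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
}

// Read a label or word: everything up to whitespace or a bracket.
static std::string readToken(const std::string& s, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < s.size() && s[pos] != '(' && s[pos] != ')' &&
           !std::isspace(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return s.substr(start, pos - start);
}

// Parse one constituent "(LABEL child ...)", count the rule
// LABEL -> children, and return LABEL so the parent can use it.
static std::string parseNode(const std::string& s, std::size_t& pos, RuleCounts& counts) {
    skipSpace(s, pos);
    if (pos >= s.size() || s[pos] != '(') throw std::runtime_error("expected '('");
    ++pos;
    const std::string label = readToken(s, pos);
    std::string rhs;
    for (;;) {
        skipSpace(s, pos);
        if (pos >= s.size()) throw std::runtime_error("unbalanced brackets");
        if (s[pos] == ')') { ++pos; break; }
        const std::string child =
            (s[pos] == '(') ? parseNode(s, pos, counts) : readToken(s, pos);
        if (!rhs.empty()) rhs += ' ';
        rhs += child;
    }
    // Skip the unlabeled wrapper "( (S ...) )" found in many treebank files.
    if (!label.empty()) ++counts[Rule(label, rhs)];
    return label;
}

// Extract absolute rule counts from one or more bracketed trees.
RuleCounts extractRules(const std::string& treebank) {
    RuleCounts counts;
    std::size_t pos = 0;
    skipSpace(treebank, pos);
    while (pos < treebank.size()) {
        parseNode(treebank, pos, counts);
        skipSpace(treebank, pos);
    }
    return counts;
}

// Relative frequency of each rule with respect to its left-hand side:
// count(A -> beta) / sum of counts of all rules with left-hand side A.
std::map<Rule, double> relativeFrequencies(const RuleCounts& counts) {
    std::map<std::string, int> lhsTotals;
    for (const auto& entry : counts) lhsTotals[entry.first.first] += entry.second;

    std::map<Rule, double> probs;
    for (const auto& entry : counts) {
        const int total = lhsTotals.at(entry.first.first);
        assert(total > 0);
        probs[entry.first] = static_cast<double>(entry.second) / total;
    }
    return probs;
}
```

Calling extractRules on a string of bracketed trees and passing the result to relativeFrequencies yields a PCFG-style table in which the probabilities of all rules sharing a left-hand side sum to one. Lexical rules such as NN -> dog are counted together with phrasal rules in this sketch; a real tool might keep them apart.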