CroMo is a finite state machine (FSM) parser and annotator for Croatian morphology. In a single pass it segments words into morphemes, generates a feature set for each morpheme, and produces lemmata for the lexical root and base. It thus combines a morphological analyzer, tagger, and lemmatizer in one finite state transducer (FST).
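To make the kind of output described above concrete, here is a minimal Python sketch. It is not CroMo's implementation: it uses a brute-force root+suffix split over a tiny hand-written lexicon instead of a compiled transducer, and the lexicon entries, the feature names, and the function `analyze` are all assumptions made for illustration (the feature names are not actual GOLD identifiers).

```python
# Toy lexicon, invented for illustration; not CroMo's morpheme base.
ROOTS = {
    "knjig": {"lemma": "knjiga", "features": {"pos": "Noun", "gender": "Feminine"}},
    "ruk": {"lemma": "ruka", "features": {"pos": "Noun", "gender": "Feminine"}},
}

# Each suffix maps to one or more readings (illustrative labels only).
SUFFIXES = {
    "u": [{"case": "Accusative", "number": "Singular"}],
    "om": [{"case": "Instrumental", "number": "Singular"}],
    "ama": [
        {"case": "Dative", "number": "Plural"},
        {"case": "Locative", "number": "Plural"},
        {"case": "Instrumental", "number": "Plural"},
    ],
}


def analyze(word):
    """Return every root+suffix analysis licensed by the toy lexicon."""
    analyses = []
    # Try every split point; keep those where both halves are known morphemes.
    for split in range(1, len(word)):
        root, suffix = word[:split], word[split:]
        if root in ROOTS and suffix in SUFFIXES:
            for reading in SUFFIXES[suffix]:
                analyses.append({
                    "segments": [root, suffix],
                    "lemma": ROOTS[root]["lemma"],
                    "root_features": ROOTS[root]["features"],
                    "suffix_features": reading,
                })
    return analyses


# An ambiguous form yields one analysis per reading of its suffix.
for analysis in analyze("knjigama"):
    print("+".join(analysis["segments"]), analysis["lemma"], analysis["suffix_features"])
```

A real finite state transducer encodes the same root and suffix knowledge as states and transitions, so that segmentation, feature annotation, and lemmatization come out of a single left-to-right traversal of the input rather than a search over split points.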

The code is based on:
  • C++ and pure C for the final automaton
  • Scheme, Python, and Ragel for code generation
  • GOLD-2008 OWL data for standardized feature labels

The C++/C code is highly optimized with respect to memory, speed, and development and testing cycles.
  • It is fast: approx. 50,000 tokens per second are morphologically segmented, with each morpheme feature-annotated, on a common Intel Core 2 Duo 2 GHz CPU.
  • It is memory efficient: the binary is smaller than 5 MB, it needs little more than that at runtime, and its runtime memory demand is constant.
  • It is platform independent: binaries are available for common Unix and Linux distributions, Mac OS X 10.5 (Leopard), and various versions of Microsoft Windows.

For various scenarios we provide:
  • a monolithic binary version (one compact finite state automaton, FSA)
  • a balanced distributed version (based on OpenMP)
  • a server version that eliminates load and binary instantiation time and communicates with your software environment over TCP/IP sockets (or web service protocols)
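The idea behind the server version can be sketched as follows: a long-running process loads the analyzer once, and clients send tokens over a socket instead of starting a new binary for every request. The sketch below is hypothetical throughout. The line-based protocol (one token per line in, one tab-separated result line out), the placeholder `analyze` function, and the helper names `start_server` and `query` are assumptions for illustration; the actual CroMo wire protocol is not described here.

```python
import socket
import socketserver
import threading


def analyze(token):
    # Placeholder standing in for the real, already loaded analyzer.
    return f"{token}\tANALYSIS-PLACEHOLDER"


class AnalyzerHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Hypothetical protocol: one token per line, one result line back.
        for raw in self.rfile:
            token = raw.decode("utf-8").strip()
            if not token:
                break
            self.wfile.write((analyze(token) + "\n").encode("utf-8"))


class AnalyzerServer(socketserver.ThreadingTCPServer):
    daemon_threads = True
    allow_reuse_address = True


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = AnalyzerServer((host, port), AnalyzerHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(host, port, tokens):
    """Send tokens over one connection and collect one result line per token."""
    results = []
    with socket.create_connection((host, port)) as conn, \
            conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
        for token in tokens:
            stream.write(token + "\n")
            stream.flush()
            results.append(stream.readline().rstrip("\n"))
    return results


# Usage: the server is started once, then queried as often as needed.
server = start_server()
host, port = server.server_address
print(query(host, port, ["knjigama", "rukom"]))
server.shutdown()
server.server_close()
```

The design point is that the start-up cost is paid once by the server process, so each client request only costs a socket round trip.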

The code and development environment are applicable to various language types, i.e. they are not bound to Croatian, nor to any single language.

The Croatian lexical basis of CroMo is easily extensible, i.e. it can be adapted to diachronic, synchronic, and dialectal variation. Extending the morpheme base has almost no impact on processing speed, while compression into the binary representation minimizes the persistent and runtime memory requirements for one language. In principle, CroMo can also be used for any other language for which a morphotactic language model is feasible.
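One general reason why automaton-style lexicons behave this way is that a lookup follows one transition per input character, so its cost depends on the length of the word and not on the number of stored entries. The following Python sketch shows this with a plain trie. It is an illustration under stated assumptions, not CroMo's data structure: the class `TrieLexicon`, its methods, and the synthetic entries are invented here, and the suffix sharing that a minimized automaton adds for compression is not shown.

```python
class TrieLexicon:
    """Character trie: one dictionary lookup per input character."""

    def __init__(self):
        self.root = {}

    def add(self, morpheme, entry):
        node = self.root
        for ch in morpheme:
            node = node.setdefault(ch, {})
        node[None] = entry  # the None key marks the end of a stored morpheme

    def lookup(self, morpheme):
        """Return (entry or None, number of transitions followed)."""
        node = self.root
        steps = 0
        for ch in morpheme:
            node = node.get(ch)
            steps += 1
            if node is None:
                return None, steps
        return node.get(None), steps


lexicon = TrieLexicon()
lexicon.add("knjig", {"lemma": "knjiga"})
entry_small, steps_small = lexicon.lookup("knjig")

# Grow the lexicon with synthetic entries (hypothetical stand-ins for new morphemes).
for i in range(10000):
    lexicon.add(f"x{i}", {"lemma": f"x{i}"})
entry_large, steps_large = lexicon.lookup("knjig")

# The same lookup takes the same number of steps before and after the extension.
print(steps_small, steps_large)
```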

The morphological annotation is based on the General Ontology for Linguistic Description (GOLD). It uses well-defined, specified linguistic terminology, which maximizes interoperability and usability as well as mappability to other annotation schemas and standards. GOLD also opens up new possibilities for disambiguation and processing supported by ontologies or description logic.

If you would like to test the engine, contact us and describe your specific scenario and application. We will get back to you with a licensing proposal and potential testing scenarios.