Python Parsing with NLTK and Foma

(C) 2017-2019 by Damir Cavar

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This is a tutorial related to the discussion of grammar engineering and parsing in the classes Alternative Syntactic Theories and Advanced Natural Language Processing taught at Indiana University in Spring 2017, Fall 2018, and Spring 2019.

The following code examples require the foma application and library to be installed, as well as the foma.py module. Since I am using Python 3.x here, I would recommend to use my version of foma.py and install it in the local modules folder of your Python distribution. In my case, since I use Anaconda, the file goes in anaconda/lib/python3.6/ in my home directory. On a Linux distribution you might have a folder /usr/local/lib/python3.6 or similar, where the module foma.py has to be copied to.

Grammars and Parsers

We will use NLTK in the following:

In [1]:
import nltk

We can declare a feature structure and display it using NLTK:

In [2]:
fstr = nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]")
print(fstr)
[       [ GND = 'fem' ] ]
[ AGR = [ NUM = 'pl'  ] ]
[       [ PER = 3     ] ]
[                       ]
[ POS = 'N'             ]

We will also use Feature grammars:

In [3]:
from nltk.grammar import FeatureGrammar, FeatDict

We can define a feature grammar in the following way:

In [4]:
grammarText = """
% start S
# ############################
# Grammar Rules
# ############################
S[TYPE=decl] -> NP[NUM=?n, PERS=?p, CASE='nom'] VP[NUM=?n, PERS=?p] '.'
S[TYPE=inter] -> NP[NUM=?n, PERS=?p, CASE='nom', PRONTYPE=inter] VP[NUM=?n, PERS=?p] '?'
S[TYPE=inter] -> NP[NUM=?n, PERS=?p, CASE='acc', PRONTYPE=inter] AUX NP[NUM=?n, PERS=?p, CASE='nom'] VP[NUM=?n, PERS=?p, VAL=2] '?'
NP[NUM=?n, PERS=?p, CASE=?c] -> N[NUM=?n, PERS=?p, CASE=?c]
NP[NUM=?n, PERS=?p, CASE=?c] -> D[NUM=?n, CASE=?c] N[NUM=?n,PERS=?p, CASE=?c]
NP[NUM=?n, PERS=?p, CASE=?c] -> Pron[NUM=?n, PERS=?p, CASE=?c]
VP[NUM=?n, PERS=?p] -> V[NUM=?n, PERS=?p]
VP[NUM=?n, PERS=?p, VAL=2] -> V[NUM=?n, PERS=?p, VAL=2] NP[CASE='acc']
VP[NUM=?n, PERS=?p, VAL=2] -> V[NUM=?n, PERS=?p, VAL=2]
"""

grammar = FeatureGrammar.fromstring(grammarText)

Testing the grammar with the input sentence John loves Mary fails, because there is not lexical defintion of these entries in the grammar.

In [5]:
parser = nltk.parse.FeatureChartParser(grammar)
sentences = ["John loves Mary",
             "Mary loves John"]

for sentence in sentences:
    result = []
    try:
        result = list(parser.parse(sentence.split()))    
    except ValueError:
        print("The grammar is missing token definitions.")

    if result:
        for x in result:
            print(x)
    else:
        print("*", sentence)
The grammar is missing token definitions.
* John loves Mary
The grammar is missing token definitions.
* Mary loves John

We can include foma and a morphology using the following code:

In [6]:
import foma

The following line will load the eng.fst Foma morphology:

In [7]:
fst = foma.FST.load(b'eng.fst')

We can print out the analysis for each single token by submitting it to the FST:

In [8]:
tokens = "John loves Mary".split()
for token in tokens:
    result = list(fst.apply_up(str.encode(token)))
    for r in result:
        print(r.decode('utf8'))
John+N+Sg+Masc+NEPerson
love+V+3P+Sg
Mary+N+Sg+Fem+NEPerson

If we want to convert the flat string annotation from the morphology to a NLTK feature structure, we need to translate some entires to a corresponding Attribute Value Matrix (AVM). In the following table we define a feature in the morphology output and the corresponging feature structure that it corresponds with:

In [9]:
featureMapping = {
    'Q'     : "PRONTYPE = inter",
    'Animate' : "ANIMATE = 1",
    'Def'   : "DETTYPE = def",
    'Indef' : "DETTYPE = indef",
    'Sg'    : "NUM = sg",
    'Pl'    : "NUM = pl",
    '3P'    : "PERS = 3",
    'Masc'  : "GENDSEM = male",
    'Fem'   : "GENDSEM = female",
    'Dat'   : "CASE = dat",
    'Acc'   : "CASE = acc",
    'Nom'   : "CASE = nom",
    'NEPersonName' : """NTYPE = [NSYN = proper,
                                NSEM = [PROPER = [PROPERTYPE = name,
                                                 NAMETYPE   = first_name]]],
                        HUMAN = 1"""
}

We use a function feat2LFG to convert the feature tag in the morphological analysis to a LFG-compatible AVM:

In [10]:
def feat2LFG(f):
    result = featureMapping.get(f, "")
    return(nltk.FeatStruct("".join( ("[", result, "]") )))

The following helper function is recursive. It mapps the AVM to a bracketed string annotation of feature structures, as used in the NLTK feature grammar format:

In [11]:
def flatFStructure(f):
    res = ""
    for key in f.keys():
        val = f[key]
        if res:
            res += ', '
        if (isinstance(val, FeatDict)):
            res += key + '=' + flatFStructure(val)
        else:
            res += key + "=" + str(val)
    return('[' + res + ']')

The following function is a parse function that prints out parse trees for an input, maintaining the extended feature structures at the lexical level. It can now parse sentences that contain words that are not specified as lexical words in the grammar, but rather as paths in the morphological finite state transducer.

In [12]:
def parseFoma(sentence):
    tokens = sentence.split()

    tokenAnalyses = {}
    rules = []
    count = 0
    for token in tokens:
        aVal = []
        result = list(fst.apply_up(str.encode(token)))
        for r in result:
            elements = r.decode('utf8').split('+')

            lemma = elements[0]
            tokFeat = nltk.FeatStruct("[PRED=" + lemma + "]")

            cat = elements[1]
            if len(elements) > 2:
                feats = tuple(elements[2:])
            else:
                feats = ()
            for x in feats:
                fRes2 = feat2LFG(x)
                fRes = tokFeat.unify(fRes2)
                if fRes:
                    tokFeat = fRes
                else:
                    print("Error unifying:", tokFeat, fRes2)
            flatFStr = flatFStructure(tokFeat)
            aVal.append(cat + flatFStr)
            rules.append(cat + flatFStr + " -> " + "'" + token + "'")
        tokenAnalyses[count] = aVal
        count += 1

    grammarText2 = grammarText + "\n" + "\n".join(rules)

    grammar = FeatureGrammar.fromstring(grammarText2)
    parser = nltk.parse.FeatureChartParser(grammar)
    result = list(parser.parse(tokens))
    if result:
        for x in result:
            print(x)
    else:
        print("*", sentence)

We can call this function using the following code:

In [13]:
parseFoma("who called John ?")
(S[TYPE='inter']
  (NP[CASE=?c, NUM='sg', PERS=3]
    (Pron[ANIMATE=1, NUM='sg', PERS=3, PRED='who', PRONTYPE='inter']
      who))
  (VP[NUM=?n, PERS=?p, VAL=2]
    (V[PRED='call'] called)
    (NP[CASE=?c, NUM='sg', PERS=?p]
      (N[GENDSEM='male', NUM='sg', PRED='John'] John)))
  ?)

...