(C) 2018-2019 by Damir Cavar
Download: This and various other Jupyter notebooks are available from my GitHub repo.
This notebook is based on my own NLTK notebooks and on Michael Elhadad's notebook Constituent-based Syntactic Parsing with NLTK, which you may consult for related material.
This is a tutorial related to the discussion of grammar engineering and parsing in the classes Alternative Syntactic Theories and Advanced Natural Language Processing, taught at Indiana University in Spring 2017, Fall 2018, and Fall 2019.
The examples presuppose an installed Python 3.x NLTK module with all the dependent modules and packages, as well as the data set for NLTK.
You can use the NLTK corpus module to read and use the different corpora in the NLTK data set.
import nltk.corpus
We can load the sample portion of the Penn Treebank that ships with the NLTK data set:
ptb = nltk.corpus.treebank
The corpus consists of a collection of files. Each file has an ID (file name). You can print the IDs using the fileids method:
print(ptb.fileids())
We can print out the content of a particular file:
print(ptb.raw('wsj_0001.mrg'))
As you can see, the file wsj_0001.mrg contains two sentence trees. You can access each sentence tree individually. Here we print out the first sentence:
print(ptb.parsed_sents('wsj_0001.mrg')[0])
And in the following code we print the second tree:
print(ptb.parsed_sents('wsj_0001.mrg')[1])
To see the tree structure as a graphic, you can apply the draw() method to the particular tree. The following code will open up a window with the drawing of the first tree in the file wsj_0001.mrg.
ptb.parsed_sents('wsj_0001.mrg')[0].draw()
Notice that the menu of the window lets you export the drawing of the tree as a PostScript file.
NLTK allows you to define your own tree using the bracketed notation and the Tree class.
from nltk import Tree
We can define a string variable with a syntactic tree and create a Tree object by applying the fromstring method from the Tree class:
trees = "(S (NP (NN cats ) ) (VP (V chase ) (NP (NN mice ) ) ))"
myTree = Tree.fromstring(trees)
We can print the tree in bracketed notation:
print(myTree)
The internal representation of the tree looks as follows:
print(myTree.__repr__())
The tree can be written in LaTeX format for the qtree package:
print(myTree.pformat_latex_qtree())
We can draw the tree by applying the draw() method to the tree object. The following code will open up a window with the drawing of the corresponding tree:
myTree.draw()
We can print the label of a tree or subtrees using the label method:
print(myTree.label())
A daughter of the root node of the tree can be accessed using the list index notation:
print(myTree[0])
print(myTree[1])
We can print out the label of these subtrees by applying the label method to them:
print(myTree[0].label())
print(myTree[1].label())
We can traverse the tree using these bracketed index operators:
print(myTree[0])
print(myTree[0,0])
print(myTree[0,0,0])
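Beyond manual indexing, the Tree class also provides a treepositions() method that enumerates the index path of every node, including the leaves. A minimal sketch on a fresh copy of the example tree:

```python
from nltk import Tree

# A fresh copy of the example tree used above.
tree = Tree.fromstring("(S (NP (NN cats)) (VP (V chase) (NP (NN mice))))")

# treepositions() yields the index path of every node, including leaves.
for pos in tree.treepositions():
    node = tree[pos]
    # Leaves are plain strings; internal nodes are Tree objects.
    label = node.label() if isinstance(node, Tree) else node
    print(pos, label)
```

Each position returned by treepositions() can be used directly as an index into the tree, as in the bracketed index operators above.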
Sections of the tree can be modified or replaced using the index notation. We replace the subject (NN cats) with (NN dogs) in the following code:
myTree[0,0] = Tree.fromstring('(NN dogs)')
print(myTree)
Various details about the tree can be accessed. We can extract the leaves of the tree using the leaves method:
print(myTree.leaves())
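Closely related to leaves(), the pos() method returns the leaves paired with the labels of their preterminal nodes. A short sketch on a fresh copy of the original example tree:

```python
from nltk import Tree

tree = Tree.fromstring("(S (NP (NN cats)) (VP (V chase) (NP (NN mice))))")

# pos() pairs each leaf with the label of the node immediately above it.
print(tree.pos())  # [('cats', 'NN'), ('chase', 'V'), ('mice', 'NN')]
```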
We can query the height of a tree:
print(myTree.height())
For specific purposes and some specific algorithms one might want to convert the tree to remove all unary relations between symbols, that is, chains in which a node has a single nonterminal child. For example, (NP (NP (NN cats ))) is collapsed to (NP+NP (NN cats )):
trees2 = "(S (NP (NP (NN cats ) ) ) (VP (V chase ) (NP (NN mice ) ) ))"
myTree2 = Tree.fromstring(trees2)
myTree2.collapse_unary()
print(myTree2)
We can convert a tree to Chomsky Normal Form as well. In the following tree the subject NP dominates an NP branching into three daughter nodes. This is converted to a binary branching structure:
trees2 = "(S (NP (NP (DET the ) (JJ big ) (NN cats ) ) ) (VP (V chase ) (NP (NN mice ) ) ))"
myTree2 = Tree.fromstring(trees2)
myTree2.chomsky_normal_form()
print(myTree2)
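The conversion can also be reversed with un_chomsky_normal_form(), which removes the artificial nodes introduced by the binarization. A sketch, assuming the round trip restores the original shape for this tree:

```python
from nltk import Tree

original = "(S (NP (DET the) (JJ big) (NN cats)) (VP (V chase) (NP (NN mice))))"
tree = Tree.fromstring(original)

tree.chomsky_normal_form(horzMarkov=2)  # binarize the ternary NP
tree.un_chomsky_normal_form()           # undo the binarization
print(tree)
```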
We can generate all production rules for a tree:
myTree.productions()
Tree labels can be complex objects; they do not have to be strings. In the following code we replace the label of the root node of our tree with a tuple of a string and an integer:
myTree.set_label( ('S', 3) )
print(myTree)
Probabilistic rules or trees can be defined in the following way:
pt = nltk.tree.ProbabilisticTree('NP', ['DET', 'N'], prob=0.5)
print(pt)
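The probability attached to such a tree can be read back with the prob() method that ProbabilisticTree inherits from NLTK's probabilistic mix-in:

```python
import nltk

pt = nltk.tree.ProbabilisticTree('NP', ['DET', 'N'], prob=0.5)
print(pt.label())  # NP
print(pt.prob())   # 0.5
```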
Draw the tree:
pt.draw()
In the previous sections (see links to the other notebooks above), we have seen how CFGs can be defined, read from a file, or used in a parser.
PCFG rules are defined in the same way, except that a probability is assigned to each right-hand side of a rule. The probabilities of all rules with the same left-hand side must sum to 1. The following code imports the PCFG class from NLTK:
from nltk import PCFG
We can define a grammar:
pcfg1 = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.5] | NP PP [0.25] | N [0.25]
PP -> P NP [1.0]
VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
N -> 'woman' [0.3] | 'man' [0.3] | 'telescope' [0.3] | 'mixer' [0.1]
Det -> 'the' [0.6] | 'a' [0.2] | 'my' [0.2]
V -> 'killed' [0.35] | 'saw' [0.65]
P -> 'with' [0.61] | 'under' [0.39]
""")
We can print out the productions:
print(pcfg1)
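We can also check programmatically that the rule probabilities for each left-hand side sum to 1 (PCFG.fromstring already enforces this when the grammar is read in). A minimal sketch over the productions of the grammar defined above:

```python
from collections import defaultdict
from nltk import PCFG

pcfg1 = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.5] | NP PP [0.25] | N [0.25]
    PP -> P NP [1.0]
    VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
    N -> 'woman' [0.3] | 'man' [0.3] | 'telescope' [0.3] | 'mixer' [0.1]
    Det -> 'the' [0.6] | 'a' [0.2] | 'my' [0.2]
    V -> 'killed' [0.35] | 'saw' [0.65]
    P -> 'with' [0.61] | 'under' [0.39]
""")

# Sum the probabilities of all rules that share a left-hand side.
totals = defaultdict(float)
for prod in pcfg1.productions():
    totals[prod.lhs()] += prod.prob()

for lhs, total in sorted(totals.items(), key=str):
    print(lhs, round(total, 10))
```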
We can import the sample portion of the Penn Treebank as treebank:
from nltk.corpus import treebank
We can now loop over all files, read the sentences in, extract all production rules from them, and aggregate the production rules in a list:
productions = []
for t in treebank.fileids()[:2]:
    for x in treebank.parsed_sents(t):
        productions += x.productions()
print(productions)
To be able to induce the PCFG, we need to define a Nonterminal object with the start symbol as label:
from nltk import Nonterminal
S = Nonterminal('S')
We can now induce the PCFG using the Nonterminal start symbol and the list of production rules:
grammar = nltk.induce_pcfg(S, productions)
print(grammar)
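Note that induce_pcfg works on any list of productions, not only on treebank data. A minimal sketch inducing a grammar from two hand-written trees:

```python
import nltk
from nltk import Tree, Nonterminal

trees = [
    Tree.fromstring("(S (NP (NN cats)) (VP (V chase) (NP (NN mice))))"),
    Tree.fromstring("(S (NP (NN dogs)) (VP (V sleep)))"),
]

productions = []
for t in trees:
    productions += t.productions()

# Rule probabilities are the relative frequencies of each right-hand side
# among all observed rules with the same left-hand side.
grammar = nltk.induce_pcfg(Nonterminal('S'), productions)
print(grammar)
```

Since VP rewrites once as V NP and once as V, each of the two VP rules receives probability 0.5.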
We can normalize the trees by collapsing unary branches and converting them to Chomsky Normal Form before we extract the production rules and induce the grammar:
productions = []
for t in treebank.fileids()[:2]:
    for x in treebank.parsed_sents(t):
        x.collapse_unary(collapsePOS=False)
        x.chomsky_normal_form(horzMarkov=2)
        productions += x.productions()
grammar = nltk.induce_pcfg(S, productions)
print(grammar)
NLTK provides various parser implementations. One that implements the Viterbi CKY n-best parses over a PCFG is available in the parse.viterbi module. (See Michael Elhadad's notebook Constituent-based Syntactic Parsing with NLTK for more details.)
print("Parse sentence using induced grammar:")
parser = nltk.pchart.InsideChartParser(grammar)
parser.trace(3)
sent = treebank.parsed_sents('wsj_0001.mrg')[0].leaves()
print(sent)
for parse in parser.parse(sent):
    print(parse)
The other parsers are defined in the nltk.parse module.
import sys, time
from nltk import tokenize
from nltk.grammar import toy_pcfg1
from nltk.parse import pchart
from nltk.parse import ViterbiParser
demos = [('I saw John with my telescope', toy_pcfg1)]
sent, grammar = demos[0]
# Tokenize the sentence.
tokens = sent.split()
# Define a list of parsers. We'll use all parsers.
parsers = [
ViterbiParser(grammar),
pchart.InsideChartParser(grammar),
pchart.RandomChartParser(grammar),
pchart.UnsortedChartParser(grammar),
pchart.LongestChartParser(grammar),
pchart.InsideChartParser(grammar, beam_size = len(tokens)+1)
]
We can now loop over the different parsers and compare the output and performance:
# Run the parsers on the tokenized sentence.
from functools import reduce
times = []
average_p = []
num_parses = []
all_parses = {}
for parser in parsers:
    print('\ns: %s\nparser: %s\ngrammar: %s' % (sent, parser, grammar))
    parser.trace(3)
    t = time.time()
    parses = parser.parse_all(tokens)
    times.append(time.time() - t)
    if parses:
        lp = len(parses)
        # Average probability over the returned parses.
        p = reduce(lambda a, b: a + b.prob(), parses, 0.0) / lp
    else:
        p = 0
    average_p.append(p)
    num_parses.append(len(parses))
    for p in parses:
        all_parses[p.freeze()] = 1
# Print summary statistics
print()
print('-------------------------+------------------------------------------')
print(' Parser Beam | Time (secs) # Parses Average P(parse)')
print('-------------------------+------------------------------------------')
for i in range(len(parsers)):
    print('%19s %4d |%11.4f%11d%19.14f' % (parsers[i].__class__.__name__,
                                           getattr(parsers[i], "beam_size", 0),
                                           times[i],
                                           num_parses[i],
                                           average_p[i]))
parses = all_parses.keys()
if parses:
    p = reduce(lambda a, b: a + b.prob(), parses, 0) / len(parses)
else:
    p = 0
print('-------------------------+------------------------------------------')
print('%19s |%11s%11d%19.14f' % ('(All Parses)', 'n/a', len(parses), p))
print()
for parse in parses:
    print(parse)
(C) 2018-2019 by Damir Cavar - Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). Parts of the code are taken from Michael Elhadad's notebook Constituent-based Syntactic Parsing with NLTK. Please consult him for details about the copyright.