Assignment 2nd of Feb. 2012

Evaluate the hypothesis that “young king” is a collocation in Oscar Wilde's story collection A House of Pomegranates.

See Manning and Schütze (1999) for some more details.

Use AntConc to analyze HOPG.txt as described in the “Using Antconc: Notes 1” post.

Identify the frequency of “young”, “king” and “young king”.
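As a cross-check on the AntConc counts, the three frequencies can also be sketched in Python. This is only a rough sketch: the function name count_frequencies, the lowercasing, and the simple letters-and-apostrophes tokenization are assumptions, so the numbers may differ slightly from what AntConc reports with its own token definition. The sample sentence is a toy string, not a quotation from the text.

```python
import re
from collections import Counter

def count_frequencies(text, w1, w2):
    """Return (token count, freq of w1, freq of w2, freq of bigram 'w1 w2')."""
    # Assumption: tokens are lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    # Adjacent token pairs; pairs that span sentence boundaries are included.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return len(tokens), unigrams[w1], unigrams[w2], bigrams[(w1, w2)]

# Toy sample for illustration only.
sample = "The young King was alone. The king was young, and the young king slept."
n, f_young, f_king, f_young_king = count_frequencies(sample, "young", "king")
print(n, f_young, f_king, f_young_king)
```

For the assignment itself, the same function could be called on the contents of HOPG.txt instead of the toy sample.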

Enter these frequencies in a table like this:

            young    not young    total
king        ----     ----         ----
not king    ----     ----         ----
total       ----     ----         ----

Assuming that the two words do not occur next to each other by accident, our Research Hypothesis is that the probability of “young king” occurring in the text is not the product of the probability of “young” and the probability of “king”; that is, the two variables are dependent:

P(young king) ≠ P(young) * P(king)

We formulate the Null Hypothesis that the occurrences of “young king” we observed are due to chance alone, and that the two variables are in fact independent:

P(young king) = P(young) * P(king)

Calculate the expected frequency for each cell in the table above, using the independence assumption.

Example:

Assume that we have a text with 36485 tokens. The token “John” occurs 214 times in the text. The token “Smith” occurs 86 times in the text. The bigram “John Smith” occurs 63 times in the text. We fill the table this way:

             John    not John    total
Smith        63      ----        86
not Smith    ----    ----        ----
total        214     ----        36485

The total number of words that are “not John” is 36485 - 214, and the total number of words that are “not Smith” is 36485 - 86. The number of bigrams with “John” followed by some word that is “not Smith” is 214 - 63, and the number of bigrams with some word that is “not John” followed by “Smith” is 86 - 63. The remaining cell, “not John” followed by “not Smith”, is whatever is left of the overall total once the other three cells are filled in. Finish the table of observations.
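The arithmetic for completing the table of observations can be sketched in Python as follows. The helper name observed_table and the row/column layout (rows for Smith / not Smith, columns for John / not John) are illustrative choices; only the four input counts come from the example.

```python
# Counts from the "John Smith" example.
N = 36485          # total number of tokens
john = 214         # frequency of "John"
smith = 86         # frequency of "Smith"
john_smith = 63    # frequency of the bigram "John Smith"

def observed_table(n, w1, w2, w1_w2):
    """Return the 2x2 observed counts.

    Rows: [w2, not w2]; columns: [w1, not w1].
    """
    return [
        [w1_w2, w2 - w1_w2],                  # "Smith" row
        [w1 - w1_w2, n - w1 - w2 + w1_w2],    # "not Smith" row
    ]

observed = observed_table(N, john, smith, john_smith)
for row in observed:
    print(row, sum(row))   # each row followed by its row total
```

The row totals should come out as 86 and 36485 - 86, and the column totals as 214 and 36485 - 214, which is a quick way to check a table filled in by hand.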

To calculate the expectations under the Null Hypothesis, we need the probabilities of the individual words, for example “John” and “Smith”. Under the Null Hypothesis, the probability of getting “John Smith” when randomly picking two adjacent words in the text is the product of these two probabilities:

P(John) = 214 / 36485
P(Smith) = 86 / 36485

P(John) * P(Smith) = 214 / 36485 * 86 / 36485

This gives us the probability, but not the absolute number we would expect in the cell where we observed 63. If we multiply the product of the individual word probabilities by the total number of words, we get the expected number of co-occurrences of “John Smith”, that is:

expected frequency of “John Smith” = 214 / 36485 * 86 / 36485 * 36485 = (214 * 86) / 36485

We can generalize this as follows: for every cell, multiply the column total by the row total and divide by the overall total.
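The rule of multiplying the row total by the column total and dividing by the overall total can be sketched for the whole table at once. The function name expected_table is hypothetical, and the observed counts are the completed John/Smith table (rows Smith / not Smith, columns John / not John).

```python
def expected_table(observed):
    """Expected counts under independence: row total * column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Completed table of observations for the "John Smith" example.
observed = [[63, 23], [151, 36248]]
expected = expected_table(observed)

for row in expected:
    print([round(value, 3) for value in row])
```

The top-left cell is (214 * 86) / 36485, roughly 0.5, so under the Null Hypothesis we would expect to see “John Smith” less than once, compared with the 63 occurrences observed.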

Apply this to every cell and note the expected values under the Null Hypothesis. Then apply the chi-square (χ²) formula to the observed and expected values of all cells.

Note: In this case the number of degrees of freedom (df) is 1.
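The final step can be sketched as below, assuming the intended formula is the standard Pearson chi-square statistic, i.e. the sum over all cells of (observed - expected)^2 / expected. The function name chi_square is hypothetical, and the 0.05 significance level is an assumption; for df = 1 the corresponding critical value from a chi-square table is 3.841.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence for cell (i, j).
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Completed table of observations for the "John Smith" example.
observed = [[63, 23], [151, 36248]]
chi2 = chi_square(observed)

# Assumption: significance level 0.05; critical value for df = 1.
CRITICAL_05_DF1 = 3.841
reject_null = chi2 > CRITICAL_05_DF1
print(round(chi2, 1), reject_null)
```

If the statistic exceeds the critical value, the Null Hypothesis of independence is rejected at that significance level. The same function can be applied to the completed young/king table.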