Course: Statistics for linguistic research

Translate equations to R
by Damir Cavar (March 2010)

In this section we will take a look at how to translate some typical equation into R. Let us start with the formula for variance:

The LaTeX code for the equation is: S^2_{N-1}=\frac{1}{N-1}\sum_{i=1}^N{(x_i - \bar{x})^2}

What this formula says is the following:

a. Given a vector of values (some experimental results or measures) of length N, for example:

x <- c(1, 2, 4, 5, 6, 8, 9)

In this case, x is of length 7, that is N = 7, which can be verified by using the following function in R:

length(x)

b. for each value, from the beginning (index position i = 1), to the last element (index position i = N = 7),
subtract the arithmetic mean of the distribution from this value, and square the result:

The arithmetic mean of the distribution can be calculated using:

mean(x)

The subtraction and squaring can be calculated as:

(x - mean(x))^2

The resulting vector for x as defined above should look like:

16 9 1 0 1 9 16

c. sum up all the resulting values, and divide the result by (N - 1).

Summing up the values in a list can be achieved by using the function sum():

sum(x)

The output for x as defined above should be:

35

To complete our calculation using steps a. and b., the formula so far should be translated as:

sum( (x - mean(x))^2 )

The output of this calculation for the defined x should be:

52

The final step is to divide this part of the calculation result by N, which is the length of x reduced by 1:

sum( (x - mean(x))^2 ) / (length(x) - 1)

The resulting value:

8.666667

is the variance of the measures in x, as defined above.

We can define this calculation as a function, and store it in some external file for reuse. A function can be defined in the following way:

s2variance <- function (x) { sum( (x - mean(x))^2 ) / (length(x) - 1) }

s2variance is now defined as a function that takes one parameter, i.e. a vector of values, measures or results. The returned value is the variance of the distribution, as explained above. For example:

s2variance( c(3, 5, 1, 1, 2, 1, 0, 4) )

should return the value:

2.982143

In fact, R comes with a predefined function for variance calculation, as explained in the previous section (the function var), such that:

var( c(3, 5, 1, 1, 2, 1, 0, 4) )

should return the same value as our own function definition above, i.e.:

2.982143

A variation of this equation is the variance of a population:

The LaTeX code for this equation is: S^2_N=\frac{1}{N}\sum_{i=1}^N{(x_i - \bar{x})^2}

The only difference is that in this function we do not divide by the length of x minus 1, but just by the length of x, as in the following R code:

sum( (x - mean(x))^2 ) / length(x)

Let us consider now the equation for the Standard deviation:

The LaTeX code for this equation is: \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N{ (x_i - \bar{x})^2 }}

This equation is just the square root of the population variance. The corresponding R code is:

sqrt( sum( (x - mean(x))^2 ) / length(x) )

To declare this code as a function for storage and reuse, just wrap it in a function declaration as:

sdd <- function (x) { sqrt( sum( (x - mean(x))^2 ) / length(x) ) }

and use it as follows:

sdd(x)

which should return:

2.725541

Let us consider a random variable X, represented here as a vector with probabilities:

x <- c(0.1, 0.3, 0.15, 0.25, 0.01, 0.08, 0.01, 0.06, 0.04)

The vector x in this case is a complete list of event probabilities, such that the sum of all probabilities in x is equal to 1, as can be verified with the following R code:

sum(x)

Now, the (Shannon) entropy H(x) is defined as:

The LaTeX code for this equation is: H(X) = -\sum_{i=1}^n{p(x_i)log_b p(x_i)}

The equation says that all probabilities of events for the random variable X have to be multiplied with the logarithm of them, summarized and multiplied by -1. We use here the logarithm to the base of 2, which is in R:

log2(x)

Thus, given that the vector x contains probabilities, the multiplication of each single probability with the log to the base of two of it, is in R:

x * log2(x)

For x as defined above, the resulting list should look like:

-0.33219281 -0.52108968 -0.41054484 -0.50000000 -0.06643856 -0.29150850 -0.06643856 -0.24353362 -0.18575425

Summing up the values in the resulting list can be achieved by the function sum:

sum(x * log2(x))

and the absolute value of this summation is returned by multiplying with -1:

-sum(x * log2(x))

By the way, the initial sum is always negative, because the log of a probability is always negative, given that a probability is between 0 and 1.

Course: Statistics for linguistic research

Linguistics Department, University of Zadar