**(C) 2018-2019 by Damir Cavar**

**Version:** 1.1, September 2019

**Download:** This and various other Jupyter notebooks are available from my GitHub repo.

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring 2018 at Indiana University.

The following material is based on *Linear Algebra Review and Reference* by Zico Kolter (updated by Chuong Do) from September 30, 2015. This means, many passages are literally copied, many are rewritten. I do not mark sections that are added or different. Consider this notebook a extended annotation of Kolter's (and Do's) notes. See also James E. Gentle (2017) Matrix Algebra: Theory, Computations and Applications in Statistics. Second edition. Springer. Another good resource is Philip N. Klein (2013) Coding the Matrix: Linear Algebra through Applications to Computer Science, Newtonian Press.

The following system of equations:

$\begin{equation} \begin{split} 4 x_1 - 5 x_2 & = -13 \\ -2x_1 + 3 x_2 & = 9 \end{split} \end{equation}$

We are looking for a unique solution for the two variables $x_1$ and $x_2$. The system can be described as:

$Ax = b$

as matrices:

$A = \begin{bmatrix} 4 & -5 \\[0.3em] -2 & 3 \end{bmatrix},\ b = \begin{bmatrix} -13 \\[0.3em] 9 \end{bmatrix}$

A **scalar** is an element in a vector, containing a real number **value**. In a vector space model or a vector mapping of (symbolic, qualitative, or quantitative) properties the scalar holds the concrete value or property of a variable.

A **vector** is an array, tuple, or ordered list of scalars (or elements) of size $n$, with $n$ a positive integer. The **length** of the vector, that is the number of scalars in the vector, is also called the **order** of the vector.

**Vectorization** is the process of creating a vector from some data using some process.

Vectors of the length $n$ could be treated like points in $n$-dimensional space. One can calculate the distance between such points using measures like Euclidean Distance. The similarity of vectors could also be calculated using Cosine Similarity.

A **matrix** is a list of vectors that all are of the same length. $A$ is a matrix with $m$ rows and $n$ columns, antries of $A$ are real numbers:

$A \in \mathbb{R}^{m \times n}$

A vector $x$ with $n$ entries of real numbers, could also be thought of as a matrix with $n$ rows and $1$ column, or as known as a **column vector**.

$x = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix}$

Representing a **row vector**, that is a matrix with $1$ row and $n$ columns, we write $x^T$ (this denotes the transpose of $x$, see above).

$x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$

We use the notation $a_{ij}$ (or $A_{ij}$, $A_{i,j}$, etc.) to denote the entry of $A$ in the $i$th row and $j$th column:

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$

We denote the $j$th column of $A$ by $a_j$ or $A_{:,j}$:

$A = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_{1} & a_{2} & \cdots & a_{n} \\[0.3em] \big| & \big| & & \big| \end{bmatrix}$

We denote the $i$th row of $A$ by $a_i^T$ or $A_{i,:}$:

$A = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}$

A $n \times m$ matrix is a two-dimensional array with $n$ rows and $m$ columns.

The result of the multiplication of two matrixes $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix:

$C = AB \in \mathbb{R}^{m \times n}$

That is, we are multiplying the columns of $A$ with the rows of $B$:

$C_{ij}=\sum_{k=1}^n{A_{ij}B_{kj}}$

The number of columns in $A$ must be equal to the number of rows in $B$.

For two vectors $x, y \in \mathbb{R}^n$, the **inner product** or **dot product** $x^T y$ is a real number:

$x^T y \in \mathbb{R} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\[0.3em] y_2 \\[0.3em] \vdots \\[0.3em] y_n \end{bmatrix} = \sum_{i=1}^{n}{x_i y_i}$

The **inner products** are a special case of matrix multiplication.

It is always the case that $x^T y = y^T x$.

To calculate the inner product of two vectors $x = [1 2 3 4]$ and $y = [5 6 7 8]$, we can loop through the vector and multiply and sum the scalars (this is simplified code):

In [ ]:

```
x = (1, 2, 3, 4)
y = (5, 6, 7, 8)
n = len(x)
if n == len(y):
result = 0
for i in range(n):
result += x[i] * y[i]
print(result)
```

It is clear that in the code above we could change line 7 to `result += y[i] * x[i]`

without affecting the result.

We can use the *numpy* module to apply the same operation, to calculate the **inner product**. We import the *numpy* module and assign it a name *np* for the following code:

In [ ]:

```
import numpy as np
```

We define the vectors $x$ and $y$ using *numpy*:

In [ ]:

```
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
print("x:", x)
print("y:", y)
```

We can now calculate the $dot$ or $inner product$ using the *dot* function of *numpy*:

In [ ]:

```
np.dot(x, y)
```

The order of the arguments is irrelevant:

In [ ]:

```
np.dot(y, x)
```

Note that both vectors are actually **row vectors** in the above code. We can transpose them to column vectors by using the *shape* property:

In [ ]:

```
print("x:", x)
x.shape = (4, 1)
print("xT:", x)
print("y:", y)
y.shape = (4, 1)
print("yT:", y)
```

In fact, in our understanding of Linear Algebra, we take the arrays above to represent **row vectors**. *Numpy* treates them differently.

We see the issues when we try to transform the array objects. Usually, we can transform a row vector into a column vector in *numpy* by using the *T* method on vector or matrix objects:

In [ ]:

```
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
print("x:", x)
print("y:", y)
print("xT:", x.T)
print("yT:", y.T)
```

The problem here is that this does not do, what we expect it to do. It only works, if we declare the variables not to be arrays of numbers, but in fact a matrix:

In [ ]:

```
x = np.array([[1, 2, 3, 4]])
y = np.array([[5, 6, 7, 8]])
print("x:", x)
print("y:", y)
print("xT:", x.T)
print("yT:", y.T)
```

Note that the *numpy* functions *dot* and *outer* are not affected by this distinction. We can compute the dot product using the mathematical equation above in *numpy* using the new $x$ and $y$ row vectors:

In [ ]:

```
print("x:", x)
print("y:", y.T)
np.dot(x, y.T)
```

Or by reverting to:

In [ ]:

```
print("x:", x.T)
print("y:", y)
np.dot(y, x.T)
```

To read the result from this array of arrays, we would need to access the value this way:

In [ ]:

```
np.dot(y, x.T)[0][0]
```

For two vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, where $n$ and $m$ do not have to be equal, the **outer product** of $x$ and $y$ is:

$xy^T \in \mathbb{R}^{m\times n}$

The **outer product** results in a matrix with $m$ rows and $n$ columns by $(xy^T)_{ij} = x_i y_j$:

$xy^T \in \mathbb{R}^{m\times n} = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\[0.3em] x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] x_m y_1 & x_m y_2 & \cdots & x_m y_n \\[0.3em] \end{bmatrix}$

Some useful property of the outer product: assume $\mathbf{1} \in \mathbb{R}^n$ is an $n$-dimensional vector of scalars with the value $1$. Given a matrix $A \in \mathbb{R}^{m\times n}$ with all columns equal to some vector $x \in \mathbb{R}^m$, using the outer product $A$ can be represented as:

$A = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] x & x & \cdots & x \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} x_1 & x_1 & \cdots & x_1 \\[0.3em] x_2 & x_2 & \cdots & x_2 \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] x_m &x_m & \cdots & x_m \end{bmatrix} = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_m \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} = x \mathbf{1}^T$

If we want to compute the outer product of two vectors $x$ and $y$, we need to transpose the row vector $x$ to a column vector $x^T$. This can be achieved by the *reshape* function in *numpy*, the *T* method, or the *transpose()* function. The *reshape* function takes a parameter that describes the number of colums and rows for the resulting transposing:

In [ ]:

```
x = np.array([[1, 2, 3, 4]])
print("x:", x)
print("xT:", np.reshape(x, (4, 1)))
print("xT:", x.T)
print("xT:", x.transpose())
```

We can now compute the **outer product** by multiplying the column vector $x$ with the row vector $y$:

In [ ]:

```
x = np.array([[1, 2, 3, 4]])
y = np.array([[5, 6, 7, 8]])
x.T * y
```

*Numpy* provides an *outer* function that does all that:

In [ ]:

```
np.outer(x, y)
```

Note, in this simple case using the simple arrays for the data structures of the vectors does not affect the result of the *outer* function:

In [ ]:

```
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
np.outer(x, y)
```

Assume a matrix $A \in \mathbb{R}^{m\times n}$ and a vector $x \in \mathbb{R}^n$ the product results in a vector $y = Ax \in \mathbb{R}^m$.

$Ax$ could be expressed as the dot product of row $i$ of matrix $A$ with the column value $j$ of vector $x$. Let us first consider matrix multiplication with a scalar:

$A = \begin{bmatrix} 1 & 2 \\[0.3em] 3 & 4 \end{bmatrix}$

We can compute the product of $A$ with a scalar $n = 2$ as:

$A = \begin{bmatrix} 1 * n & 2 * n \\[0.3em] 3 * n & 4 * n \end{bmatrix} = \begin{bmatrix} 1 * 2 & 2 * 2 \\[0.3em] 3 * 2 & 4 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\[0.3em] 6 & 8 \end{bmatrix} $

Using *numpy* this can be achieved by:

In [ ]:

```
import numpy as np
A = np.array([[4, 5, 6],
[7, 8, 9]])
A * 2
```

Assume that we have a column vector $x$:

$x = \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix}$

To be able to multiply this vector with a matrix, the number of columns in the matrix must correspond to the number of rows in the column vector. The matrix $A$ must have $3$ columns, as for example:

$A = \begin{bmatrix} 4 & 5 & 6\\[0.3em] 7 & 8 & 9 \end{bmatrix}$

To compute $Ax$, we multiply row $1$ of the matrix with column $1$ of $x$:

$\begin{bmatrix} 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} = 4 * 1 + 5 * 2 + 6 * 3 = 32 $

We do the compute the dot product of row $2$ of $A$ and column $1$ of $x$:

$\begin{bmatrix} 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} = 7 * 1 + 8 * 2 + 9 * 3 = 50 $

The resulting column vector $Ax$ is:

$Ax = \begin{bmatrix} 32 \\[0.3em] 50 \end{bmatrix}$

Using *numpy* we can compute $Ax$:

In [ ]:

```
A = np.array([[4, 5, 6],
[7, 8, 9]])
x = np.array([1, 2, 3])
A.dot(x)
```

We can thus describe the product writing $A$ by rows as:

$y = Ax = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} x = \begin{bmatrix} a_1^T x \\[0.3em] a_2^T x \\[0.3em] \vdots \\[0.3em] a_m^T x \end{bmatrix}$

This means that the $i$th scalar of $y$ is the inner product of the $i$th row of $A$ and $x$, that is $y_i = a_i^T x$.

If we write $A$ in column form, then:

$y = Ax = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix} = \begin{bmatrix} a_1 \end{bmatrix} x_1 + \begin{bmatrix} a_2 \end{bmatrix} x_2 + \dots + \begin{bmatrix} a_n \end{bmatrix} x_n $

In this case $y$ is a **linear combination** of the *columns* of $A$, the coefficients taken from $x$.

The above examples multiply be the right with a column vector. One can multiply on the left by a row vector as well, $y^T = x^T A$ for $A \in \mathbb{R}^{m\times n}$, $x\in \mathbb{R}^m$, $y \in \mathbb{R}^n$. There are two ways to express $y^T$, with $A$ expressed by its columns, with $i$th scalar of $y^T$ corresponds to the inner product of $x$ and the $i$th column of $A$:

$y^T = x^T A = x^t \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \dots & x^T a_n \end{bmatrix}$

One can express $A$ by rows, where $y^T$ is a linear combination of the rows of $A$ with the scalars from $x$.

$\begin{equation} \begin{split} y^T & = x^T A \\ & = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} \\ & = x_1 \begin{bmatrix}-- & a_1^T & --\end{bmatrix} + x_2 \begin{bmatrix}-- & a_2^T & --\end{bmatrix} + \dots + x_n \begin{bmatrix}-- & a_n^T & --\end{bmatrix} \end{split} \end{equation}$

One can view matrix-matrix multiplication $C = AB$ as a set of vector-vector products. The $(i,j)$th entry of $C$ is the inner product of the $i$th row of $A$ and the $j$th column of $B$:

$C = AB = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] b_1 & b_2 & \cdots & b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\[0.3em] a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}$

Here $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$, $a_i \in \mathbb{R}^n$ and $b_j \in \mathbb{R}^n$, and $A$ is represented by rows, $B$ by columns.

If we represent $A$ by columns and $B$ by rows, then $AB$ is the sum of the outer products:

$C = AB = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} \begin{bmatrix} -- & b_1^T & -- \\[0.3em] -- & b_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & b_n^T & -- \end{bmatrix} = \sum_{i=1}^n a_i b_i^T $

This means that $AB$ is the sum over all $i$ of the outer product of the $i$th column of $A$ and the $i$th row of $B$.

One can interpret matrix-matrix operations also as a set of matrix-vector products. Representing $B$ by columns, the columns of $C$ are matrix-vector products between $A$ and the columns of $B$:

$C = AB = A \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] b_1 & b_2 & \cdots & b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] A b_1 & A b_2 & \cdots & A b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} $

In this interpretation the $i$th column of $C$ is the matrix-vector product with the vector on the right, i.e. $c_i = A b_i$.

Representing $A$ by rows, the rows of $C$ are the matrix-vector products between the rows of $A$ and $B$:

$C = AB = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} B = \begin{bmatrix} -- & a_1^T B & -- \\[0.3em] -- & a_2^T B & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_n^T B & -- \end{bmatrix}$

The $i$th row of $C$ is the matrix-vector product with the vector on the left, i.e. $c_i^T = a_i^T B$.

**Matrix multiplication is associative:** $(AB)C = A(BC)$

**Matrix multiplication is distributive:** $A(B + C) = AB + AC$

**Matrix multiplication is, in general, not commutative;** It can be the case that $AB \neq BA$. (For example, if $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times q}$, the matrix product $BA$ does not even exist if $m$ and $q$ are not equal!)

The **identity matrix** $I \in \mathbb{R}^{n\times n}$ is a square matrix with the value $1$ on the diagonal and $0$ everywhere else:

$I_{ij} = \left\{ \begin{array}{lr} 1 & i = j\\ 0 & i \neq j \end{array} \right. $

For all $A \in \mathbb{R}^{m\times n}$:

$AI = A = IA$

In the equation above multiplication has to be made possible, which means that in the portion $AI = A$ the dimensions of $I$ have to be $n\times n$, while in $A = IA$ they have to be $m\times m$.

We can generate an *identity matrix* in *numpy* using:

In [ ]:

```
import numpy as np
A = np.array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8],
[9, 10, 11]])
print("A:", A)
```

We can ask for the shape of $A$:

In [ ]:

```
A.shape
```

The *shape* property of a matrix contains the $m$ (number of rows) and $n$ (number of columns) properties in a tuple, in that particular order. We can create an identity matrix for the use in $AI$ by using the $n$ value:

In [ ]:

```
np.identity(A.shape[1], dtype="int")
```

Note that we specify the *dtype* parameter to *identity* as *int*, since the default would return a matrix of *float* values.

To generate an identity matrix for the use in $IA$ we would use the $m$ value:

In [ ]:

```
np.identity(A.shape[0], dtype="int")
```

We can compute the dot product of $A$ and its identity matrix $I$:

In [ ]:

```
n = A.shape[1]
I = np.array(np.identity(n, dtype="int"))
np.dot(A, I)
```

The same is true for the other direction:

In [ ]:

```
m = A.shape[0]
I = np.array(np.identity(m, dtype="int"))
np.dot(I, A)
```

In the **diagonal matrix** non-diagonal elements are $0$, that is $D = diag(d_1, d_2, \dots{}, d_n)$, with:

$D_{ij} = \left\{ \begin{array}{lr} d_i & i = j\\ 0 & i \neq j \end{array} \right. $

The identity matrix is a special case of a diagonal matrix: $I = diag(1, 1, \dots{}, 1)$.

In *numpy* we can create a *diagonal matrix* from any given matrix using the *diag* function:

In [ ]:

```
import numpy as np
A = np.array([[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15]])
np.diag(A)
```

An optional parameter *k* to the *diag* function allows us to extract the diagonal above the main diagonal with a positive *k*, and below the main diagonal with a negative *k*:

In [ ]:

```
np.diag(A, k=1)
```

In [ ]:

```
np.diag(A, k=-1)
```

**Transposing** a matrix is achieved by *flipping* the rows and columns. For a matrix $A \in \mathbb{R}^{m\times n}$ the transpose $A^T \in \mathbb{R}^{n\times m}$ is the $n\times m$ matrix given by:

$(A^T)_{ij} = A_{ji}$

Properties of transposes:

- $(A^T)^T = A$
- $(AB)^T = B^T A^T$
- $(A+B)^T = A^T + B^T$

Square metrices $A \in \mathbb{R}^{n\times n}$ are **symmetric**, if $A = A^T$.

$A$ is **anti-symmetric**, if $A = -A^T$.

For any matrix $A \in \mathbb{R}^{n\times n}$, the matrix $A + A^T$ is **symmetric**.

For any matrix $A \in \mathbb{R}^{n\times n}$, the matrix $A - A^T$ is **anti-symmetric**.

Thus, any square matrix $A \in \mathbb{R}^{n\times n}$ can be represented as a sum of a symmetric matrix and an anti-symmetric matrix:

$A = \frac{1}{2} (A + A^T) + \frac{1}{2} (A - A^T)$

The first matrix on the right, i.e. $\frac{1}{2} (A + A^T)$ is symmetric. The second matrix $\frac{1}{2} (A - A^T)$ is anti-symmetric.

$\mathbb{S}^n$ is the set of all symmetric matrices of size $n$.

$A \in \mathbb{S}^n$ means that $A$ is symmetric and of the size $n\times n$.

The **trace** of a square matrix $A \in \mathbb{R}^{n\times n}$ is $tr(A)$ (or $trA$) is the sum of the diagonal elements in the matrix:

$trA = \sum_{i=1}^n A_{ii}$

Properties of the **trace**:

- For $A \in \mathbb{R}^{n\times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$
- For $A,B \in \mathbb{R}^{n\times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$
- For $A \in \mathbb{R}^{n\times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t \mathrm{tr}A$
- For $A,B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$
- For $A,B,C$ such that $ABC$ is square, $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of more matrices.

See for proofs the paper!

The **norm** of a vector $x$ is $\| x\|$, informally the length of a vector.

Example: the Euclidean or $\mathscr{l}_2$ norm:

$\|x\|_2 = \sqrt{\sum_{i=1}^n{x_i^2}}$

Note: $\|x\|_2^2 = x^T x$

A **norm** is any function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies the following properties:

- For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity)
- $f(x) = 0$ if and only if $x = 0$ (definiteness)
- For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t|\ f(x)$ (homogeneity)
- For all $x, y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ (triangle inequality)

Norm $\mathscr{l}_1$:

$\|x\|_1 = \sum_{i=1}^n{|x_i|}$

Norm $\mathscr{l}_\infty$:

$\|x\|_\infty = \max_i|x_i|$

All these three norms are examples of the $\mathscr{l}_p$ norms, with $p$ a real number parameter $p \geq 1$:

$\|x\|_p = \left(\sum_{i=1}^n{|x_i|^p}\right)^{\frac{1}{p}}$

*Frobenius norm* for matrices:

$\|A\|_F = \sqrt{\sum_{i=1}^m\sum_{i=1}^n A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}$

And many more.

A set of vectors $\{x_1, x_2, \dots{}, x_n\} \subset \mathbb{R}^m$ is said to be **(linearly) independent** if no vector can be represented as a linear combination of the remaining vectors.

A set of vectors $\{x_1, x_2, \dots{}, x_n\} \subset \mathbb{R}^m$ is said to be **(lineraly) dependent** if one vector from this set can be represented as a linear combination of the remaining vectors.

For some scalar values $\alpha_1, \dots{}, \alpha_{n-1} \in \mathbb{R}$ the vectors $x_1, \dots{}, x_n$ are linerly dependent, if:

$\begin{equation} x_n = \sum_{i=1}^{n-1}{\alpha_i x_i} \end{equation}$

Example: The following vectors are lineraly dependent, because $x_3 = -2 x_1 + x_2$

$x_1 = \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} \quad x_2 = \begin{bmatrix} 4 \\[0.3em] 1 \\[0.3em] 5 \end{bmatrix} \quad x_3 = \begin{bmatrix} 2 \\[0.3em] -1 \\[0.3em] -1 \end{bmatrix} $

The **column rank** of a matrix $A \in \mathbb{R}^{m\times n}$ is the size of the largest subset of columns of $A$ that constitute a linear independent set. Informaly this is the number of linearly independent columns of $A$.

The **row rank** of a matrix $A \in \mathbb{R}^{m\times n}$ is the largest number of rows of $A$ that constitute a lineraly independent set.

For any matrix $A \in \mathbb{R}^{m\times n}$, the column rank of $A$ is equal to the row rank of $A$. Both quantities are referred to collectively as the rank of $A$, denoted as $rank(A)$. Here are some basic properties of the rank:

- For $A \in \mathbb{R}^{m\times n}$, $rank(A) \leq \min(m, n)$. If $rank(A) = \min(m, n)$, then $A$ is said to be
**full rank**. - For $A \in \mathbb{R}^{m\times n}$, $rank(A) = rank(A^T)$
- For $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times p}$, $rank(AB) \leq \min(rank(A), rank(B))$
- For $A,B \in \mathbb{R}^{m\times n}$, $rank(A + B) \leq rank(A) + rank(B)$

Assume $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{m\times n}$, that is $A$ and $B$ are of the same size, to add $A$ to $B$, or to subtract $B$ from $A$, we add or subtract corresponding entries:

$A + B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\[0.3em] b_{21} & b_{22} & \cdots & b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\[0.3em] a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix} $

The same is applies to subtraction:

$A - B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} - \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\[0.3em] b_{21} & b_{22} & \cdots & b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} - b_{11} & a_{12} - b_{12} & \cdots & a_{1n} - b_{1n} \\[0.3em] a_{21} - b_{21} & a_{22} - b_{22} & \cdots & a_{2n} - b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} - b_{m1} & a_{m2} - b_{m2} & \cdots & a_{mn} - b_{mn} \end{bmatrix} $

In Python using *numpy* this can be achieved using the following code:

In [ ]:

```
import numpy as np
print("np.arange(9):", np.arange(9))
print("np.arange(9, 18):", np.arange(9, 18))
A = np.arange(9, 18).reshape((3, 3))
B = np.arange(9).reshape((3, 3))
print("A:", A)
print("B:", B)
```

The *numpy* function *arange* is similar to the standard Python function *range*. It returns an array with $n$ elements, specified in the one parameter version only. If we provide to parameters to *arange*, it generates an array starting from the value of the first parameter and ending with a value one less than the second parameter. The function *reshape* returns us a matrix with the corresponding number of rows and columns.

We can now add and subtract the two matrices $A$ and $B$:

In [ ]:

```
A + B
```

In [ ]:

```
A - B
```

The **inverse** of a square matrix $A \in \mathbb{R}^{n\times n}$ is $A^{-1}$:

$A^{-1} A = I = A A^{-1}$

Not all matrices have inverses. Non-square matrices do not have inverses by definition. For some square matrices $A$ the inverse might not exist.

$A$ is **invertible** or **non-singular** if $A^{-1}$ exists.

$A$ is **non-invertible** or **singular** if $A^{-1}$ does not exist.

Note: **non-singular** means the opposite of **non-invertible**!

For $A$ to have an inverse $A^{-1}$, $A$ must be **full rank**.

Assuming that $A,B \in \mathbb{R}^{n\times n}$ are non-singular, then:

- $(A^{-1})^{-1} = A$
- $(AB)^{-1} = B^{-1} A^{-1}$
- $(A^{-1})^T = (A^T)^{-1}$ (often simply $A^{-T}$)

Two vectors $x, y \in \mathbb{R}^n$ are **orthogonal** if $x^T y = 0$.

A vector $x \in \mathbb{R}^n$ is **normalized** if $\|x\|^2 = 1$.

A square matrix $U \in \mathbb{R}^{n\times n}$ is **orthogonal** if all its columns are orthogonal to each other and are **normalized**. The columns are then referred to as being **orthonormal**.

It follows immediately from the definition of orthogonality and normality that:

$U^T U = I = U U^T$

This means that the inverse of an orthogonal matrix is its transpose.

If U is not square - i.e., $U \in \mathbb{R}^{m\times n}$, $n < m$ - but its columns are still orthonormal, then $U^T U = I$, but $U U^T \neq I$.

We generally only use the term orthogonal to describe the case, where $U$ is square.

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm. For any $x \in \mathbb{R}^n$, $U \in \mathbb{R}^{n\times n}$ orthogonal.

$\|U_x\|^2 = \|x\|^2$

The **span** of a set of vectors $\{ x_1, x_2, \dots{}, x_n\}$ is the set of all vectors that can be expressed as
a linear combination of $\{ x_1, \dots{}, x_n \}$:

$\mathrm{span}(\{ x_1, \dots{}, x_n \}) = \{ v : v = \sum_{i=1}^n \alpha_i x_i, \alpha_i \in \mathbb{R} \}$

It can be shown that if $\{ x_1, \dots{}, x_n \}$ is a set of n linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{ x_1, \dots{}, x_n\}) = \mathbb{R}^n$. That is, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{ x_1, \dots{}, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{ x_1, \dots{}, x_n \})$, such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v − y\|^2$. We denote the projection as $\mathrm{Proj}(y; \{ x_1, \dots{}, x_n \})$ and can define it formally as:

$\mathrm{Proj}( y; \{ x_1, \dots{}, x_n \}) = \mathrm{argmin}_{v\in \mathrm{span}(\{x_1,\dots{},x_n\})}\|y − v\|^2$

The **range** (sometimes also called the columnspace) of a matrix $A \in \mathbb{R}^{m\times n}$, denoted $\mathcal{R}(A)$, is the the span of the columns of $A$. In other words,

$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = A x, x \in \mathbb{R}^n\}$

Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of $A$ is given by:

$\mathrm{Proj}(y; A) = \mathrm{argmin}_{v\in \mathcal{R}(A)}\|v − y\|^2 = A(A^T A)^{−1} A^T y$

See for more details in the notes page 13.

The **nullspace** of a matrix $A \in \mathbb{R}^{m\times n}$, denoted $\mathcal{N}(A)$ is the set of all vectors that equal $0$ when multiplied by $A$, i.e.,

$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : A x = 0 \}$

Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in the $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that:

$\{ w : w = u + v, u \in \mathcal{R}(A^T), v \in \mathcal{N}(A) \} = \mathbb{R}^n$ and $\mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}$

In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint subsets that together span the entire space of
$\mathbb{R}^n$. Sets of this type are called **orthogonal complements**, and we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^\perp$.

The determinant of a square matrix $A \in \mathbb{R}^{n\times n}$, is a function $\mathrm{det} : \mathbb{R}^{n\times n} \rightarrow \mathbb{R}$, and is denoted $|A|$ or $\mathrm{det}A$ (like the trace operator, we usually omit parentheses).

Given

$\begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_n^T & -- \end{bmatrix}$

consider the set of points $S \subset \mathbb{R}^n$ formed by taking all possible linear combinations of the row vectors $a_1, \dots{}, a_n \in \mathbb{R}^n$ of $A$, where the coefficients of the linear combination are all between $0$ and $1$; that is, the set $S$ is the restriction of $\mathrm{span}( \{ a_1, \dots{}, a_n \})$ to only those linear combinations whose coefficients $\alpha_1, \dots{}, \alpha_n$ satisfy $0 \leq \alpha_i \leq 1$, $i = 1, \dots{}, n$. Formally:

$S = \{v \in \mathbb{R}^n : v = \sum_{i=1}^n \alpha_i a_i \mbox{ where } 0 \leq \alpha_i \leq 1, i = 1, \dots{}, n \}$

The absolute value of the determinant of $A$, it turns out, is a measure of the *volume* of the set $S$. The volume here is intuitively for example for $n = 2$ the area of $S$ in the Cartesian plane, or with $n = 3$ it is the common understanding of *volume* for 3-dimensional objects.

Example:

$A = \begin{bmatrix} 1 & 3\\[0.3em] 3 & 2 \end{bmatrix}$

The rows of the matrix are:

$a_1 = \begin{bmatrix} 1 \\[0.3em] 3 \end{bmatrix} \quad a_2 = \begin{bmatrix} 3 \\[0.3em] 2 \end{bmatrix}$

The set S corresponding to these rows is shown in:

The figure above is an illustration of the determinant for the $2\times 2$ matrix $A$ above. Here, $a_1$ and $a_2$ are vectors corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The absolute value of the determinant, $|\mathrm{det}A| = 7$, is the area of the parallelogram.

For two-dimensional matrices, $S$ generally has the shape of a parallelogram. In our example, the value of the determinant is $|A| = −7$ (as can be computed using the formulas shown later), so the area of the parallelogram is $7$.

In three dimensions, the set $S$ corresponds to an object known as a parallelepiped (a three-dimensional box with skewed sides, such that every face has the shape of a parallelogram). The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ give the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional parallelotope.

Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the general formula):

- The determinant of the identity is $1$, $|I| = 1$. (Geometrically, the volume of a unit hypercube is $1$).
- Given a matrix $A \in \mathbb{R}^{n\times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$,

$\left| \begin{bmatrix} -- & t a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}\right| = t|A|$

(Geometrically, multiplying one of the sides of the set $S$ by a factor $t$ causes the volume to increase by a factor $t$.) - If we exchange any two rows $a^T_i$ and $a^T_j$ of $A$, then the determinant of the new matrix is $−|A|$, for example

$\left| \begin{bmatrix} -- & a_2^T & -- \\[0.3em] -- & a_1^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}\right| = -|A|$

Several properties that follow from the three properties above include:

- For $A \in \mathbb{R}^{n\times n}$, $|A| = |A^T|$
- For $A,B \in \mathbb{R}^{n\times n}$, $|AB| = |A||B|$
- For $A \in \mathbb{R}^{n\times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible). (If $A$ is singular then it does not have full rank, and hence its columns are linearly dependent. In this case, the set $S$ corresponds to a "flat sheet" within the $n$-dimensional space and hence has zero volume.)
- For $A \in \mathbb{R}^{n\times n}$ and $A$ non-singular, $|A−1| = 1/|A|$

Continue on page 16.

A **tensor** could be thought of as an organized multidimensional array of numerical values. A vector could be assumed to be a sub-class of a tensor. Rows of tensors extend alone the y-axis, columns along the x-axis. The **rank** of a scalar is 0, the rank of a **vector** is 1, the rank of a **matrix** is 2, the rank of a **tensor** is 3 or higher.

The **hyperplane** is a sub-space in the ambient space with one dimension less. In a two-dimensional space the hyperplane is a line, in a three-dimensional space it is a two-dimensional plane, etc.

Hyperplanes divide an $n$-dimensional space into sub-spaces that might represent clases in a machine learning algorithm.

This is also the *inner product*. It is a function that returns a number computed from two vectors of the same length by summing up the product of the corresponding dimensions.

For two vectors $a = [a_1, a_2, \dots{}, a_n]$ and $b = [b_1, b_2, \dots{}, b_n]$ the dot product is:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_{i} b_{i} = a_{1} b_{1} + a_{2} b_{2} + \cdots + a_{n} b_{n}$

If we normalize two vectors and compute the dot product, we get the *cosine similarity*, which can be used as a metric for cimilarity of vectors. Independent of the absolute length we look at the angle between the vectors, i.e. the lenght is neutralized via normalization.

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula (see Wikipedia: Cosine similarity):

$\mathbf{a} \cdot \mathbf{b} = \left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\| \cos\theta$

Given two vectors of attributes, $A$ and $B$, the cosine similarity, $cos(\theta)$, is represented using a dot product and magnitude as:

$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{ \|\mathbf{A} \|\|\mathbf{B} \| } = \frac{\sum \limits_{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}$, with $A_i$ and $B_i$ components of vector $A$ and $B$ respectively.

This is also known as the **entrywise product**. For two matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{m\times n}$ the Hadamard product $A\circ B$ is:

$(A\circ B)_{i,j} = (A)_{i,j} (B)_{i,j}$

For example:

$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\[0.3em] a_{21} & a_{22} & a_{23} \\[0.3em] a_{31} & a_{32} & a_{33} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} & b_{13} \\[0.3em] b_{21} & b_{22} & b_{23} \\[0.3em] b_{31} & b_{32} & b_{33} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{12}b_{12} & a_{13}b_{13} \\[0.3em] a_{21}b_{21} & a_{22}b_{22} & a_{23}b_{23} \\[0.3em] a_{31}b_{31} & a_{32}b_{32} & a_{33}b_{33} \end{bmatrix}$

This is also called the **tensor product** of two vectors. Compute the resulting matrix by multiplying each element from a column vector with all alements in a row vector.

...

**(C) 2018-2019 by Damir Cavar**