
Linear Algebra


Basics

The first two lectures are review and an introduction to vectors, matrices, and their algebraic operations. This is material you learned in your Discrete Structures class, which you can review here.

Matrix Algebra

Once we have objects that represent the information we want and can perform operations on, the next logical question is: can we build an algebra? At this level, algebra involves identifying the rules we can apply to manipulate equations and solve for unknown variables.

The Rules of Matrix Algebra

While matrix algebra feels similar to the math we learn in grade school, there are some critical distinctions—particularly regarding commutativity and division.


Solving Without Division: Inverses and Identities

Without a division operation, we solve for variables by using inverses and identities. In regular arithmetic, the multiplicative inverse of $n$ is $1/n$. When you multiply them, you get the multiplicative identity, 1.

To do this with matrices, we must first define the Identity Matrix, $\mathbf{I}$. For a square matrix, the identity is a matrix of all zeros with ones along the main diagonal:

$$
\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
$$

Multiplying any matrix by the identity returns the original matrix: $\mathbf{M}\mathbf{I} = \mathbf{I}\mathbf{M} = \mathbf{M}$. Therefore, the inverse of a matrix, $\mathbf{M}^{-1}$, is a matrix such that their product results in the identity:

$$
\mathbf{M}\mathbf{M}^{-1} = \mathbf{I} \quad \text{and} \quad \mathbf{M}^{-1}\mathbf{M} = \mathbf{I}
$$

While there are simple formulas for the inverse of small matrices ($2 \times 2$), larger matrices are much more difficult to invert and typically require Gaussian Elimination.
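As a quick illustration, a library routine can compute an inverse, and we can confirm the defining property $\mathbf{M}\mathbf{M}^{-1} = \mathbf{I}$. This is a NumPy sketch with an example matrix of our own choosing, not one from the notes:

```python
import numpy as np

# A small invertible matrix (illustrative values, not from the notes).
M = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# NumPy computes the inverse via an LU factorization, a relative of
# the Gaussian Elimination mentioned above.
M_inv = np.linalg.inv(M)

# Multiplying a matrix by its inverse recovers the identity,
# up to floating-point rounding.
assert np.allclose(M @ M_inv, np.eye(2))
assert np.allclose(M_inv @ M, np.eye(2))
```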


Non-Square Matrices and the Transpose

If a matrix is not square, it doesn’t have a standard inverse. To handle this, we “make it square” using the transpose ($\mathbf{M}^T$). To transpose a matrix, you switch the rows and columns:

If $\mathbf{M} = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix}$, then $\mathbf{M}^T = \begin{pmatrix} a & c & e \\ b & d & f \end{pmatrix}$.

The product of a matrix and its transpose is always a square matrix. If that square matrix is invertible (for $\mathbf{M}^T\mathbf{M}$, this requires the columns of $\mathbf{M}$ to be linearly independent), taking its inverse lets us calculate what is known as the pseudo-inverse.
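For instance, a minimal NumPy sketch (the $3 \times 2$ matrix is an arbitrary example of ours, not from the notes):

```python
import numpy as np

# A 3x2 (non-square) matrix -- illustrative values only.
M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# M^T M is 2x2 and square; here it is invertible because M's
# columns are linearly independent.
square = M.T @ M

# The (left) pseudo-inverse: (M^T M)^{-1} M^T.
M_pinv = np.linalg.inv(square) @ M.T

# It acts as an inverse from the left: M_pinv @ M = I.
assert np.allclose(M_pinv @ M, np.eye(2))

# NumPy's built-in pseudo-inverse (computed via SVD) agrees here.
assert np.allclose(M_pinv, np.linalg.pinv(M))
```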


Putting it Together: Solving an Equation

Now we can perform algebra to solve for a vector $\mathbf{x}$ in the linear system $\mathbf{M}\mathbf{x} = \mathbf{y}$:

$$
\begin{aligned}
\mathbf{M}\mathbf{x} &= \mathbf{y} \\
\mathbf{M}^T\mathbf{M}\mathbf{x} &= \mathbf{M}^T\mathbf{y} \\
(\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{M}\mathbf{x} &= (\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{y} \\
\mathbf{x} &= (\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{y}
\end{aligned}
$$

Gauss-Jordan Elimination

The Gauss-Jordan algorithm is the typical method we use to find the inverse of a square matrix. It is a variation of Gaussian Elimination. We manipulate our matrix using these three rules:

  1. Exchange two rows in the matrix.

  2. Multiply a row by a non-zero constant.

  3. Add or subtract a scalar multiple of one row to another row.

The Algorithm

  1. Create the partitioned matrix $(\mathbf{M}\,|\,\mathbf{I})$, where $\mathbf{I}$ is the identity matrix.

  2. Swap the rows so that the row with the largest, leftmost nonzero entry is on top.

  3. Multiply the top row by a scalar so that the top row’s leading entry becomes 1.

  4. Add/subtract multiples of the top row to the other rows so that all other entries in the column containing the top row’s leading entry are all zero.

  5. Repeat steps 2-4 for the next leftmost nonzero entry until all the leading entries are 1.
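The steps above can be sketched directly in NumPy (a sketch of ours, not part of the original notes; the function name is our own):

```python
import numpy as np

def gauss_jordan_inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination."""
    n = M.shape[0]
    # Step 1: form the partitioned matrix (M | I).
    A = np.hstack([M.astype(float), np.eye(n)])
    for col in range(n):
        # Step 2: swap up the row with the largest entry in this
        # column (partial pivoting, also good for numerical stability).
        pivot = col + np.argmax(np.abs(A[col:, col]))
        if np.isclose(A[pivot, col], 0.0):
            raise ValueError("matrix is singular")
        A[[col, pivot]] = A[[pivot, col]]
        # Step 3: scale the pivot row so its leading entry is 1.
        A[col] /= A[col, col]
        # Step 4: clear the rest of the column with row operations.
        for row in range(n):
            if row != col:
                A[row] -= A[row, col] * A[col]
    # Step 5 done: the left half is now I, the right half is M^{-1}.
    return A[:, n:]
```

Applying it to the matrix inverted by hand in the example below reproduces the same result.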


Example

Let’s invert the following matrix:

$$
\left(\begin{array}{ccc} 2 & 0 & 3 \\ -1 & 3 & -4 \\ -3 & 1 & -4 \end{array}\right)^{-1}
$$
  1. Create the partitioned matrix:

    $$\left(\begin{array}{ccc|ccc} 2 & 0 & 3 & 1 & 0 & 0 \\ -1 & 3 & -4 & 0 & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  2. The top-left entry is already nonzero, so we skip the swap. (Strictly, row 3 has the largest first-column entry in magnitude; pivoting on it matters for numerical stability but is unnecessary when working by hand.)

  3. Multiply row 1 by $\frac{1}{2}$ to get a 1 in the upper left:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ -1 & 3 & -4 & 0 & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  4. Get zeros in the rest of the first column. Since row 2 has a -1, we replace row 2 with (Row 1 + Row 2):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 3 & -\frac{5}{2} & \frac{1}{2} & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  5. Get a zero in the last row. Replace row 3 with (Row 3 + 3 $\times$ Row 1):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 3 & -\frac{5}{2} & \frac{1}{2} & 1 & 0 \\ 0 & 1 & \frac{1}{2} & \frac{3}{2} & 0 & 1 \end{array}\right)$$
  6. Move to the second column. We want a 1 where the 3 is, so we divide the second row by 3:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 1 & \frac{1}{2} & \frac{3}{2} & 0 & 1 \end{array}\right)$$
  7. The first row already has a 0 in this column. Replace row 3 with (Row 3 - Row 2):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 0 & \frac{4}{3} & \frac{4}{3} & -\frac{1}{3} & 1 \end{array}\right)$$
  8. Move to the third column. Multiply row 3 by $\frac{3}{4}$ so its leading entry becomes 1:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$
  9. Replace row 2 with (Row 2 + $\frac{5}{6}$ $\times$ Row 3):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & 0 & 1 & \frac{1}{8} & \frac{5}{8} \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$
  10. Final step: Replace row 1 with (Row 1 - $\frac{3}{2}$ $\times$ Row 3):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & 0 & -1 & \frac{3}{8} & -\frac{9}{8} \\ 0 & 1 & 0 & 1 & \frac{1}{8} & \frac{5}{8} \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$

And we’re done. The inverted matrix is:

$$
\left(\begin{array}{ccc} 2 & 0 & 3 \\ -1 & 3 & -4 \\ -3 & 1 & -4 \end{array}\right)^{-1} = \begin{pmatrix} -1 & \frac{3}{8} & -\frac{9}{8} \\ 1 & \frac{1}{8} & \frac{5}{8} \\ 1 & -\frac{1}{4} & \frac{3}{4} \end{pmatrix}
$$
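The hand computation can be double-checked numerically (a NumPy sketch, using the matrix from the example above):

```python
import numpy as np

# The matrix from the worked example.
M = np.array([[ 2.0, 0.0,  3.0],
              [-1.0, 3.0, -4.0],
              [-3.0, 1.0, -4.0]])

# The inverse computed by hand above.
M_inv = np.array([[-1.0,  3/8, -9/8],
                  [ 1.0,  1/8,  5/8],
                  [ 1.0, -1/4,  3/4]])

# Both products give the identity, confirming the hand computation,
# and NumPy's own inverse matches entry for entry.
assert np.allclose(M @ M_inv, np.eye(3))
assert np.allclose(M_inv @ M, np.eye(3))
assert np.allclose(np.linalg.inv(M), M_inv)
```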

Complexity Note

For an $n \times n$ matrix, the algorithm operates by fixing each of the $n$ columns. For each column, it operates on $n$ rows, and for each row, it performs operations across the entire row. Therefore, Gauss-Jordan Elimination has a computational complexity of $O(n^3)$.


Linear Regression

We have seen how perceptrons classify data using a line (or in higher dimensions, a hyperplane) to separate in-category data from out-category data. Classification is a special machine learning task where the outputs are binary. However, we often want to make predictions where our outputs are real values.

Example: If you are a bank and someone applies for a line of credit, you don’t just want to decide approve or deny (a classification); you want to predict the size of the credit line to offer, which is a real number.

When the task of the system is to output a real number that matches the data, it is called regression. Regression techniques were initially developed in the 19th century but are now understood as a core part of the ML landscape.

As with earlier models, our training data is a collection of input-output pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where the $y$ values are now real numbers. Our goal is to find a function $h(x)$ (our hypothesis) that takes an $x$ and outputs the appropriate $y$.

Fundamental Assumptions

  1. The function we are trying to find is linear.

  2. The data is noisy, meaning it is impossible to find an exact match function—there will always be some error.

Defining the Error

First, let’s call the function we develop $h(x)$ for “hypothesis” (since $f$ was already used for our feature function). Our error will be defined as:

$$
E(h) = \frac{1}{N}\sum_{n=1}^{N}\bigl(h(x_n)-y_n\bigr)^2
$$

Since $h$ is linear, it is just some constants (weights) times each of the input elements:

$$
h(x) = \sum_{i=0}^{d}w_ix_i = w^Tx
$$

Note: We start $i$ at 0 for the bias term ($x_0 = 1$). The reason $w$ is transposed is that both $w$ and $x$ are treated as column vectors.

The Matrix Approach

Next, we’ll treat all the input as a single giant matrix $X$ (by transposing and stacking each input):

$$
X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{pmatrix}
$$

Now, we can re-write the error function as:

$$
\begin{aligned}
E(w) &= \frac{1}{N}\sum_{n=1}^{N}(w^Tx_n-y_n)^2 \\
&= \frac{1}{N}\|Xw-y\|^2 \\
&= \frac{1}{N}\left(w^TX^TXw - 2w^TX^Ty + y^Ty\right)
\end{aligned}
$$

The last step above is simply the result of multiplying the expression out.
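As a sanity check, the three forms of the error can be compared numerically. This is a NumPy sketch on random data of our own, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))   # one row per training example
y = rng.normal(size=N)
w = rng.normal(size=d)

# Form 1: the explicit sum over examples.
E_sum = np.mean([(w @ X[n] - y[n])**2 for n in range(N)])

# Form 2: the squared norm of the residual vector.
E_norm = np.linalg.norm(X @ w - y)**2 / N

# Form 3: the fully expanded quadratic.
E_quad = (w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y) / N

# All three agree up to floating-point rounding.
assert np.allclose(E_sum, E_norm)
assert np.allclose(E_norm, E_quad)
```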

Finding the Minimum Error

What we want to do is find the weights that minimize that error: $w_{\min} = \arg\min_{w} E(w)$.

To do this, we do what we always do: take the derivative and set it equal to 0 to find the critical points. We write this in vector calculus as $\nabla E(w)$, which is defined as:

$$
[\nabla E(w)]_i = \frac{\partial}{\partial w_i}E(w)
$$

Useful vector calculus identities (for a constant vector $b$ and a symmetric matrix $A$, such as $X^TX$):

  - $\nabla_w(w^Tb) = b$

  - $\nabla_w(w^TAw) = 2Aw$

  - $\nabla_w(y^Ty) = 0$ (it does not depend on $w$)

Using these facts, we can determine the gradient:

$$
\nabla E(w) = \frac{2}{N}\left(X^TXw - X^Ty\right)
$$
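One way to verify this formula is against a finite-difference approximation of the gradient (a NumPy sketch on random data of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 40, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

def E(w):
    # The error in its squared-norm form.
    return np.linalg.norm(X @ w - y)**2 / N

# The analytic gradient derived above.
grad = 2 / N * (X.T @ X @ w - X.T @ y)

# Numerical gradient via central differences, one coordinate at a time.
eps = 1e-6
num_grad = np.array([
    (E(w + eps * np.eye(d)[i]) - E(w - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

assert np.allclose(grad, num_grad, atol=1e-5)
```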

The Solution (The Algorithm)

Since $\frac{2}{N}$ will never be 0, we only need to worry about the terms in the parentheses:

$$
\begin{aligned}
X^TXw - X^Ty &= 0 \\
X^TXw &= X^Ty
\end{aligned}
$$

Now, we simply invert $X^TX$ and multiply:

$$
w = (X^TX)^{-1}X^Ty
$$

And that last line is literally the whole algorithm.
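A minimal end-to-end sketch in NumPy (synthetic data of our own choosing, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Synthetic noisy data from a known line: y = 3x + 1 + noise.
x = rng.uniform(-5, 5, size=N)
y = 3 * x + 1 + rng.normal(scale=0.5, size=N)

# Design matrix with a column of ones for the bias term (x_0 = 1).
X = np.column_stack([np.ones(N), x])

# The whole algorithm: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# w[0] should land near the true bias 1, w[1] near the slope 3.
```

In practice, `np.linalg.lstsq(X, y, rcond=None)`, or solving the normal equations with `np.linalg.solve`, is preferred over forming the explicit inverse, for numerical stability.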