
Linear Algebra


Basics

The first two lectures are review and an introduction to vectors, matrices, and their algebraic operations. This is material you learned in your Discrete Structures class, which you can review here.

Matrix Algebra

Once we have objects that represent the information we want and can perform operations on, the next logical question is: can we build an algebra? At this level, algebra involves identifying the rules we can apply to manipulate equations and solve for unknown variables.

The Rules of Matrix Algebra

While matrix algebra feels similar to the math we learn in grade school, there are some critical distinctions—particularly regarding commutativity and division.


Solving Without Division: Inverses and Identities

Without a division operation, we solve for variables by using inverses and identities. In regular arithmetic, the multiplicative inverse of $n$ is $1/n$. When you multiply them, you get the multiplicative identity, 1.

To do this with matrices, we must first define the Identity Matrix, $\mathbf{I}$. For a square matrix, the identity is a matrix of all zeros with ones along the main diagonal:

$$
\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
$$

Multiplying any matrix by the identity returns the original matrix: $\mathbf{M}\mathbf{I} = \mathbf{I}\mathbf{M} = \mathbf{M}$. Therefore, the inverse of a matrix, $\mathbf{M}^{-1}$, is a matrix such that their product results in the identity:

$$
\mathbf{M}\mathbf{M}^{-1} = \mathbf{I} \quad \text{and} \quad \mathbf{M}^{-1}\mathbf{M} = \mathbf{I}
$$

While there are simple formulas for the inverse of small matrices ($2 \times 2$), larger matrices are much more difficult to invert and typically require Gaussian Elimination.
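As a quick illustration, a library routine can compute an inverse, and we can confirm the defining property $\mathbf{M}\mathbf{M}^{-1} = \mathbf{I}$. This is a NumPy sketch with an example matrix of our own choosing, not one from the notes:

```python
import numpy as np

# A small invertible matrix (illustrative values, not from the notes).
M = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# NumPy computes the inverse via an LU factorization, a relative of
# the Gaussian Elimination mentioned above.
M_inv = np.linalg.inv(M)

# Multiplying a matrix by its inverse recovers the identity,
# up to floating-point rounding.
assert np.allclose(M @ M_inv, np.eye(2))
assert np.allclose(M_inv @ M, np.eye(2))
```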


Non-Square Matrices and the Transpose

If a matrix is not square, it doesn’t have a standard inverse. To handle this, we “make it square” using the transpose ($\mathbf{M}^T$). To transpose a matrix, you switch the rows and columns:

If $\mathbf{M} = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix}$, then $\mathbf{M}^T = \begin{pmatrix} a & c & e \\ b & d & f \end{pmatrix}$.

The product of a matrix and its transpose is always a square matrix. If that square matrix is invertible (for $\mathbf{M}^T\mathbf{M}$, this requires the columns of $\mathbf{M}$ to be linearly independent), taking its inverse lets us calculate what is known as the pseudo-inverse.
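For instance, a minimal NumPy sketch (the $3 \times 2$ matrix is an arbitrary example of ours, not from the notes):

```python
import numpy as np

# A 3x2 (non-square) matrix -- illustrative values only.
M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# M^T M is 2x2 and square; here it is invertible because M's
# columns are linearly independent.
square = M.T @ M

# The (left) pseudo-inverse: (M^T M)^{-1} M^T.
M_pinv = np.linalg.inv(square) @ M.T

# It acts as an inverse from the left: M_pinv @ M = I.
assert np.allclose(M_pinv @ M, np.eye(2))

# NumPy's built-in pseudo-inverse (computed via SVD) agrees here.
assert np.allclose(M_pinv, np.linalg.pinv(M))
```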


Putting it Together: Solving an Equation

Now we can perform algebra to solve for a vector $\mathbf{x}$ in the linear system $\mathbf{M}\mathbf{x} = \mathbf{y}$:

$$
\begin{aligned}
\mathbf{M}\mathbf{x} &= \mathbf{y} \\
\mathbf{M}^T\mathbf{M}\mathbf{x} &= \mathbf{M}^T\mathbf{y} \\
(\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{M}\mathbf{x} &= (\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{y} \\
\mathbf{x} &= (\mathbf{M}^T\mathbf{M})^{-1}\mathbf{M}^T\mathbf{y}
\end{aligned}
$$

Gauss-Jordan Elimination

The Gauss-Jordan algorithm is the typical method we use to find the inverse of a square matrix. It is a variation of Gaussian Elimination. We manipulate our matrix using these three rules:

  1. Exchange two rows in the matrix.

  2. Multiply a row by a non-zero constant.

  3. Add or subtract a scalar multiple of one row to another row.

The Algorithm

  1. Create the partitioned matrix $(\mathbf{M}\,|\,\mathbf{I})$, where $\mathbf{I}$ is the identity matrix.

  2. Swap the rows so that the row with the largest, leftmost nonzero entry is on top.

  3. Multiply the top row by a scalar so that the top row’s leading entry becomes 1.

  4. Add/subtract multiples of the top row to the other rows so that all other entries in the column containing the top row’s leading entry are all zero.

  5. Repeat steps 2-4 for the next leftmost nonzero entry until all the leading entries are 1.
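The steps above can be sketched directly in NumPy (a sketch of ours, not part of the original notes; the function name is our own):

```python
import numpy as np

def gauss_jordan_inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination."""
    n = M.shape[0]
    # Step 1: form the partitioned matrix (M | I).
    A = np.hstack([M.astype(float), np.eye(n)])
    for col in range(n):
        # Step 2: swap up the row with the largest entry in this
        # column (partial pivoting, also good for numerical stability).
        pivot = col + np.argmax(np.abs(A[col:, col]))
        if np.isclose(A[pivot, col], 0.0):
            raise ValueError("matrix is singular")
        A[[col, pivot]] = A[[pivot, col]]
        # Step 3: scale the pivot row so its leading entry is 1.
        A[col] /= A[col, col]
        # Step 4: clear the rest of the column with row operations.
        for row in range(n):
            if row != col:
                A[row] -= A[row, col] * A[col]
    # Step 5 done: the left half is now I, the right half is M^{-1}.
    return A[:, n:]
```

Applying it to the matrix inverted by hand in the example below reproduces the same result.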


Example

Let’s invert the following matrix:

$$
\left(\begin{array}{ccc} 2 & 0 & 3 \\ -1 & 3 & -4 \\ -3 & 1 & -4 \end{array}\right)^{-1}
$$
  1. Create the partitioned matrix:

    $$\left(\begin{array}{ccc|ccc} 2 & 0 & 3 & 1 & 0 & 0 \\ -1 & 3 & -4 & 0 & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  2. The top-left entry is already nonzero, so we skip the swap. (Strictly, row 3 has the largest first-column entry in magnitude; pivoting on it matters for numerical stability but is unnecessary when working by hand.)

  3. Multiply row 1 by $\frac{1}{2}$ to get a 1 in the upper left:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ -1 & 3 & -4 & 0 & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  4. Get zeros in the rest of the first column. Since row 2 has a -1, we replace row 2 with (Row 1 + Row 2):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 3 & -\frac{5}{2} & \frac{1}{2} & 1 & 0 \\ -3 & 1 & -4 & 0 & 0 & 1 \end{array}\right)$$
  5. Get a zero in the last row. Replace row 3 with (Row 3 + 3 $\times$ Row 1):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 3 & -\frac{5}{2} & \frac{1}{2} & 1 & 0 \\ 0 & 1 & \frac{1}{2} & \frac{3}{2} & 0 & 1 \end{array}\right)$$
  6. Move to the second column. We want a 1 where the 3 is, so we divide the second row by 3:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 1 & \frac{1}{2} & \frac{3}{2} & 0 & 1 \end{array}\right)$$
  7. The first row already has a 0 in this column. Replace row 3 with (Row 3 - Row 2):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 0 & \frac{4}{3} & \frac{4}{3} & -\frac{1}{3} & 1 \end{array}\right)$$
  8. Move to the third column. Multiply row 3 by $\frac{3}{4}$ so its leading entry becomes 1:

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & -\frac{5}{6} & \frac{1}{6} & \frac{1}{3} & 0 \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$
  9. Replace row 2 with (Row 2 + $\frac{5}{6}$ $\times$ Row 3):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & \frac{3}{2} & \frac{1}{2} & 0 & 0 \\ 0 & 1 & 0 & 1 & \frac{1}{8} & \frac{5}{8} \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$
  10. Final step: Replace row 1 with (Row 1 - $\frac{3}{2}$ $\times$ Row 3):

    $$\left(\begin{array}{ccc|ccc} 1 & 0 & 0 & -1 & \frac{3}{8} & -\frac{9}{8} \\ 0 & 1 & 0 & 1 & \frac{1}{8} & \frac{5}{8} \\ 0 & 0 & 1 & 1 & -\frac{1}{4} & \frac{3}{4} \end{array}\right)$$

And we’re done. The inverted matrix is:

$$
\left(\begin{array}{ccc} 2 & 0 & 3 \\ -1 & 3 & -4 \\ -3 & 1 & -4 \end{array}\right)^{-1} = \begin{pmatrix} -1 & \frac{3}{8} & -\frac{9}{8} \\ 1 & \frac{1}{8} & \frac{5}{8} \\ 1 & -\frac{1}{4} & \frac{3}{4} \end{pmatrix}
$$
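The hand computation can be double-checked numerically (a NumPy sketch, using the matrix from the example above):

```python
import numpy as np

# The matrix from the worked example.
M = np.array([[ 2.0, 0.0,  3.0],
              [-1.0, 3.0, -4.0],
              [-3.0, 1.0, -4.0]])

# The inverse computed by hand above.
M_inv = np.array([[-1.0,  3/8, -9/8],
                  [ 1.0,  1/8,  5/8],
                  [ 1.0, -1/4,  3/4]])

# Both products give the identity, confirming the hand computation,
# and NumPy's own inverse matches entry for entry.
assert np.allclose(M @ M_inv, np.eye(3))
assert np.allclose(M_inv @ M, np.eye(3))
assert np.allclose(np.linalg.inv(M), M_inv)
```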

Complexity Note

For an $n \times n$ matrix, the algorithm operates by fixing each of the $n$ columns. For each column, it operates on $n$ rows, and for each row, it performs operations across the entire row. Therefore, Gauss-Jordan Elimination has a computational complexity of $O(n^3)$.


Linear Regression

We have seen how perceptrons classify data using a line (or in higher dimensions, a hyperplane) to separate in-category data from out-category data. Classification is a special machine learning task where the outputs are binary. However, we often want to make predictions where our outputs are real values.

Example: If you are a bank and someone applies for a line of credit, you don’t just want to decide approve or deny (a classification); you want to predict the size of the credit line to offer, which is a real number.

When the task of the system is to output a real number that matches the data, it is called regression. Regression techniques were initially developed in the 19th century but are now understood as a core part of the ML landscape.

As with earlier models, our training data is a collection of input-output pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where the $y$ values are now real numbers. Our goal is to find a function $h(x)$ (our hypothesis) that takes an $x$ and outputs the appropriate $y$.

Fundamental Assumptions

  1. The function we are trying to find is linear.

  2. The data is noisy, meaning it is impossible to find an exact match function—there will always be some error.

Defining the Error

First, let’s call the function we develop $h(x)$ for “hypothesis” (since $f$ was already used for our feature function). Our error will be defined as:

$$
E(h) = \frac{1}{N}\sum_{n=1}^{N}\bigl(h(x_n)-y_n\bigr)^2
$$

Since $h$ is linear, it is just some constants (weights) times each of the input elements:

$$
h(x) = \sum_{i=0}^{d}w_ix_i = w^Tx
$$

Note: We start $i$ at 0 for the bias term ($x_0 = 1$). The reason $w$ is transposed is that both $w$ and $x$ are treated as column vectors.

The Matrix Approach

Next, we’ll treat all the input as a single giant matrix $X$ (by transposing and stacking each input):

$$
X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{pmatrix}
$$

Now, we can re-write the error function as:

$$
\begin{aligned}
E(w) &= \frac{1}{N}\sum_{n=1}^{N}(w^Tx_n-y_n)^2 \\
&= \frac{1}{N}\|Xw-y\|^2 \\
&= \frac{1}{N}\left(w^TX^TXw - 2w^TX^Ty + y^Ty\right)
\end{aligned}
$$

The last step above is simply the result of multiplying the expression out.
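As a sanity check, the three forms of the error can be compared numerically. This is a NumPy sketch on random data of our own, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))   # one row per training example
y = rng.normal(size=N)
w = rng.normal(size=d)

# Form 1: the explicit sum over examples.
E_sum = np.mean([(w @ X[n] - y[n])**2 for n in range(N)])

# Form 2: the squared norm of the residual vector.
E_norm = np.linalg.norm(X @ w - y)**2 / N

# Form 3: the fully expanded quadratic.
E_quad = (w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y) / N

# All three agree up to floating-point rounding.
assert np.allclose(E_sum, E_norm)
assert np.allclose(E_norm, E_quad)
```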

Finding the Minimum Error

What we want to do is find the weights that minimize that error: $w_{\min} = \arg\min_{w} E(w)$.

To do this, we do what we always do: take the derivative and set it equal to 0 to find the critical points. We write this in vector calculus as $\nabla E(w)$, which is defined as:

$$
[\nabla E(w)]_i = \frac{\partial}{\partial w_i}E(w)
$$

Useful vector calculus identities (for a constant vector $b$ and a symmetric matrix $A$, such as $X^TX$):

  - $\nabla_w(w^Tb) = b$

  - $\nabla_w(w^TAw) = 2Aw$

  - $\nabla_w(y^Ty) = 0$ (it does not depend on $w$)

Using these facts, we can determine the gradient:

$$
\nabla E(w) = \frac{2}{N}\left(X^TXw - X^Ty\right)
$$
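One way to verify this formula is against a finite-difference approximation of the gradient (a NumPy sketch on random data of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 40, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

def E(w):
    # The error in its squared-norm form.
    return np.linalg.norm(X @ w - y)**2 / N

# The analytic gradient derived above.
grad = 2 / N * (X.T @ X @ w - X.T @ y)

# Numerical gradient via central differences, one coordinate at a time.
eps = 1e-6
num_grad = np.array([
    (E(w + eps * np.eye(d)[i]) - E(w - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

assert np.allclose(grad, num_grad, atol=1e-5)
```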

The Solution (The Algorithm)

Since $\frac{2}{N}$ will never be 0, we only need to worry about the terms in the parentheses:

$$
\begin{aligned}
X^TXw - X^Ty &= 0 \\
X^TXw &= X^Ty
\end{aligned}
$$

Now, we simply invert $X^TX$ and multiply:

$$
w = (X^TX)^{-1}X^Ty
$$

And that last line is literally the whole algorithm.
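A minimal end-to-end sketch in NumPy (synthetic data of our own choosing, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Synthetic noisy data from a known line: y = 3x + 1 + noise.
x = rng.uniform(-5, 5, size=N)
y = 3 * x + 1 + rng.normal(scale=0.5, size=N)

# Design matrix with a column of ones for the bias term (x_0 = 1).
X = np.column_stack([np.ones(N), x])

# The whole algorithm: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# w[0] should land near the true bias 1, w[1] near the slope 3.
```

In practice, `np.linalg.lstsq(X, y, rcond=None)`, or solving the normal equations with `np.linalg.solve`, is preferred over forming the explicit inverse, for numerical stability.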