AI & ML

Principal Components in TypeScript (Part 2): How PCA Actually Works Under the Hood

Apr 21, 2026 · 5 min read

This is not a code blog; there’s no easy way to copy and paste from here. Instead, please review the source code directly if that’s what you need.

Our first step is to normalize data; we do this by subtracting the mean of each column. An elegant way to do this is to create a unit square matrix (all ones): multiplying it by the data and scaling by 1 / (number of rows) gives, in every row, the sum of each column divided by the number of rows, i.e. the column means. Subtracting that from the data gives what is formally known as the Deviation Matrix.
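Here's a tiny sketch of why the trick works, assuming unitSquareMatrix produces an n×n matrix of ones (as the usage below implies). The helper name here is my own, not the library's:

```typescript
// An n×n matrix of ones times an n×d data matrix puts the column sums in
// every row; scaling by 1/n turns each row into the column means.
// (This sketch shortcuts the full multiply, since every row is identical.)
function onesTimesScaled(data: number[][], factor: number): number[][] {
  const n = data.length;
  const colSums = data[0].map((_, j) => data.reduce((s, row) => s + row[j], 0));
  return Array.from({ length: n }, () => colSums.map(s => s * factor));
}

console.log(onesTimesScaled([[1, 2], [3, 4]], 1 / 2)); // → [[2, 3], [2, 3]]
```

Subtracting this mean matrix from the data element-wise is exactly the Deviation Matrix described above.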

Look at this chonky business:

```typescript
let unit = unitSquareMatrix(matrix.length);
let deviationMatrix = subtract(matrix, multiplyAndScale(unit, matrix, 1 / matrix.length));
const D = deviationMatrix;

// multiplyAndScale basically just does a matrix multiplication, with the
// scaling factor used to calculate the mean.
// The snippet below is taken directly from the library:

/**
 * Fix for #11, OOM on moderately large datasets; fuses scale and multiply
 * into a single operation to save memory.
 *
 * @param {Matrix} a
 * @param {Matrix} b
 * @param {number} factor
 * @returns {Matrix}
 */
export function multiplyAndScale(a: Matrix, b: Matrix, factor: number): Matrix {
  assertValidMatrices(a, b, "a", "b");
  const aRows = a.length;
  const aCols = a[0].length;
  const bCols = b[0].length;
  const flat = new Float64Array(aRows * bCols);
  for (let i = 0; i < aRows; i++) {
    for (let k = 0; k < aCols; k++) {
      const aVal = a[i][k] * factor;
      const iOffset = i * bCols;
      for (let j = 0; j < bCols; j++) {
        flat[iOffset + j] += aVal * b[k][j];
      }
    }
  }
  const result: Matrix = [];
  for (let i = 0; i < aRows; i++) {
    result[i] = Array.from(flat.subarray(i * bCols, (i + 1) * bCols));
  }
  return result;
}
```

Is this really fixed, though? Will we continue to OOM? Well, the 3 nested for loops basically mean O(n^3) worst case, and while that cost is inherent to a dense matrix multiply, for mean-centering specifically we can go waay simpler by just computing the column means directly. Something like this → `const means = matrix[0].map((_, i) => matrix.reduce((s, r) => s + r[i], 0) / matrix.length); const centered = matrix.map(row => row.map((v, i) => v - means[i]));`
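As a runnable sanity check, here's that direct approach as a self-contained function operating on a plain number[][] (meanCenter is my own name, not the library's API):

```typescript
// Mean-center a matrix: subtract each column's mean from every entry.
// Assumes a rectangular, non-empty number[][].
function meanCenter(matrix: number[][]): number[][] {
  const n = matrix.length;
  const means = matrix[0].map((_, j) => matrix.reduce((s, row) => s + row[j], 0) / n);
  return matrix.map(row => row.map((v, j) => v - means[j]));
}

// The exam scores used in the worked example further down.
const data = [
  [40, 50, 60],
  [50, 70, 60],
  [80, 70, 90],
  [50, 60, 80],
];
console.log(meanCenter(data)[0]); // → [-15, -12.5, -12.5]
```

Two passes over the data, no n×n matrix of ones allocated, and no OOM to fix.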

The library skews toward elegance instead of optimization, and assumes that you would eventually be doing this on the GPU (which would make this sort of code a lot faster for high-volume data, with simple but parallel ops).

Let’s move on to the next step → Deviation Scores, which is nothing but D^T @ D (i.e. matrix-multiply the transpose of the earlier computed deviation matrix by the deviation matrix itself, giving a matrix with sums of squared deviations on the diagonal and sums of cross-products off the diagonal). The @ basically means matrix multiplication in Python. Then we can simply divide by the number of rows in order to get the actual Variance-Covariance matrix: variances on the diagonal, covariances off it.

Why is any of this important? And why are we doing it at all?

Basically, we have now massaged our data into a nice shape to do actual analysis on. Simply put, the actual analysis requires data in a specific format, not raw, unscaled data that could mean absolutely jackshit.

A small glimpse of this final Variance-Covariance matrix would be as follows. What this neatly formatted data means is that it can now be used for the actual mathematical step that gets us somewhere useful: computing the SVD. For 3 factors (or columns of data), regardless of how many items (rows) you have, you get a 3×3 matrix looking like the below:

```
[
  [var(f1),    cov(f1,f2), cov(f1,f3)],
  [cov(f2,f1), var(f2),    cov(f2,f3)],
  [cov(f3,f1), cov(f3,f2), var(f3)   ]
]
```
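Here's that whole step (transpose-multiply, then divide by the number of rows) as a minimal sketch on a plain number[][]; the helper name is mine, not the library's. Note that some libraries divide by n − 1 (sample covariance) instead of n, which scales every entry by a constant but changes nothing about the directions we extract later:

```typescript
// Variance-covariance matrix: (Dᵀ · D) / n, where D is mean-centered data.
function covarianceMatrix(centered: number[][]): number[][] {
  const n = centered.length;
  const d = centered[0].length;
  const cov: number[][] = Array.from({ length: d }, () => new Array(d).fill(0));
  for (let i = 0; i < d; i++) {
    for (let j = 0; j < d; j++) {
      for (let k = 0; k < n; k++) {
        cov[i][j] += centered[k][i] * centered[k][j]; // (Dᵀ D)[i][j]
      }
      cov[i][j] /= n; // deviation scores → variances/covariances
    }
  }
  return cov;
}

// Mean-centered exam scores from the worked example further down.
const D2 = [
  [-15, -12.5, -12.5],
  [-5, 7.5, -12.5],
  [25, 7.5, 17.5],
  [-5, -2.5, 7.5],
];
console.log(covarianceMatrix(D2)[0][0]); // var(f1) = (225 + 25 + 625 + 25) / 4 → 225
```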

What on earth is SVD?

Put informally, Singular Value Decomposition is the master of the universe. Basically, the easiest way to split a specially formatted matrix into a set of insights. What are the insights?

SVD: M = U · Σ · V^T, where M is the matrix being decomposed (for us, the variance-covariance matrix)

U and V are basically characteristics, and Σ is the importance of each characteristic. Okay, backing up: all of the symbols above are matrices, so they’re all basically just numbers arranged in rows and columns. The insights are in the numbers themselves: U provides ROW-type characteristic insights, V provides COLUMN-type characteristic insights, and Σ is just a set of importance scores telling you how much weight to assign to each characteristic.

For our current purpose we only need the COLUMN insights, i.e. we want to reduce COLUMNS while keeping ROWs intact, so the columns of V become our eigenvectors and the diagonal of Σ our eigenvalues.
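The library gets these from its SVD routine. As an illustration of what "top eigenvector" actually means, here's a power-iteration sketch (my own code, not the library's) that recovers the dominant eigenpair of a symmetric covariance matrix:

```typescript
// Power iteration: repeatedly multiply a vector by the symmetric covariance
// matrix and renormalize; the vector converges to the top eigenvector, and
// the Rayleigh quotient vᵀCv gives the matching eigenvalue.
function topEigenpair(cov: number[][], iters = 200): { value: number; vector: number[] } {
  const d = cov.length;
  let v: number[] = new Array(d).fill(1 / Math.sqrt(d));
  for (let t = 0; t < iters; t++) {
    const w = cov.map(row => row.reduce((s, x, j) => s + x * v[j], 0)); // C·v
    const norm = Math.sqrt(w.reduce((s, x) => s + x * x, 0));
    v = w.map(x => x / norm);
  }
  const Cv = cov.map(row => row.reduce((s, x, j) => s + x * v[j], 0));
  const value = v.reduce((s, x, i) => s + x * Cv[i], 0); // Rayleigh quotient
  return { value, vector: v };
}

// Variance-covariance (÷ n) of the exam data used later in the post.
const covExam = [
  [225, 87.5, 162.5],
  [87.5, 68.75, 43.75],
  [162.5, 43.75, 168.75],
];
const top = topEigenpair(covExam);
console.log(top.vector); // close to ±[0.74, 0.28, 0.60], the PC1 quoted later
```

(The eigenvalue comes out around 390 here rather than the ~520 in the table later, because that table uses the n − 1 divisor; the eigenvector is the same either way.)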

Wtf is this eigen shiii?

Eigen is German for “own” … no, we do not use it anywhere else … it’s basically mathematical terminology for a characteristic, so umm, just roll with it ig?

Basically, you can now pick a few eigenvectors and that is a “good enough”™️ representation of your entire data. The way you pick these vectors, and the total percentage of “good enough”™️ that you get, are both calculated using the eigenvalues. Just sort the vectors by eigenvalue and pick the ones with the largest eigenvalues to get the most accurate compressed representation of your data, and calculate the Percentage Explained simply by using:

percentage_explained = Σ(selected eigenvalues) / Σ(all eigenvalues)

So select 2 of the top vectors to get a higher percentage, 3 of them to get an even higher, and basically all the vectors to explain all your data (but then it is basically just the original data)
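In code, using the eigenvalues from the exam example further down (a sketch; it just assumes a plain number[] of eigenvalues):

```typescript
// Percentage of total variance explained by the top-k eigenvalues.
function percentageExplained(eigenvalues: number[], k: number): number {
  const sorted = [...eigenvalues].sort((a, b) => b - a); // largest first
  const total = sorted.reduce((s, x) => s + x, 0);
  const kept = sorted.slice(0, k).reduce((s, x) => s + x, 0);
  return (kept / total) * 100;
}

const eigenvalues = [520.1, 78.1, 18.5]; // from the exam example in this post
console.log(percentageExplained(eigenvalues, 1).toFixed(1)); // → "84.3"
console.log(percentageExplained(eigenvalues, 3)); // → 100 (all vectors = all the data)
```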

In practice, the eigenvalues are usually top-heavy, with the TOP eigenvector often explaining ~80% of your data (the Pareto principle or 80-20 rule), but you can transparently see when this is not actually the case, which is what makes Principal Components useful.

Compressed Data

So now you can really simply get the compressed data (a smaller number of columns), depending on which eigenvector(s) you chose:

compressed = selected_eigenvectors × centered_data^T

So if you chose 2 of the top eigenvectors (2 rows × 4 weights, one weight per column of your original data) and your data was originally 4 columns × 30 rows, then you finally get 2 columns × 30 rows of data, which might not seem like much, but what if you had 3 million rows? HA!
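Here's that projection as a sketch; I transpose so students stay as rows, and the helper name is my own. The centered exam data comes from the worked example further down:

```typescript
// Project mean-centered data onto the selected eigenvectors:
// each output row is the dot product of a data row with each eigenvector,
// i.e. (selected × centeredᵀ)ᵀ, giving (rows × k) from (rows × d).
function compress(centered: number[][], selected: number[][]): number[][] {
  return centered.map(row =>
    selected.map(vec => row.reduce((s, v, j) => s + v * vec[j], 0))
  );
}

// Exam scores minus column means [55, 62.5, 72.5].
const centered = [
  [-15, -12.5, -12.5],
  [-5, 7.5, -12.5],
  [25, 7.5, 17.5],
  [-5, -2.5, 7.5],
];
const pc1 = [[0.74, 0.28, 0.60]]; // keep only the top eigenvector (k = 1)
console.log(compress(centered, pc1).map(r => +r[0].toFixed(1)));
// → [-22.1, -9.1, 31.1, 0.1]
```

(These projected values won't match the score table further down exactly; that table is reported on a shifted scale, but the ordering of students is the same.)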

Uncompressed Data

You can get your original data back using

original ≈ selected_eigenvectors^T × compressedData + mean

This is lossy compression, since you no longer get the exact actual data back; that information got wiped out because you selected fewer eigenvectors than the full total.

Know that in order to transmit the data, you do need some overhead: the selected_eigenvectors (and the column means) along with the compressed data. But that should usually be extremely small compared to the actual data.
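A sketch of the round trip (my own helper; the eigenvector, means, and compressed scores below are from the exam example in this post):

```typescript
// Lossy reconstruction: original ≈ compressed · selected + mean.
// With k < d eigenvectors the result is only an approximation.
function decompress(
  compressed: number[][],
  selected: number[][],
  means: number[]
): number[][] {
  return compressed.map(row =>
    means.map((m, j) =>
      m + selected.reduce((s, vec, k) => s + row[k] * vec[j], 0)
    )
  );
}

const pc1 = [[0.74, 0.28, 0.60]];
const means = [55, 62.5, 72.5];
const compressed = [[-22.1], [-9.1], [31.1], [0.1]]; // centered exam data projected onto PC1
console.log(decompress(compressed, pc1, means).map(r => r.map(v => Math.round(v))));
// → [[39, 56, 59], [48, 60, 67], [78, 71, 91], [55, 63, 73]]
// vs the originals [[40, 50, 60], [50, 70, 60], [80, 70, 90], [50, 60, 80]]: close, but lossy.
```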

What have we achieved?

Well, we have achieved compression, but in order to understand actual insights let’s go back to the original example from the readme in the package. You are a high school teacher. You’ve given 3 examinations in which each student performed differently, but you didn’t really standardize the difficulty of each examination; some were harder, some were easier.

Now you want to grade your students, but it would be unfair to skew towards a single easy exam where the student might have scored easily.

Sample data from the Readme:

| Student | Exam 1 | Exam 2 | Exam 3 |
|----|----|----|----|
| 1 | 40 | 50 | 60 |
| 2 | 50 | 70 | 60 |
| 3 | 80 | 70 | 90 |
| 4 | 50 | 60 | 80 |

Here are the averages:

| | Exam 1 | Exam 2 | Exam 3 |
|----|----|----|----|
| Mean | 55 | 62.5 | 72.5 |

As you can see, Exam 1 was probably the hardest, but student 3 did exceedingly well in it. That exam should probably be given a higher weightage than the other examinations.

As you can see, student 3 is the best in class and student 1 the worst; the middle is where it gets close.

| Student | Average Score |
|----|----|
| 1 | 50.00 |
| 2 | 60.00 |
| 3 | 80.00 |
| 4 | 63.33 |

Right now, I do not know if Student 4 is truly better than Student 2, since the averages are super close; maybe they just got lucky in the third exam (it was the simplest, after all).

But luckily for us, we did Principal Components, due to which we now have 3 eigenvectors, each explaining a certain bit of variance in the examination difficulties.

| PC | Eigenvalue | Eigenvector | % Variance |
|----|----|----|----|
| PC1 | 520.1 | [0.74, 0.28, 0.60] | 84.3% |
| PC2 | 78.1 | [0.23, 0.74, -0.63] | 12.7% |
| PC3 | 18.5 | [0.63, -0.61, -0.48] | 3.0% |

Since we seem to have an 80-20 here (thanks, Pareto!) we can now very simply take the first eigenvector as a good measure of the difficulty of each exam. This also shows us, intuitively, that the first exam is a discriminating exam (i.e. it spread students out the most: most did badly, a few did very well), whereas the second is not (scores were pretty close together).

So, we can now create our compressed data as Score = (Exam 1 × 0.74) + (Exam 2 × 0.28) + (Exam 3 × 0.60), where Score is a simulated Exam 4 that tells us how “good” a student is at the subject while normalizing for difficulty. We can say that we would be 84.3% accurate in saying so, and that ~80% is good enough accuracy for us to predict the results of a FINAL exam.
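Running the exam marks through that formula as a quick check (note: the score table below reports these on a shifted scale, so the raw weighted sums differ by a roughly constant offset, but the ranking comes out identical):

```typescript
// Simulated "Exam 4": weight each exam by the PC1 eigenvector [0.74, 0.28, 0.60].
const marks = [
  [40, 50, 60],
  [50, 70, 60],
  [80, 70, 90],
  [50, 60, 80],
];
const weights = [0.74, 0.28, 0.60];
const scores = marks.map(row => row.reduce((s, v, j) => s + v * weights[j], 0));
console.log(scores.map(s => +s.toFixed(1))); // → [79.6, 92.6, 132.8, 101.8]

// Ranking, best first.
const ranking = scores
  .map((s, i) => ({ student: i + 1, s }))
  .sort((a, b) => b.s - a.s)
  .map(x => x.student);
console.log(ranking); // → [3, 4, 2, 1]
```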

Ideally we would have taken 100 exams in order to normalize this well, but umm, the students would probably revolt!

Now we have the scores:

| Student | Score |
|----|----|
| 1 | 31 |
| 2 | 44 |
| 3 | 84 |
| 4 | 53 |

Now the picture is much clearer: student 4 may have been lucky, but they were lucky consistently, across both a hard and an easy examination, and this gives them a much higher score than student 2. Student 3 did well across both hard and easy exams, so they are the valedictorian and will probably go on to do great things!

I know what you’re thinking … Wishy Washy AF

Exactly! And this is why Principal Components is mainly used as a dimensionality reduction technique rather than for actual insight across variables. For example, if Exam 1 was on Physics, Exam 2 on Chemistry, and Exam 3 on Math, would you say this is a fair comparison? Sure, student 3 stands clear in their mastery, but umm, it’s kinda hard to say what really qualifies the other students: are they physical chemists, or mathematical physicists?

And this is how we’ll move on to Part 4, where we actually use this technique not just to opaquely reduce a set of variables (columns) into a single variable (a weighted merge of the individual columns), but instead to combine columns into an inspectable series of weighted averages. Basically, we give each eigenvector a name based on the weights it assigns to the columns selected.

But hold on… before we move on to that, we need to go through Part 3, where we answer the most important question of 2022: ~~What is a Woman?~~ What is a Neural Network? More specifically, we will be answering how to investigate the inner workings of a convolutional neural network (if a GPT is a brain, a convnet is the eyes).

See you in Part 3… or not, idc.
