
Data Lives in Geometric Space

Everything Is a Vector

Word Embedding Vectors — Similarity as Direction

In machine learning, data lives in geometric space. Every data point with N features is a point in N-dimensional space. This is not a metaphor — it is the literal mathematical foundation of every algorithm.


A handwritten digit image (28×28 pixels) is a point in 784-dimensional space. Each pixel is one coordinate. Two similar-looking digits are nearby points in that space. Two different digits are far apart.


A word embedding maps a word to a point in 300-dimensional space. Words with similar meanings end up in the same neighborhood. 'Dog' and 'puppy' are close. 'Dog' and 'parliament' are far.


A user profile with 50 features (age, purchase history, click patterns) is a point in 50-dimensional space. Recommendation engines find users who are 'nearby' in this space and suggest what their geometric neighbors liked.


Geometry is how we reason about these spaces. Distance, direction, angle, projection — these are the fundamental operations of machine learning.

Vector Operations — The Building Blocks

The Dot Product Powers Everything

Three vector operations matter most in machine learning:


Vector addition — combining features or signals. If you add two word vectors, you get a vector representing both concepts blended together.


Scalar multiplication — scaling a vector changes its magnitude without changing its direction. Learning rates in gradient descent are scalar multipliers.


Dot product — this is the workhorse. The dot product of two vectors a and b equals |a||b|cos(θ), where θ is the angle between them. When the vectors are normalized (unit length), the dot product IS the cosine of the angle.


Cosine similarity = cos(θ) = (a·b) / (|a||b|)


This single formula powers:

- Search engines — finding documents similar to a query

- Attention mechanisms — deciding which tokens matter to each other

- Recommendation engines — matching user profiles to item profiles

- Retrieval-augmented generation — finding relevant context for language models


cos(θ) = 1 means the vectors point in exactly the same direction (identical meaning). cos(θ) = 0 means they are perpendicular (unrelated). cos(θ) = -1 means they point in opposite directions (opposing meaning).
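
The cosine similarity formula above translates directly to code. A minimal pure-Python sketch (standard library only; the function name is ours):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors give 1, perpendicular vectors give 0,
# opposite vectors give -1 -- regardless of their lengths.
```

Note that scaling either vector does not change the result: only the angle matters.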

Cosine Similarity

Cosine similarity is one of the most used metrics in modern machine learning systems.

Two word embedding vectors have a cosine similarity of 0.95. Another pair has cosine similarity of 0.12. What does each number tell you about the relationship between the words in each pair?

Three Ways to Measure Distance

The Choice of Distance Metric Changes What 'Similar' Means

Three Distance Metrics — Same Points, Different Meanings

Given two points in space, there are many ways to measure the 'distance' between them. Each metric defines a different geometry, and that geometry determines what your model considers 'similar.'


Euclidean distance (L2) — the straight-line distance. d = √(Σ(aᵢ - bᵢ)²). This is the 'as the crow flies' distance, the one your intuition expects. It treats all dimensions equally and is sensitive to magnitude.


Manhattan distance (L1) — the grid-walking distance. d = Σ|aᵢ - bᵢ|. Like navigating city blocks — you can only move along axes, never diagonally. More robust to outliers in single dimensions because it does not square the differences.


Cosine distance — measures the angle between vectors, ignoring magnitude entirely. d = 1 - cos(θ). Two documents about the same topic have small cosine distance regardless of length. Two equally long documents about different topics have large cosine distance.


The choice is not arbitrary. If magnitude matters (dosage of a drug, temperature of a reactor), use Euclidean. If you care about proportions rather than absolutes (word frequency distributions, user preference profiles), use cosine. If individual feature differences matter more than aggregate magnitude (fault diagnosis, where one sensor spiking is meaningful), use Manhattan.
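
The three metrics can be written side by side in a few lines each. A pure-Python sketch (function names are ours):

```python
import math

def euclidean(a, b):
    """L2: straight-line distance, sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1: grid-walking distance; no squaring, so more robust to
    a single dimension spiking."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(theta): compares direction only, ignoring magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)
```

For a = [1, 2] and b = [4, 6], the three metrics give 5.0 (Euclidean), 7 (Manhattan), and roughly 0.008 (cosine, since the two vectors point in nearly the same direction). Same pair of points, three different answers to "how similar?"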

K-Nearest Neighbors — Pure Geometry

KNN: The Simplest Geometric Algorithm

K-Nearest Neighbors is the most transparent geometric algorithm in machine learning. It has no training phase — it IS the training data.


To classify a new point: find the K closest points in the training data. Let them vote. Majority class wins. That is the entire algorithm.


For K = 1, the decision regions KNN produces form a Voronoi diagram — a partition of space where every point belongs to the region of its nearest training example. Under Euclidean distance, the region boundaries are the perpendicular bisectors between adjacent training points.


Here is the geometric insight that matters: the choice of distance metric completely changes the Voronoi diagram. Euclidean distance produces straight, polygonal cell edges (the perpendicular bisectors). Manhattan distance produces piecewise-linear edges that include 45° diagonal segments, reflecting its diamond-shaped "unit circle." Cosine distance produces radial, cone-shaped regions that fan out from the origin.


Same training data. Same K. Different distance metric. Completely different model. The geometry IS the model.
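
The entire algorithm fits in a few lines. A pure-Python sketch with a pluggable distance function (names are ours), which makes the "geometry IS the model" point concrete:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3, dist=None):
    """Classify `query` by majority vote of its k nearest training points.
    `train` is a list of (point, label) pairs."""
    if dist is None:
        # Default to Euclidean distance.
        dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Find the k training points closest to the query...
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    # ...and let them vote.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Passing a Manhattan or cosine function as `dist` changes the decision boundaries — and therefore the model — without touching the data or K.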

Choosing a Distance Metric

Distance metrics are not interchangeable — the right choice depends on what 'similar' means for your data.

Why might you use cosine distance instead of Euclidean distance when comparing documents? Think about what happens when two documents discuss the same topic but one is much longer.

Hyperplanes — Flat Boundaries in High Dimensions

Every Linear Classifier Finds a Hyperplane

Decision Boundaries — Linear, Nonlinear, and the Kernel Trick

A linear classifier finds a flat surface that separates two classes. The dimensionality of this surface depends on the space:


- In 2D space, the boundary is a line (1-dimensional)

- In 3D space, the boundary is a plane (2-dimensional)

- In 784D space (MNIST digit images), the boundary is a 783-dimensional hyperplane


The general pattern: in N-dimensional space, the decision boundary is an (N-1)-dimensional flat surface called a hyperplane.


Logistic regression, support vector machines, and single-layer perceptrons are all hyperplane finders. They differ in HOW they find the best hyperplane:

- Logistic regression maximizes the probability of correct classification

- SVMs maximize the geometric margin — the distance from the hyperplane to the nearest data points

- Perceptrons simply find any hyperplane that separates the data, with no guarantee of optimality


The weight vector of a linear classifier IS the normal vector to the hyperplane. The bias term shifts the hyperplane away from the origin. These are geometric objects with geometric interpretations.
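
That geometric interpretation is visible in the decision rule itself. A minimal sketch of a linear classifier's prediction step (the training that produces w and b is omitted; names are ours):

```python
def linear_decision(w, b, x):
    """Predict +1 or -1 from sign(w . x + b).
    w is the normal vector to the hyperplane; b shifts it off the origin."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# With w = [1, 1] and b = -1, the hyperplane is the line x + y = 1:
# points above it score positive, points below it score negative.
```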

Beyond Flat Boundaries

When Data Is Not Linearly Separable

Many real-world problems cannot be solved with a flat boundary. Consider classifying images of cats vs dogs — no single hyperplane in pixel space separates them cleanly.


Two geometric strategies exist:


Strategy 1: The kernel trick — Transform the data into a higher-dimensional space where it IS linearly separable. A classic example: points inside a circle (class A) and points outside (class B) in 2D. No line separates them. But add a third dimension z = x² + y², and the inner points (small x² + y²) sit low while the outer points (large x² + y²) sit high. Now a flat plane separates them perfectly.


SVMs with kernel functions do this implicitly — they compute dot products in the high-dimensional space without ever constructing the actual high-dimensional vectors. This is called the 'kernel trick' and it is a purely geometric insight.
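
The circle example can be checked numerically with an explicit feature map. (A kernel never constructs these coordinates; this sketch builds them anyway to make the lift concrete.)

```python
def lift(point):
    """Map 2D (x, y) to 3D (x, y, x^2 + y^2) -- the explicit feature map
    for the circle example above."""
    x, y = point
    return (x, y, x * x + y * y)

# Points inside the unit circle land below z = 1; points outside land above,
# so the flat plane z = 1 separates the classes in the lifted space.
inner = lift((0.3, 0.4))   # z = 0.25
outer = lift((3.0, 4.0))   # z = 25.0
```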


Strategy 2: Neural networks — Stack linear transformations with nonlinear activation functions. Each layer applies a linear transform (matrix multiply = rotation + scaling + shearing) followed by a nonlinear 'bend' (ReLU, sigmoid, tanh). The composition of many linear-then-bend operations can approximate any continuous boundary shape.


A deep neural network is a sequence of geometric transforms that warp the input space until the classes become linearly separable in the final layer.

Separating Circular Data

This is one of the most important geometric problems in machine learning.

In 2D, you have red points inside a circle and blue points outside. A straight line cannot separate them. Describe two geometric strategies to solve this.

The Loss Surface

Training = Walking Downhill on a Surface

Loss Landscape — Navigating the Surface

Every machine learning model has parameters — weights and biases. The loss function measures how wrong the model's predictions are. Together, these define a loss surface: a landscape where each point corresponds to a specific set of parameter values, and the height is the loss.


For a model with 2 parameters, the loss surface is a 3D landscape you can visualize — hills, valleys, and plains. For a model with 175 billion parameters (like GPT-3), the loss surface exists in 175-billion-dimensional space. The math is identical.


Gradient descent is the algorithm that navigates this surface. The gradient is a geometric object — a vector that points in the direction of steepest ascent. To reduce loss, move in the opposite direction: the negative gradient. This is literally walking downhill.


The learning rate controls step size. Too large and you overshoot valleys. Too small and you crawl. The gradient tells you the direction; the learning rate tells you how far to step.
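
The walking-downhill loop is short enough to write out. A one-dimensional sketch (function names and the example loss are ours):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Walk downhill: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # negative gradient = direction of descent
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
# Starting from 0, the iterates converge toward the minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Try lr=1.1 and the same loop diverges — the overshooting failure mode described above.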

Saddle Points, Minima, and the Geometry of High Dimensions

The Loss Landscape Is Not a Simple Bowl

A naive picture of training imagines a smooth bowl with a single lowest point. The reality is far more complex:


Local minima — valleys that are not the deepest. Gradient descent can get stuck here, satisfied that every direction goes up, even though a deeper valley exists elsewhere.


Saddle points — shaped like a horse saddle. The loss curves downward in some dimensions and upward in others. In high dimensions, saddle points are exponentially more common than local minima: a critical point in 1000-dimensional space has to curve upward in ALL 1000 dimensions to be a local minimum. If even one dimension curves down, it is a saddle point.


Flat plateaus — regions where the gradient is near zero. Training stalls because there is no slope to follow.


Sharp vs flat minima — a sharp minimum is a narrow valley. A flat minimum is a broad valley. Research suggests that flat minima generalize better to unseen data, because small perturbations to the parameters (from noise in new data) do not dramatically change the loss.


SGD with momentum helps escape saddle points and sharp minima. The randomness of stochastic gradient descent acts like shaking a ball on the surface — it bounces out of narrow traps and finds broader, flatter valleys.

SGD vs Full-Batch Gradient Descent

This is one of the most important practical insights in machine learning training.

Why does stochastic gradient descent (SGD) often find better solutions than full-batch gradient descent, from a geometric perspective?

Words as Vectors — Semantic Arithmetic

Meaning Has Direction

Word Embedding Space — Semantic Geometry

Word2Vec, GloVe, and modern transformer embeddings map discrete tokens (words, subwords) to continuous vector spaces. The result is a geometric world where meaning has coordinates.


The famous example: king - man + woman ≈ queen


This is vector arithmetic. The vector from 'man' to 'king' represents the concept 'royalty applied to a male.' The vector from 'woman' to 'queen' represents 'royalty applied to a female.' These vectors are approximately parallel — same direction, same relationship, different starting points.


Other geometric relationships that emerge from training on text:

- Paris - France + Italy ≈ Rome (capital-of relationship)

- walked - walk + swim ≈ swam (past tense transformation)

- bigger - big + small ≈ smaller (comparative form)


No one programmed these relationships. The model discovered that meaning has geometric structure by reading billions of words. Directions in embedding space correspond to semantic relationships. This is one of the most profound geometric discoveries in machine learning.
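
The analogy mechanics can be demonstrated with toy vectors. The embeddings below are hand-constructed for illustration (NOT real trained embeddings): dimension 0 loosely encodes "royalty" and dimension 1 "maleness," mimicking the parallel-direction structure described above.

```python
import math

emb = {
    "king":  [1.0, 1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "queen": [1.0, 0.0],
}

def analogy(a, b, c):
    """Return the word whose vector is closest (by cosine) to b - a + c."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(2)]
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return dot / (nu * nv + 1e-12)   # epsilon guards the zero vector
    return max(emb, key=lambda w: cos(emb[w], target))

# king - man + woman lands on [1, 0], which is exactly "queen" here.
```

With real 300-dimensional embeddings the same nearest-neighbor lookup works, just approximately rather than exactly.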

The Manifold Hypothesis

High-Dimensional Data Lives on Low-Dimensional Surfaces

A 64×64 grayscale face image has 4,096 pixel values — it is a point in 4,096-dimensional space. But not every point in that space is a valid face. Most random 4,096-dimensional vectors look like static noise, not faces.


The manifold hypothesis states that real-world, high-dimensional data actually lies on or near low-dimensional curved surfaces (manifolds) embedded in the high-dimensional space. The manifold of faces might be only 50-dimensional — parameterized by factors like lighting angle, head pose, expression, skin tone, age.


This is a geometric claim with practical consequences:


- Dimensionality reduction (PCA, t-SNE, UMAP) works because the data is approximately low-dimensional. These algorithms find the manifold and project onto it.

- Autoencoders learn to compress data into a low-dimensional latent space (the manifold) and reconstruct from it.

- Generative models (VAEs, diffusion models) learn the manifold and sample new points on it — generating new faces, new music, new text that looks real because it lies on the learned manifold.


When your model fails to generalize, one geometric explanation is: it learned the wrong manifold. The training data traced out a surface that does not match the true data distribution.

Vector Analogies

The geometric structure of embedding spaces is one of the most surprising results in modern machine learning.

If word embeddings capture meaning geometrically, what does it mean when we say the vector from 'man' to 'king' is approximately parallel to the vector from 'woman' to 'queen'? What geometric concept is at work?

ROC Curves — Classification Quality as Area

Evaluation Metrics Live in Geometric Spaces

ROC Space — Classification Quality as Geometry

An ROC (Receiver Operating Characteristic) curve plots True Positive Rate (y-axis) against False Positive Rate (x-axis) as you sweep the classification threshold from 0 to 1.


This is a geometric space with meaningful landmarks:

- (0, 1) — the top-left corner — perfect classification. Every positive detected, zero false alarms.

- (0, 0) — the bottom-left — the model classifies everything as negative.

- (1, 1) — the top-right — the model classifies everything as positive.

- The diagonal from (0,0) to (1,1) — a random classifier. At every threshold, it has equal true positive and false positive rates.


AUC (Area Under the Curve) is literally a geometric area measurement. AUC = 0.5 means the model is random (the area under the diagonal). AUC = 1.0 means perfect classification (the entire unit square). A good model's ROC curve bows toward the top-left corner, enclosing more area.


AUC has a beautiful probabilistic interpretation: it equals the probability that the model scores a random positive example higher than a random negative example. But geometrically, it is just area — and that geometric simplicity is what makes it intuitive.
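
Because AUC is literally an area, it can be computed with the trapezoid rule over the curve's points. A minimal sketch (the function name is ours):

```python
def auc(points):
    """Area under a curve given as (fpr, tpr) points, via the trapezoid rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid between adjacent points
    return area

# The random-classifier diagonal from (0,0) to (1,1) encloses area 0.5.
# A perfect classifier's curve through the top-left corner encloses area 1.0.
```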

Precision-Recall Space

A Different Geometric Tradeoff

Precision-recall curves live in a different geometric space than ROC curves, and they tell a different story.


Precision = of everything the model flagged positive, what fraction was actually positive?

Recall = of all actual positives, what fraction did the model find?


As you lower the classification threshold (flag more things as positive), recall increases (you catch more real positives) but precision typically decreases (you also catch more false positives). This tradeoff traces a curve in precision-recall space.


F1 score = 2 × (precision × recall) / (precision + recall), the harmonic mean. The harmonic mean is dominated by the smaller of the two values, so a model cannot earn a high F1 by excelling at one metric while ignoring the other. At the break-even point, where the precision-recall curve crosses the diagonal of the unit square and precision equals recall, F1 equals that common value.


Average Precision (AP) = the area under the precision-recall curve. Like AUC-ROC, it summarizes the entire curve into a single number that represents geometric area.
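
The definitions above compute directly from confusion-matrix counts. A minimal sketch (names are ours; the inputs are true positives, false positives, and false negatives):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of everything flagged positive, fraction correct
    recall = tp / (tp + fn)      # of all actual positives, fraction found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With tp=8, fp=2, fn=8: precision 0.8, recall 0.5 — and F1 is about 0.615, pulled toward the weaker recall rather than the arithmetic mean 0.65.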


ROC curves and precision-recall curves are complementary geometric views of the same model. ROC curves can be misleadingly optimistic on imbalanced datasets (99% negative class). Precision-recall curves remain informative because they focus on the positive class.

AUC-ROC Interpretation

Understanding what AUC-ROC measures geometrically helps you choose between models.

Two models have the same accuracy (85%). Model A has AUC-ROC of 0.92. Model B has AUC-ROC of 0.78. Why might you prefer Model A? What does the geometric difference in their ROC curves tell you?

Transformers — Dot Products as Attention

Attention Is a Geometric Similarity Measure

Attention = Geometric Alignment Between Query and Keys

The transformer architecture — the foundation of modern language models — is built on a geometric operation: the dot product.


For each token in a sequence, the transformer computes three vectors: Query (Q), Key (K), and Value (V), each obtained by multiplying the input embedding by learned weight matrices.


The attention score between a query and a key is: score = (q · k) / √d_k, where d_k is the dimension of the key vectors. (For a whole sequence, this is computed in one shot as the matrix product QKᵀ / √d_k.)


This is a scaled dot product — a geometric similarity measure. When Q and K point in the same direction (small angle between them), the dot product is large: this key is highly relevant to this query. When they are perpendicular, the dot product is zero: irrelevant.


The scores are passed through softmax to create a probability distribution: attention weights that sum to 1. The output is the weighted sum of Value vectors, where the weights are determined by geometric alignment.


In a sentence like 'The cat sat on the mat because it was tired,' attention computes: for the word 'it,' which other words have the most geometric alignment? If the Q vector for 'it' aligns most closely with the K vector for 'cat,' the model attends to 'cat' — resolving the pronoun reference through geometry.
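
A single head of scaled dot-product attention can be sketched in pure Python (toy dimensions, names ours; real implementations use batched matrix math):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.
    queries/keys are lists of d-dimensional vectors; values may differ in width."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Geometric alignment of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax turns scores into weights that sum to 1.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output = weighted sum of value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out
```

A query aligned with the first key pulls the output toward the first value vector — alignment in, weighting out.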

Multi-Head Attention — Multiple Geometric Perspectives

Why Multiple Heads?

Self-attention with a single set of Q, K, V matrices computes one type of geometric alignment. But language has many types of relationships — syntactic, semantic, positional, referential.


Multi-head attention uses multiple sets of Q, K, V projection matrices, each projecting into a different subspace of the embedding. Each head measures alignment in its own geometric subspace.


What researchers observe when they inspect attention heads:

- Head 1 might attend to the previous word (positional proximity)

- Head 2 might attend to the verb from the subject (syntactic dependency)

- Head 3 might attend to semantically related words earlier in the context

- Head 4 might attend to the most recent noun (coreference)


Each head is a different geometric lens on the same data. The projections rotate and scale the embedding space differently, making different relationships visible through alignment.


This is why transformers outperform models with a single attention mechanism. A single dot product in the full embedding space captures one notion of similarity. Multiple dot products in different subspaces capture multiple, complementary notions simultaneously.

Multi-Head Attention

Multi-head attention is one of the key architectural innovations of the transformer.

In a transformer, why does using multiple attention heads help compared to a single head? Answer in terms of geometric subspaces.

Machine Learning Is Applied Geometry

The Unifying Thread

Look at what we have covered. Every major concept in machine learning has a geometric core:


Data = points in high-dimensional space

Features = dimensions of that space

Similarity = distance or angle between points

Classification = finding geometric boundaries between classes

Training = navigating a loss surface by following gradients

Embeddings = learned coordinate systems where geometry encodes meaning

Evaluation = areas under curves in metric spaces

Attention = dot products measuring angular alignment


This is not a coincidence. Machine learning inherited its mathematical framework from linear algebra and differential geometry — fields that are fundamentally about space, shape, and transformation.


Understanding the geometry gives you something that memorizing algorithms cannot: intuition. When your model fails, the geometric view suggests where to look. Are the classes not separable? Look at the boundary. Is training stuck? Examine the loss landscape. Are embeddings poor? Check if similar items are geometrically close. Is attention diffuse? Inspect the subspace projections.


The geometry is the same whether you are working with 3 dimensions or 3 billion. The math scales. The intuition transfers. This is what makes geometry the universal language of machine learning.

Geometric Debugging

We have covered vectors, distances, boundaries, training, embeddings, evaluation, and attention — all through the lens of geometry.

Choose one concept from this lesson and explain how understanding its geometric nature changes HOW you would debug or improve a model that uses it. Be specific.