How Machines Understand Language
A guide to word embeddings — where meaning becomes mathematics, and vectors do the talking.
When a search engine retrieves a document about automobiles in response to a query about cars, it is not matching text character by character. Somewhere beneath the interface, the system understands that these two words are semantically related. The mechanism behind that understanding is the word embedding — and once you see the geometry, you cannot unsee it.
This article walks through the key mathematical operations that make embeddings work: distance, similarity, arithmetic, scaling, and the dot product. Each concept is illustrated with concrete numerical vectors so the math is visible, not just described. Real embeddings typically use hundreds of dimensions; the 3- and 4-dimensional examples here preserve all the structure while staying readable on a page.
A word embedding is a representation of a word as a vector — an ordered list of numbers — in a high-dimensional space. A typical embedding model might use 300 dimensions, so the word cat becomes a point with 300 coordinates. That sounds abstract, but the key insight is this: the position of that point encodes meaning.
This is what researchers call a semantic space. Words with related meanings end up positioned close to each other. King and Queen live near each other. Paris and London live near each other. Bicycle and democracy live far apart. The model learns these positions not from human-curated rules, but from the statistical patterns of how words appear together in enormous text corpora.
vec(“King”) = [ 0.9, 0.7, 0.4, +0.6 ]
vec(“Queen”) = [ 0.9, 0.7, 0.4, -0.6 ]
vec(“Man”) = [ 0.5, 0.3, 0.1, +0.8 ]
vec(“Woman”) = [ 0.5, 0.3, 0.1, -0.8 ]
The first three dimensions encode royalty, authority, and age.
The fourth dimension encodes gender: positive = masculine, negative = feminine.
Think of it as a map where the geography is meaning. Every word is a pin, and the distances between pins reflect semantic relationships rather than physical ones.
Once words are points in space, we need a way to measure how close they are. Two approaches dominate: Euclidean distance and cosine similarity. For the examples below, we use a 3-dimensional temperature embedding:
vec(“Hot”) = [ 1.0, 0.8, 0.6 ]
vec(“Warm”) = [ 0.8, 0.6, 0.4 ]
vec(“Cold”) = [ -0.6, 0.4, -0.8 ]
2.1 Euclidean (Cartesian) Distance
The most intuitive measure: the straight-line gap between the tips of two arrows drawn from the origin. For vectors a and b in n dimensions:
d(a, b) = √[ Σ (aᵢ - bᵢ)² ], summed over i = 1 … n
// Hot vs Warm (similar words)
d(Hot, Warm) = √[ (1.0-0.8)² + (0.8-0.6)² + (0.6-0.4)² ]
= √[ 0.04 + 0.04 + 0.04 ] = √0.12 ≈ 0.346 ← small: close together
// Hot vs Cold (opposite words)
d(Hot, Cold) = √[ (1.0-(-0.6))² + (0.8-0.4)² + (0.6-(-0.8))² ]
= √[ 2.56 + 0.16 + 1.96 ] = √4.68 ≈ 2.163 ← large: far apart
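The two distances worked out above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `euclidean_distance` is an illustrative choice, and the vectors are the toy temperature values from the worked example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Square each component-wise difference, sum, then take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy temperature vectors from the worked example.
hot = [1.0, 0.8, 0.6]
warm = [0.8, 0.6, 0.4]
cold = [-0.6, 0.4, -0.8]

d_hot_warm = euclidean_distance(hot, warm)
d_hot_cold = euclidean_distance(hot, cold)

print(f"d(Hot, Warm) = {d_hot_warm:.3f}")  # 0.346
print(f"d(Hot, Cold) = {d_hot_cold:.3f}")  # 2.163
```

On Python 3.8 or later, `math.dist(hot, warm)` computes the same quantity without a hand-written helper.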
2.2 Cosine Similarity — The Industry Standard
In practice, NLP systems almost universally prefer cosine similarity over Euclidean distance. It ignores the length of vectors entirely and focuses only on the angle between them — two vectors pointing the same direction score 1.0 regardless of their magnitude.
cos(θ) = (a · b) / (‖a‖ × ‖b‖)
‖Hot‖ = √(1.0² + 0.8² + 0.6²) = √2.00 ≈ 1.414
‖Warm‖ = √(0.8² + 0.6² + 0.4²) = √1.16 ≈ 1.077
‖Cold‖ = √((-0.6)² + 0.4² + (-0.8)²) = √1.16 ≈ 1.077
// Hot vs Warm (small angle)
dot(Hot, Warm) = (1.0)(0.8) + (0.8)(0.6) + (0.6)(0.4) = 0.80 + 0.48 + 0.24 = 1.52
cos(Hot, Warm) = 1.52 / (1.414 × 1.077) = 1.52 / 1.523 ≈ +0.998
// Hot vs Cold (large angle)
dot(Hot, Cold) = (1.0)(-0.6) + (0.8)(0.4) + (0.6)(-0.8) = -0.60 + 0.32 - 0.48 = -0.76
cos(Hot, Cold) = -0.76 / (1.414 × 1.077) = -0.76 / 1.523 ≈ -0.499
| Word Pair | Euclidean d | cos(θ) | Interpretation |
|---|---|---|---|
| Hot vs Warm | 0.346 | +0.998 | Nearly identical direction: closely related |
| Hot vs Cold | 2.163 | -0.499 | About 120° apart, pointing largely away from each other: antonyms |
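The cosine figures in the table can be reproduced the same way. The sketch below assumes plain Python lists as vectors; the helper names `dot`, `norm`, and `cosine_similarity` are illustrative, not taken from any particular library.

```python
import math

def dot(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length (Euclidean norm)."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom

hot = [1.0, 0.8, 0.6]
warm = [0.8, 0.6, 0.4]
cold = [-0.6, 0.4, -0.8]

cos_hot_warm = cosine_similarity(hot, warm)
cos_hot_cold = cosine_similarity(hot, cold)
print(f"cos(Hot, Warm) = {cos_hot_warm:+.3f}")  # +0.998
print(f"cos(Hot, Cold) = {cos_hot_cold:+.3f}")  # -0.499

# Stretching a vector does not change the angle, so the score is unchanged.
warm_doubled = [2 * x for x in warm]
cos_hot_warm_doubled = cosine_similarity(hot, warm_doubled)
```

The last two lines illustrate the point made in the text: doubling Warm leaves its cosine similarity with Hot where it was, because only the angle matters.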
Because words are vectors, you can perform arithmetic on them — and the results are semantically meaningful. The most famous example uses the 4-dimensional royalty vectors introduced in Section 1:
King = [ 0.9, 0.7, 0.4, +0.6 ]
Man = [ 0.5, 0.3, 0.1, +0.8 ]
Woman = [ 0.5, 0.3, 0.1, -0.8 ]
// Subtract component by component, then add
King - Man = [ 0.9-0.5, 0.7-0.3, 0.4-0.1, 0.6-0.8 ] = [ 0.4, 0.4, 0.3, -0.2 ]
+ Woman = [ 0.4+0.5, 0.4+0.3, 0.3+0.1, -0.2+(-0.8) ] = [ 0.9, 0.7, 0.4, -1.0 ]
// Find nearest word by Euclidean distance
result = [ 0.9, 0.7, 0.4, -1.0 ]
d(result, Queen) = √[ 0 + 0 + 0 + (-1.0-(-0.6))² ] = √0.16 = 0.400 ← nearest
d(result, Woman) ≈ 0.671; d(result, King) = 1.600; d(result, Man) ≈ 1.910
cos(result, Queen) ≈ 0.974 ← highest cosine similarity also points to Queen
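What happened geometrically? Subtracting Man from King left only what separates a king from an ordinary man: extra royalty, authority, and age (0.4, 0.4, 0.3) plus a small gender offset (0.6 - 0.8 = -0.2). Adding Woman restored the shared baseline on the first three dimensions and pushed the gender value to -1.0, firmly on the feminine side. The result sits 0.4 units from Queen, the nearest word in this vocabulary.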
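The analogy can be run end to end on the four-word toy vocabulary. In this sketch the King vector is read off the subtraction line above, and the helper names (`add`, `sub`, `euclidean_distance`) are illustrative assumptions rather than any library's API.

```python
import math

# Toy 4-dimensional vocabulary: royalty, authority, age, gender.
vocab = {
    "King":  [0.9, 0.7, 0.4, 0.6],
    "Queen": [0.9, 0.7, 0.4, -0.6],
    "Man":   [0.5, 0.3, 0.1, 0.8],
    "Woman": [0.5, 0.3, 0.1, -0.8],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# King - Man + Woman, component by component.
result = add(sub(vocab["King"], vocab["Man"]), vocab["Woman"])

# Distance from the result to every word in the vocabulary.
distances = {word: euclidean_distance(result, vec) for word, vec in vocab.items()}
nearest = min(distances, key=distances.get)

for word, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"d(result, {word}) = {d:.3f}")
print("nearest:", nearest)  # Queen
```

Here Queen wins even with the input words left in the candidate set. With real embeddings the result often lands closest to one of the inputs, so analogy lookups commonly exclude King, Man, and Woman before picking the nearest neighbour.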
Multiplying or dividing a vector by a scalar (a plain number) changes its magnitude without changing its direction. This maps neatly onto the idea of degree in language — Tiny, Large, and Gigantic all point in roughly the same semantic direction, but at different intensities.
vec(“Large”) = [ 0.50, 0.70, 0.40 ]
vec(“Gigantic”) = [ 1.10, 1.50, 0.90 ]
// Multiplying Large by 2 moves it toward Gigantic
Large × 2 = [ 0.5×2, 0.7×2, 0.4×2 ] = [ 1.00, 1.40, 0.80 ]
vec(“Gigantic”) = [ 1.10, 1.50, 0.90 ] d(Large × 2, Gigantic) ≈ 0.173 ← very close
// Multiplying Large by 0.2 moves it toward Tiny
Large × 0.2 = [ 0.10, 0.14, 0.08 ]
vec(“Tiny”) = [ 0.10, 0.20, 0.10 ] d(Large × 0.2, Tiny) ≈ 0.063 ← very close
Division works the same way along an intensity axis. Halving a “Loud” vector lands near “Soft”:
vec(“Loud”) = [ 0.90, 1.20, 0.60 ]
vec(“Soft”) = [ 0.30, 0.40, 0.20 ]
Loud ÷ 2 = [ 0.45, 0.60, 0.30 ]
d(Loud ÷ 2, Soft) ≈ 0.269 ← direction unchanged, intensity halved
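Scaling is a one-line operation in code. The sketch below reuses the size vectors from this section and checks how close each scaled copy of Large lands to its neighbour; `scale` is an illustrative helper name.

```python
import math

def scale(vec, factor):
    """Multiply every component by the same number; direction is preserved."""
    return [factor * x for x in vec]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

large = [0.50, 0.70, 0.40]
gigantic = [1.10, 1.50, 0.90]
tiny = [0.10, 0.20, 0.10]

d_to_gigantic = euclidean_distance(scale(large, 2), gigantic)
d_to_tiny = euclidean_distance(scale(large, 0.2), tiny)

print(f"d(Large x 2, Gigantic) = {d_to_gigantic:.3f}")  # 0.173
print(f"d(Large x 0.2, Tiny)   = {d_to_tiny:.3f}")      # 0.063
```

Division is the same call with a fractional factor: `scale(loud, 0.5)` halves a Loud vector.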
The dot product of two vectors is computed by multiplying their corresponding components and summing the results:
a · b = a₁b₁ + a₂b₂ + … + aₙbₙ = Σ aᵢbᵢ
The dot product is cosine similarity before normalising away the vector lengths. It captures two things simultaneously: the direction of agreement and the combined magnitude. Cosine similarity captures only the first.
We reuse the loudness vectors from Section 4: Very Loud (VL) is the “Loud” vector and A Little Loud (AL) is the “Soft” vector. They point in exactly the same direction but have very different lengths:
vec(“A Little Loud”) = [ 0.30, 0.40, 0.20 ] |magnitude| = 0.539
vec(“Very Loud”) = [ 0.90, 1.20, 0.60 ] |magnitude| = 1.616
// Cosine similarity: measures direction only
dot(AL, VL) = (0.3)(0.9) + (0.4)(1.2) + (0.2)(0.6) = 0.27 + 0.48 + 0.12 = 0.87
cos(AL, VL) = 0.87 / (0.539 × 1.616) = 0.87 / 0.871 ≈ 1.000
// Dot product: measures direction AND magnitude
AL · AL = (0.3)² + (0.4)² + (0.2)² = 0.09 + 0.16 + 0.04 = 0.29
VL · VL = (0.9)² + (1.2)² + (0.6)² = 0.81 + 1.44 + 0.36 = 2.61
| Word | Magnitude | cos(θ) with the other word | v · v |
|---|---|---|---|
| A Little Loud | 0.539 | 1.000 (same dir.) | 0.29 |
| Very Loud | 1.616 | 1.000 (same dir.) | 2.61 |
Both words are perfectly collinear, so the cosine similarity between them is 1.0. But each vector's dot product with itself is 0.29 versus 2.61, a 9× difference, because the dot product grows with vector length. This is one reason recommendation systems and the attention mechanisms in transformer models work with dot products rather than cosine similarity: when you want to know not just whether a document is relevant but also how prominently it discusses a topic, the dot product reflects both at once.
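The contrast between the two measures is easy to see side by side. This is a minimal sketch on the two loudness vectors; the helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a_little_loud = [0.3, 0.4, 0.2]
very_loud = [0.9, 1.2, 0.6]

# Direction only: the two vectors are collinear, so the cosine is 1.
cos_al_vl = cosine_similarity(a_little_loud, very_loud)

# Direction and magnitude: dot products grow with vector length.
dot_al_vl = dot(a_little_loud, very_loud)      # 0.87
dot_al_al = dot(a_little_loud, a_little_loud)  # 0.29
dot_vl_vl = dot(very_loud, very_loud)          # 2.61

print(f"cos(AL, VL) = {cos_al_vl:.3f}")
print(f"AL . VL = {dot_al_vl:.2f}, AL . AL = {dot_al_al:.2f}, VL . VL = {dot_vl_vl:.2f}")
print(f"ratio VL.VL / AL.AL = {dot_vl_vl / dot_al_al:.1f}")  # 9.0
```

Very Loud is exactly three times A Little Loud, so its self dot product is 3² = 9 times larger while the cosine stays at 1.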
Search engines convert your query into a vector and retrieve documents whose vectors are nearest to it in the semantic space — using cosine similarity to rank by relevance regardless of exact word match. When you search for car insurance and the engine returns results about vehicle coverage, it is doing nearest-neighbour lookup in embedding space, exactly as the Hot/Warm/Cold example in Section 2 demonstrates.
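As a rough illustration of that lookup, the sketch below treats Hot as the query and ranks Warm and Cold as if they were documents. Real search systems use learned vectors with hundreds of dimensions and approximate nearest-neighbour indexes; `rank_by_cosine` is a hypothetical helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, candidates):
    """Return (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins: the query is Hot, the "documents" are Warm and Cold.
query = [1.0, 0.8, 0.6]
candidates = {
    "Cold": [-0.6, 0.4, -0.8],
    "Warm": [0.8, 0.6, 0.4],
}

ranking = rank_by_cosine(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
# Warm: +0.998
# Cold: -0.499
```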
Recommendation systems represent your interests as a vector computed from your history, then find products whose vectors are closest to yours. The dot product is particularly useful here: a highly-relevant item with a large magnitude — analogous to Very Loud — will score higher than a mildly-relevant item even if they point in the same direction.
Large language models use the scaled dot product directly inside the attention mechanism. For every token, a query vector and a set of key vectors are compared via dot product to determine which parts of the context deserve attention — a direct descendant of the arithmetic explored in Section 5.
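To make the attention step concrete, here is a stripped-down sketch of scaled dot-product attention weights for a single query. It assumes no learned projections, no batching, and no multiple heads, all of which real transformer layers have; the function names are illustrative, and the temperature vectors stand in for a query and two keys.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores for one query, normalised with softmax."""
    d_k = len(query)
    scores = [dot(query, key) / math.sqrt(d_k) for key in keys]
    return softmax(scores)

# Toy stand-ins: Hot as the query, Warm and Cold as the keys.
query = [1.0, 0.8, 0.6]
keys = [
    [0.8, 0.6, 0.4],    # Warm
    [-0.6, 0.4, -0.8],  # Cold
]

weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The key that agrees with the query (Warm) receives most of the weight. In a full attention layer these weights would then be used to take a weighted average of value vectors.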
Quick Reference: Embedding Operations
| Operation | Formula | Result (Sections 2-5) |
|---|---|---|
| Euclidean Distance | √( Σ (aᵢ − bᵢ)² ) | d(Hot,Warm) = 0.346; d(Hot,Cold) = 2.163 |
| Cosine Similarity | (a·b) / (‖a‖×‖b‖) | cos(Hot,Warm) = +0.998; cos(Hot,Cold) = -0.499 |
| Vector Arithmetic | a ± b | King-Man+Woman → nearest Queen (d = 0.400) |
| Scalar Multiplication | λ · a | Large × 2 → near Gigantic; Loud ÷ 2 → near Soft |
| Dot Product | a·b = Σ aᵢbᵢ | cos = 1.00 between the two; v·v = 0.29 (soft) vs 2.61 (loud) |
✦ This article was generated with the assistance of Claude by Anthropic ✦