From Token IDs to Semantic Embeddings

The Mathematical Journey from Discrete Symbols to Continuous Meaning

Step 1: Token ID → Embedding Matrix Lookup

• Token ID: 1247 (integer identifier for "aunt")

• Matrix Lookup: E[1247, :] (row extraction from the embedding matrix)

• Dense Vector: [0.23, -0.45, 0.78, ..., 0.12] (300-dimensional real vector)

The Embedding Matrix: A learned parameter matrix $E \in \mathbb{R}^{|V| \times d}$ where:

• $|V|$ = vocabulary size (e.g., 50,000 words)

• $d$ = embedding dimension (e.g., 300)

• Each row $E_i$ contains the embedding for token $i$

$$\text{embed}(\text{token}_i) = E[i, :] \in \mathbb{R}^d$$
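
A minimal NumPy sketch of this lookup, using the sizes from the example above and a randomly initialized matrix standing in for the learned $E$:

```python
import numpy as np

# Sizes from the running example (assumed, not from a real model).
VOCAB_SIZE = 50_000   # |V|
EMBED_DIM = 300       # d

# In a trained model E is learned; here it is random, for illustration only.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))

token_id = 1247                 # integer ID for "aunt" in the example vocabulary
v_aunt = E[token_id, :]         # row extraction: embed(token_i) = E[i, :]

print(v_aunt.shape)             # (300,)
print(v_aunt[:4])               # first few components of the dense vector
```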

Step 2: Interactive Matrix Visualization

Live Lookup Demonstration (interactive demo): the input word "aunt" maps to token ID 1247, the row E[1247, :] is extracted from the embedding matrix, and the resulting 300-dimensional output vector is displayed.
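
The same lookup can be reproduced with pretrained vectors. A sketch using gensim's downloader API (assuming gensim >= 4 is installed, the "glove-wiki-gigaword-300" model is available, and "aunt" is in its vocabulary; the token ID will differ from 1247):

```python
import gensim.downloader as api

# Download/load 300-dimensional GloVe vectors (a large download on first run).
model = api.load("glove-wiki-gigaword-300")

word = "aunt"
token_id = model.key_to_index[word]   # input word -> integer ID
vector = model[word]                  # ID/row lookup -> 300-D vector

print(word, "->", token_id, "->", vector.shape)
```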

Step 3: How These Embeddings Are Learned

Word2Vec Training Process

Skip-gram Objective: Predict context words given center word

$$\max_E \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t)$$

Where: $p(w_O | w_I) = \frac{\exp(\vec{v}^{\prime T}_{w_O} \vec{v}_{w_I})}{\sum_{w=1}^{|V|} \exp(\vec{v}^{\prime T}_{w} \vec{v}_{w_I})}$, with $\vec{v}_w$ denoting the input (center-word) vector and $\vec{v}^{\prime}_w$ the output (context-word) vector of word $w$.
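
A small NumPy sketch of this softmax with separate input and output matrices and hypothetical token IDs (real Word2Vec implementations avoid the full sum over $|V|$ via negative sampling or hierarchical softmax):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 50_000, 300
E_in = rng.normal(scale=0.1, size=(V, d))    # input (center-word) vectors
E_out = rng.normal(scale=0.1, size=(V, d))   # output (context-word) vectors

def context_prob(center_id: int, context_id: int) -> float:
    """Softmax probability p(w_O | w_I) over the whole vocabulary."""
    scores = E_out @ E_in[center_id]         # dot product with every output vector
    scores -= scores.max()                   # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[context_id])

p = context_prob(center_id=1247, context_id=42)   # hypothetical IDs
```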

Example: in the training sentence "The kind aunt visited us", the center word is "aunt" and its context words are "The", "kind", "visited", and "us".

Training maximizes the probability of these context words given the center word "aunt".
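
Generating the (center, context) training pairs from such a sentence is straightforward; a sketch with a context window of 2, chosen for illustration:

```python
# Build (center, context) pairs from the example sentence (no subsampling).
sentence = ["the", "kind", "aunt", "visited", "us"]
window = 2

pairs = []
for t, center in enumerate(sentence):
    for j in range(-window, window + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print([p for p in pairs if p[0] == "aunt"])
# [('aunt', 'the'), ('aunt', 'kind'), ('aunt', 'visited'), ('aunt', 'us')]
```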

Gradient Descent Updates: The embedding matrix E is updated iteratively:

$$E^{(t+1)} = E^{(t)} - \eta \nabla_E \mathcal{L}$$

Where $\mathcal{L}$ is the negative log-likelihood loss. Through this process:

  • Words appearing in similar contexts develop similar embeddings
  • Semantic relationships emerge as geometric patterns
  • Vector arithmetic becomes meaningful (king - man + woman ≈ queen)
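
A toy sketch of one such update, using the full-softmax loss on a tiny random vocabulary (hypothetical sizes and token IDs; real training loops run over millions of pairs):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, eta = 10, 4, 0.1                        # tiny vocabulary, learning rate eta
E_in = rng.normal(scale=0.1, size=(V, d))     # center-word embeddings
E_out = rng.normal(scale=0.1, size=(V, d))    # context-word embeddings

center, context = 3, 7                        # one hypothetical training pair

# Forward pass: p(w | center) for every w, and the loss L = -log p(context | center).
scores = E_out @ E_in[center]
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[context])

# Backward pass: gradients of L with respect to the embeddings involved.
d_scores = probs.copy()
d_scores[context] -= 1.0                      # softmax cross-entropy gradient
grad_in = E_out.T @ d_scores                  # gradient for the center row of E_in
grad_out = np.outer(d_scores, E_in[center])   # gradient for every row of E_out

# Gradient-descent step: E^(t+1) = E^(t) - eta * grad.
E_in[center] -= eta * grad_in
E_out -= eta * grad_out
```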

Step 4: Resulting Semantic Vector Space

Aunt Embedding Vector (300 dimensions): $\vec{v}_{\text{aunt}} = [0.23, -0.45, 0.78, \ldots, 0.12]$, the dense vector extracted in Step 1.

From Integer to Semantic Meaning:

The transformation can be summarized as:

$\text{"aunt"} \rightarrow 1247 \rightarrow E[1247, :] \rightarrow \vec{v}_{\text{aunt}} \in \mathbb{R}^{300}$

This embedding vector now encodes semantic properties like:

  • Gender (similar to woman, mother, sister)
  • Family relationship (similar to uncle, cousin, nephew)
  • Human characteristics (similar to person, individual)
  • Social roles and cultural context
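
"Similar to" here means geometric closeness, usually measured with cosine similarity; a minimal helper, as a sketch:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: near 1.0 for closely related words, near 0.0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With trained vectors one would expect, for example (assumed, model-dependent):
# cosine(v_aunt, v_uncle) > cosine(v_aunt, v_car)
```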

Mathematical Summary: The Complete Transformation

The Full Pipeline

$$\text{Tokenization: } \text{"aunt"} \xrightarrow{\text{vocab}} 1247 \in \mathbb{Z}^+$$

$$\text{Embedding Lookup: } 1247 \xrightarrow{E[\cdot]} \vec{v}_{\text{aunt}} \in \mathbb{R}^d$$

$$\text{Semantic Operations: } \vec{v}_{\text{aunt}} \approx \vec{v}_{\text{queen}} - \vec{v}_{\text{king}} + \vec{v}_{\text{uncle}}$$

Key Insight: The embedding matrix $E$ acts as a learned continuous representation that maps discrete symbolic tokens into a high-dimensional real vector space where semantic relationships become geometric relationships.

Why this works:

  • Continuous optimization in $\mathbb{R}^d$ allows gradient-based learning
  • High dimensionality provides sufficient representational capacity
  • Co-occurrence statistics capture semantic similarity
  • Vector space structure enables algebraic manipulation of meaning
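
The whole pipeline fits in a few lines. A self-contained sketch with a toy vocabulary and a random matrix standing in for a trained $E$ (all token IDs other than 1247 are made up):

```python
import numpy as np

# Toy vocabulary: string -> token ID (hypothetical IDs).
vocab = {"aunt": 1247, "uncle": 1248, "king": 17, "queen": 18}

rng = np.random.default_rng(3)
E = rng.normal(scale=0.1, size=(50_000, 300))   # stand-in for a trained matrix

def embed(word: str) -> np.ndarray:
    token_id = vocab[word]      # Tokenization: "aunt" -> 1247
    return E[token_id, :]       # Embedding lookup: 1247 -> v_aunt in R^d

# Semantic operations are ordinary vector arithmetic in R^300.
v = embed("queen") - embed("king") + embed("uncle")
print(v.shape)                  # (300,)
```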

The Answer to Your Friend's Question

Word embeddings transform discrete symbolic tokens into continuous vector representations where semantic operations correspond to geometric operations. The mathematical structure that enables vector addition, subtraction, scalar multiplication, and meaningful arithmetic relationships relies fundamentally on the properties of real numbers as a field.

The field of the vector space is: $\mathbb{R}$ (the real numbers)

This is why aunt ≈ queen - king + uncle works mathematically: all operations are well-defined in the real number field $\mathbb{R}$.
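
With real pretrained vectors the analogy can be tested directly, for instance by continuing the gensim sketch from Step 2 (whether "aunt" actually ranks first depends on the model and corpus):

```python
# Nearest neighbours of queen - king + uncle in the pretrained vector space.
result = model.most_similar(positive=["queen", "uncle"], negative=["king"], topn=3)
print(result)   # "aunt" is expected near the top, but this is model-dependent
```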