The Mathematical Journey from Discrete Symbols to Continuous Meaning
The Embedding Matrix: A learned parameter matrix $E \in \mathbb{R}^{|V| \times d}$ where:
• $|V|$ = vocabulary size (e.g., 50,000 words)
• $d$ = embedding dimension (e.g., 300)
• Each row $E_i$ contains the embedding for token $i$
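To make the lookup concrete, here is a minimal NumPy sketch; the sizes match the examples above, while the random initialization and the token id 1234 are placeholders (in practice $E$ is learned during training):

```python
import numpy as np

# Vocabulary size |V| and embedding dimension d from the examples above.
V, d = 50_000, 300

# In practice E is a learned parameter; random initialization stands in here.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(V, d))

token_id = 1234            # hypothetical integer id of some token
embedding = E[token_id]    # row E_i: the d-dimensional vector for token i
print(embedding.shape)     # -> (300,)
```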
Skip-gram Objective: Predict context words given the center word by maximizing the average log-probability of observed (center, context) pairs over a corpus of $T$ tokens with context window size $c$:
$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
Where: $p(w_O \mid w_I) = \frac{\exp(\vec{v}_{w_O}^T \vec{v}_{w_I})}{\sum_{w=1}^{|V|} \exp(\vec{v}_w^T \vec{v}_{w_I})}$
Training maximizes the probability of context words given the center word "aunt"
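A direct (if impractically slow) NumPy rendering of this softmax is sketched below. It uses a single embedding matrix $E$ to match the notation above, whereas word2vec in practice keeps separate input and output embedding matrices; the matrix and the ids are placeholders:

```python
import numpy as np

def skipgram_prob(E, center_id, context_id):
    """p(w_O | w_I): softmax over the vocabulary of dot products v_w . v_{w_I}."""
    scores = E @ E[center_id]              # one score per vocabulary word
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

E = np.random.default_rng(0).normal(scale=0.01, size=(50_000, 300))
print(skipgram_prob(E, center_id=1234, context_id=4321))   # placeholder ids
```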
Gradient Descent Updates: The embedding matrix $E$ is updated iteratively:
$E \leftarrow E - \eta \, \frac{\partial \mathcal{L}}{\partial E}$
Where $\eta$ is the learning rate and $\mathcal{L}$ is the negative log-likelihood loss. Through this process, words that appear in similar contexts are gradually pushed toward similar embedding vectors.
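As a sketch of a single update, the code below takes one SGD step on $\mathcal{L} = -\log p(w_O \mid w_I)$ for one (center, context) pair. It uses separate input and output matrices, as word2vec does, so the gradients take a simple closed form; the learning rate and ids are illustrative:

```python
import numpy as np

def sgd_step(E_in, E_out, center_id, context_id, lr=0.025):
    """One SGD step on L = -log p(w_O | w_I) for a single training pair."""
    v_in = E_in[center_id].copy()                   # copy so updates below use old values
    scores = E_out @ v_in
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # p(w | w_I) for every w

    # dL/dv_in = sum_w p(w) u_w - u_{w_O}
    grad_in = probs @ E_out - E_out[context_id]

    # dL/du_w = (p(w) - [w == w_O]) * v_in, applied to every output vector
    E_out -= lr * np.outer(probs, v_in)
    E_out[context_id] += lr * v_in
    E_in[center_id] -= lr * grad_in

rng = np.random.default_rng(0)
E_in = rng.normal(scale=0.01, size=(50_000, 300))
E_out = rng.normal(scale=0.01, size=(50_000, 300))
sgd_step(E_in, E_out, center_id=1234, context_id=4321)   # placeholder ids
```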
From Integer to Semantic Meaning:
The transformation can be summarized as: token $\rightarrow$ integer index $i$ $\rightarrow$ one-hot vector $x \in \{0, 1\}^{|V|}$ $\rightarrow$ embedding $E^T x = E_i \in \mathbb{R}^d$.
This embedding vector now encodes semantic properties, such as gender and family relationship for the running example "aunt", that the bare integer index could not express.
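Viewed as matrix algebra, the lookup is just a one-hot vector times $E$; the sketch below checks this equivalence (the vocabulary index for "aunt" is hypothetical):

```python
import numpy as np

V, d = 50_000, 300
E = np.random.default_rng(0).normal(scale=0.01, size=(V, d))

aunt_id = 421                     # hypothetical index of "aunt" in the vocabulary
one_hot = np.zeros(V)
one_hot[aunt_id] = 1.0

# The matrix product with a one-hot row vector selects exactly row E_i.
assert np.allclose(one_hot @ E, E[aunt_id])
```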
Key Insight: The embedding matrix $E$ acts as a learned continuous representation that maps discrete symbolic tokens into a high-dimensional real vector space where semantic relationships become geometric relationships.
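One common way to read off those geometric relationships is cosine similarity between embedding rows; a minimal sketch, with random placeholder vectors standing in for trained embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the geometric counterpart of semantic relatedness."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With trained embeddings, cosine("aunt", "uncle") would exceed
# cosine("aunt", "carburetor"); random placeholders are used here.
rng = np.random.default_rng(0)
aunt, uncle, carburetor = rng.normal(size=(3, 300))
print(cosine(aunt, uncle), cosine(aunt, carburetor))
```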
Why this works:
Word embeddings transform discrete symbolic tokens into continuous vector representations where semantic operations correspond to geometric operations. The mathematical structure that enables vector addition, subtraction, scalar multiplication, and meaningful arithmetic relationships relies fundamentally on the properties of real numbers as a field.
This is why $\vec{v}_{\text{aunt}} \approx \vec{v}_{\text{queen}} - \vec{v}_{\text{king}} + \vec{v}_{\text{uncle}}$ works mathematically: every operation involved is well-defined in the real number field $\mathbb{R}$, and the resulting vector can be compared against the vocabulary geometrically (typically with cosine similarity) to recover "aunt".
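The sketch below runs the analogy on hand-crafted three-dimensional toy vectors whose axes can be read as female-ness, royalty, and kinship. Real embeddings are learned rather than designed, but the arithmetic and the nearest-neighbor lookup are the same:

```python
import numpy as np

def most_similar(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy vectors on axes (female-ness, royalty, kinship) -- purely illustrative.
vectors = {
    "king":  np.array([0.0, 1.0, 0.0]),
    "queen": np.array([1.0, 1.0, 0.0]),
    "uncle": np.array([0.0, 0.0, 1.0]),
    "aunt":  np.array([1.0, 0.0, 1.0]),
    "car":   np.array([0.2, 0.1, 0.0]),
}

query = vectors["queen"] - vectors["king"] + vectors["uncle"]   # = [1, 0, 1]
print(most_similar(query, vectors, exclude={"queen", "king", "uncle"}))  # -> aunt
```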