If, like me, you have been spending hours trying to keep pace with the concepts and terminology used in large language models, then this write-up may be worth your time. The objective is to document a glossary of key terms and put them in a context that is easy to visualize and understand. Let’s start with a few definitions.
Embeddings – A technique used to represent information in a numeric format that can be easily processed by algorithms, especially deep learning models. This ‘information’ can be text, pictures, video and/or audio. An embedding is a vector of numbers, and in language models each token gets its own embedding.
Tokens – In NLP, a token is a small unit of text, typically a word or a piece of a word. For example, ‘cat’ is a single token, while a name like ‘Vaneet Arora’ would usually be split into two or more tokens. Each token is represented by an embedding vector, and the values in that vector encode how the token relates to the concepts the model has learned. In LLMs, tokens play the role of the features in your dataset.
Dimension – The number of numeric components in each embedding vector. A 768-dimensional embedding, for instance, describes each token with 768 numbers.
Transformer – Embeddings and transformers go together. A transformer is a neural network architecture that learns context, and thus meaning, by tracking relationships in sequential data like the words in this sentence. It transforms the original text into a compact numerical representation of its content and relationships to facilitate further processing.
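Before moving on, here is a minimal sketch in Python that ties these terms together. The word-level tokenizer, the tiny vocabulary and the random vectors below are invented purely for illustration; real models use learned subword tokenizers (such as BPE) and embedding tables trained on huge corpora.

```python
import numpy as np

# A toy sketch only: the vocabulary and the random vectors here are made up.
rng = np.random.default_rng(0)

vocabulary = ["cat", "dog", "run"]
embedding_dim = 4  # the "dimension": how many numbers describe each token

# One vector per token; a real table would be learned during training.
embedding_table = {token: rng.normal(size=embedding_dim) for token in vocabulary}

text = "cat dog run"
tokens = text.split()                               # naive word-level tokenization
embeddings = [embedding_table[t] for t in tokens]   # embedding lookup

print(tokens)               # ['cat', 'dog', 'run']
print(embeddings[0].shape)  # (4,) -> each token is a 4-dimensional vector
```

A production model works the same way in spirit, just with tens of thousands of subword tokens and hundreds or thousands of dimensions per vector.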
Let’s use the analogy of a library to explain these concepts:
Tokens: Imagine each word in a language as a unique book in a library. In the context of language models, these unique books are referred to as ‘tokens’. For example, ‘cat’, ‘dog’, ‘run’ are all different tokens.
Embeddings: Now, imagine if each book (token) in our library had a magical summary that not only tells us about the book itself, but also how this book relates to every other book in the library. This magical summary is what we call an ‘embedding’. An embedding captures the essence of a token and its relationship with other tokens.
Dimensions: The magical summary (embedding) is not a simple one-line summary. It’s a detailed summary with many different points – these points are what we call ‘dimensions’. Each dimension captures a different aspect of the token. For example, one dimension might capture the token’s grammatical role (is it a noun, verb, adjective?), another might capture its sentiment (is it positive, negative?), and so on.
So, in summary, each word (token) in a language is like a unique book in a library. Each book has a magical summary (embedding) that tells us about the book and its relationship with other books. And each point in that summary is a dimension that captures a different aspect of the book.
Let’s continue with our library analogy to see how dimensions work in practice, considering the words ‘king’ and ‘queen’.
In the language model’s multi-dimensional space, each dimension could represent a different characteristic. For example, one dimension might represent ‘gender’. In this dimension, ‘king’ and ‘queen’ would be far apart because one represents male and the other represents female.
Another dimension might represent ‘royalty’. In this dimension, ‘king’ and ‘queen’ would be very close together because they both represent royal figures.
Yet another dimension might represent ‘age’. If we consider ‘king’ and ‘queen’ to typically be older figures, they might be closer to words like ‘elderly’ and further from words like ‘child’ in this dimension.
So, the position of a word in this multi-dimensional space (its embedding) is determined by its relationship with other words across multiple dimensions. This is how a language model captures the nuanced meanings of words.
Remember, in practice, these dimensions are not explicitly labeled as ‘gender’, ‘royalty’, ‘age’, etc. They are learned from data and represent more abstract concepts. The labels are just for mere mortals like you and me to better understand the high-dimensional space that the model is working in.
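To make the king/queen example tangible, here is a toy sketch with three hand-labeled dimensions. The labels and the numbers are invented for illustration; as just noted, a real model’s dimensions are learned and carry no such labels.

```python
# Toy vectors with pretend dimensions: [gender, royalty, age].
# All values are invented for illustration only.
words = {
    #           gender  royalty  age
    "king":    [-1.0,    0.9,    0.7],
    "queen":   [ 1.0,    0.9,    0.7],
    "child":   [ 0.0,    0.0,   -0.9],
    "elderly": [ 0.0,    0.1,    0.9],
}

print(abs(words["king"][0] - words["queen"][0]))    # 2.0 -> far apart on 'gender'
print(abs(words["king"][1] - words["queen"][1]))    # 0.0 -> identical on 'royalty'
print(abs(words["king"][2] - words["elderly"][2]))  # ~0.2 -> close on 'age'
print(abs(words["king"][2] - words["child"][2]))    # ~1.6 -> far from 'child' on 'age'
```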
Let’s now consider a few more examples to understand how ‘embeddings’ work.
Synonyms: Words that have similar meanings are likely to have similar embeddings. For example, the words ‘happy’, ‘joyful’, and ‘elated’ are all synonyms and thus their embeddings would be close in the multi-dimensional space.
Antonyms: Words that have opposite meanings would be further apart in the embedding space. For example, ‘hot’ and ‘cold’ or ‘happy’ and ‘sad’ would have embeddings that are far apart.
Contextual meanings: Words that have different meanings based on context would have different embeddings based on their usage. For example, the word ‘bank’ in ‘river bank’ and ‘money bank’ would have different embeddings because they represent different concepts (the second sketch after this list demonstrates this with a real transformer model).
Word relationships: Relationships between words can also be captured in the embedding space. For example, the relationship between ‘man’ and ‘woman’ is similar to the relationship between ‘king’ and ‘queen’. This is often written as approximate vector arithmetic: king − man ≈ queen − woman, or equivalently king − man + woman ≈ queen (the first sketch below shows this with toy vectors).
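Here is a toy sketch of the synonym, antonym and relationship examples using cosine similarity. The vectors are hand-made, three-dimensional and invented for illustration; real embeddings have hundreds or thousands of learned dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, -1.0 = opposite direction."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made toy vectors, invented for illustration only.
vec = {
    "happy":  [ 0.9,  0.8,  0.1],
    "joyful": [ 0.85, 0.75, 0.2],
    "sad":    [-0.9, -0.8,  0.1],
    "king":   [-1.0,  0.9,  0.7],
    "queen":  [ 1.0,  0.9,  0.7],
    "man":    [-1.0,  0.1,  0.2],
    "woman":  [ 1.0,  0.1,  0.2],
}

# Synonyms point in nearly the same direction; antonyms point away.
print(cosine(vec["happy"], vec["joyful"]))  # ~0.99
print(cosine(vec["happy"], vec["sad"]))     # ~-0.99

# Word relationships as arithmetic: king - man + woman ≈ queen
analogy = np.array(vec["king"]) - np.array(vec["man"]) + np.array(vec["woman"])
print(cosine(analogy, vec["queen"]))        # 1.0 in this toy setup
```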
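And here is a sketch of the contextual-meaning point with a real model. It assumes the Hugging Face transformers package and torch are installed and that the public bert-base-uncased checkpoint can be downloaded; it extracts the vector the model assigns to ‘bank’ in two different sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes `transformers` and `torch` are installed and the public
# bert-base-uncased checkpoint can be downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector the model assigns to `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the river bank", "bank")
money = embedding_of("she deposited cash at the bank", "bank")

# The same token gets a different vector in each context, so the
# similarity is noticeably below 1.0.
print(torch.cosine_similarity(river, money, dim=0).item())
```

With static word embeddings such as word2vec, by contrast, ‘bank’ would get a single vector regardless of context.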
Lastly, do remember that these embeddings are learned from the data the model is trained on. So, the quality and diversity of the training data play a crucial role in how well these embeddings capture the nuances of a language.