This is the second part of a two-part blog series. You can read the first part here.

As we delve deeper into word embeddings, we encounter a breakthrough with the introduction of two neural network-based approaches: Continuous Bag of Words (CBOW) and Skip-gram.

Continuous Bag of Words (CBOW):

CBOW is a word embedding model that predicts a target word from the context words within a given window around it. Averaging the context into a single input in this way makes training on large datasets efficient (its counterpart, Skip-gram, operates in reverse, predicting the context from a target word). Let's explain how it operates:

Let us consider the following sentence:
"Trekkers explore Manaslu Circuit for breathtaking views."
The goal is to predict the target word "trekkers" in this sentence using CBOW with a context window of 2.

Context Window: ["explore", "Manaslu"] (since "Trekkers" is the first word of the sentence, only the right-hand context is available)

Input: One-hot encoded vectors for each word in the context.

Target Word: "Trekkers"

Typically, the context window spans the words on both sides of the target: with a window of 2, the model takes up to two words on either side (here, only the right-hand side exists, since "Trekkers" opens the sentence). CBOW then feeds the averaged context through a neural network with a hidden layer to predict the target word.
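As a concrete sketch, here is the CBOW forward pass in NumPy: one-hot context vectors are embedded, averaged, and projected to a probability distribution over the vocabulary. The dimensions and random weights below are illustrative assumptions, not a trained model.

```python
import numpy as np

# Toy CBOW forward pass (untrained, for illustration only).
sentence = "trekkers explore manaslu circuit for breathtaking views".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
V, D = len(vocab), 8                        # vocabulary size, embedding dim

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output (prediction) weights

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Window of 2 around "trekkers"; it opens the sentence, so only the
# right-hand context exists.
context = ["explore", "manaslu"]
target = "trekkers"

# Forward pass: embed each context word, average, project to the vocab.
h = np.mean([one_hot(w) @ W_in for w in context], axis=0)
probs = softmax(h @ W_out)
print(probs[vocab[target]])   # probability the model assigns to the target
```

Training would adjust `W_in` and `W_out` so this probability rises; the rows of `W_in` become the word embeddings.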


Skip-gram:

Skip-gram is another word embedding model: it takes a target word as input and predicts the context words within a specified window. It mirrors CBOW, and because each (target, context) pair is a separate training example, it tends to represent rare words well.


Using the above sentence,
"Trekkers explore Manaslu Circuit for breathtaking views."
let's predict the context words for the target word "trekkers" with a window size of 2.

Input: Target word "trekkers"

Output: Probability distribution over the context words ["explore", "Manaslu"]

Through training, Skip-gram learns to associate a target word with the words likely to appear in its context, capturing semantic relationships between words.
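The Skip-gram direction can be sketched the same way: the target word's embedding is looked up directly and projected to a shared distribution over the vocabulary, which is scored against each true context word. Again, all weights here are random placeholders, not a trained model.

```python
import numpy as np

# Toy Skip-gram forward pass (untrained, for illustration only).
sentence = "trekkers explore manaslu circuit for breathtaking views".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
V, D = len(vocab), 8

rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.1, size=(V, D))   # target-word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = "trekkers"
context = ["explore", "manaslu"]            # window of 2, right side only

h = W_in[vocab[target]]        # embedding of the target word
probs = softmax(h @ W_out)     # one distribution, scored per context word

# Skip-gram loss: negative log-likelihood of the true context words.
loss = -sum(np.log(probs[vocab[w]]) for w in context)
print(round(loss, 3))
```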

In summary, CBOW predicts a target word from its context, while Skip-gram predicts context words given a target word; both produce effective word embeddings that capture semantic relationships in language.

As we navigate through the evolution of word embedding models, we encounter a breakthrough in the landscape of natural language processing: the introduction of transformers.


Transformers:

The evolution of word embeddings takes a quantum leap with the introduction of transformer models. Transformers, known for their attention mechanism, are deep learning models that let each word dynamically weigh the importance of every other word based on context. For our example sentence, a transformer considers not only the proximity of words but also their long-range dependencies across the whole statement. This global context improves the model's capacity to represent intricate interactions, providing a more holistic understanding of language.

Architecture: Let's dive deeper into the architecture of transformers.

  • Encoder-Decoder Structure: Transformers were initially designed for sequence-to-sequence tasks, where the input and output sequences may differ in length. The architecture consists of an encoder and a decoder; for word embeddings and many NLP tasks, however, the encoder is the focus.
  • Encoder Layers: The encoder comprises multiple layers, each consisting of two main subcomponents: a self-attention mechanism and a feedforward neural network. Self-attention enables the model to consider the contextual relationships between words regardless of where they appear in the sequence.
    • In self-attention, each word in the input sequence interacts with every other word; by computing an attention weight for each of these interactions, the model captures dependencies and linkages across the sequence.
    • Consider a sentence: "I recently travelled to the breathtaking mountains. The scenery was mesmerizing, surrounded by fresh air, and the overall experience left me rejuvenated and grateful for nature."
      When analyzing the word "breathtaking," the model recognizes the contextual significance of words like "mountains," "mesmerizing," and "rejuvenated" in determining sentiment.
  • Feedforward Neural Network: A feedforward neural network processes the output of the self-attention layer. This network is typically a fully connected layer with non-linear activation functions, allowing it to capture complex patterns and relationships.
  • Layer Normalization and Residual Connections: Each sub-layer (self-attention and feedforward) is followed by layer normalization and a residual connection. These elements stabilize training and help the model retain contextual information both locally and globally.
  • Multiple Encoder Layers: The transformer architecture stacks multiple encoder layers on top of each other. Each layer refines the representation of the input sequence, capturing increasingly complex relationships and dependencies.
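The self-attention step above can be sketched in a few lines of NumPy as scaled dot-product attention. The shapes and random weights are toy assumptions; a real encoder layer adds multi-head projections, the feedforward network, residual connections, and layer normalization:

```python
import numpy as np

# Scaled dot-product self-attention, single head (illustrative sketch).
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # word-to-word scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights                        # mixed reps, attention

rng = np.random.default_rng(0)
n_words, d_model, d_k = 5, 16, 8          # toy 5-word "sentence"
X = rng.normal(size=(n_words, d_model))   # stand-in input embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Each row of `attn` sums to 1 and says how much that word attends to every other word in the sequence.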

BERT (Bidirectional Encoder Representations from Transformers):

BERT uses a bidirectional transformer architecture. In contrast to the earlier models mentioned in this blog, BERT considers both directions simultaneously during training. It is pre-trained on large amounts of text data in an unsupervised manner, learning to predict masked words in sentences, and is widely used for tasks such as question answering, sentiment analysis, and named entity recognition. This bidirectionality yields embeddings that capture a more thorough understanding of word meanings in context, improving downstream performance.

In the sentence, "Please bring a clean bowl. The bowl is in the [MASK]," BERT is trained to predict the masked word ("kitchen") by considering both the left and right context.

GPT (Generative Pre-trained Transformer):

GPT, on the other hand, is a unidirectional model. It predicts the next word in a sequence based on the preceding context, generating coherent and contextually relevant text. Pre-trained on massive datasets, GPT can understand and generate human-like text, making it versatile in natural language generation. GPT is often used for text generation tasks, chatbots, language translation, and creative writing.

In the sentence "Mount Everest is located in []," GPT is trained to predict the next word ("Nepal") from the context leading up to it.
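The practical difference between BERT's bidirectional encoder and GPT's unidirectional decoder comes down to the attention mask. The toy sketch below (random scores, illustrative only) shows GPT's causal mask zeroing out attention to future tokens, while BERT's mask lets every position attend to every other:

```python
import numpy as np

n = 4  # e.g. the tokens of "Mount Everest is located"

bert_mask = np.ones((n, n), dtype=bool)           # bidirectional: see all
gpt_mask = np.tril(np.ones((n, n), dtype=bool))   # causal: see the past only

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                  # toy attention scores

causal = masked_softmax(scores, gpt_mask)
full = masked_softmax(scores, bert_mask)
print(causal)   # upper triangle (future tokens) is exactly 0
```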

Contextualized Embeddings:

One of the remarkable advancements brought forth by transformer models is contextualized embeddings. Unlike static embeddings that assign a fixed vector to each word, contextualized embeddings vary based on the surrounding context. This dynamic nature allows models to adapt to the nuances of language, resulting in richer and more accurate representations.

Consider the word "market". In the sentence "Exploring the local market in Thamel, I discovered stalls selling traditional handicrafts and souvenirs," the contextualized embedding for "market" would differ from its representation in "The trading market is down today." The meaning of "market" is context-dependent, and contextualized embeddings capture this nuance.
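A toy sketch of why the two occurrences of "market" end up with different vectors: the same static embedding is mixed with different neighbours by attention weights, so the output depends on the sentence. The embeddings below are random placeholders, not learned values:

```python
import numpy as np

# Random stand-in embeddings for the words in the two example sentences.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in
       ["market", "local", "thamel", "handicrafts",
        "trading", "down", "today"]}

def contextualize(words, focus):
    # Attention of `focus` over its sentence, then a weighted mixture:
    # the same static vector yields different outputs in different contexts.
    X = np.stack([emb[w] for w in words])
    q = emb[focus]
    scores = X @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ X

v1 = contextualize(["local", "market", "thamel", "handicrafts"], "market")
v2 = contextualize(["trading", "market", "down", "today"], "market")
print(np.linalg.norm(v1 - v2))   # nonzero: two vectors for one word
```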

The progression from traditional word embedding models to transformers signifies a paradigm shift in natural language processing. The introduction of attention mechanisms and contextualized embeddings has improved the ability of models to capture semantic nuances and contextual intricacies, setting new benchmarks in language understanding and generation.

Thank you for reading this article. See you in the next one.