ℹ️ This is the first part of a two-part blog series. You can read the second part here.

What Does Word Embedding Mean?

Word embedding is a method for representing both individual words and entire sentences in numerical form. Since computers interpret information through numbers, our objective is to convert the words in a sentence into numerical values that a machine can read and digest. We also want computers to establish meaningful relationships between the words in a sentence or document, so that they develop an understanding of the connections and context between words.

For instance, chatbots use word embeddings to convert user inputs from textual data into numerical representations. This numerical representation enables the computer to understand words and sentences better, allowing the chatbot to provide more coherent and contextually relevant responses and improving the overall user experience. Word embeddings are a key element in the development of Natural Language Processing (NLP) technologies because, in essence, they provide a bridge between the linguistic complexity of human communication and the numerical processing power of computers.

Imagine words as multidimensional entities engaged in a cosmic dance, where meanings and relationships arise from their interactions rather than being defined by dictionary entries. This whimsical dance is the intuition behind word embeddings, a transformative concept in natural language processing. Word embeddings are numerical representations of words in a continuous vector space. Unlike traditional methods that treat words as discrete symbols, word embeddings capture the semantic relationships between words by placing them in a high-dimensional space where their context and meaning determine their positions.

Let's analyze the words king and queen in the context of word embeddings. In a well-trained word embedding model, these words would have numerical representations (vectors) in a continuous vector space, and the positions of these vectors are determined by the context and meaning of the words. The vectors for king and queen end up close to each other in this space because of their shared semantic link to royalty. In this way, relationships between words are captured in the high-dimensional space. The model learns these relationships from the patterns it observes in large amounts of text data during training, which enables it to comprehend and represent the semantic connections between words rather than treating them as isolated symbols.
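
To make the idea of "closeness" concrete, here is a minimal sketch with made-up 3-dimensional vectors and cosine similarity; the numbers are purely hypothetical (real embeddings are learned from data and typically have hundreds of dimensions).

```python
import numpy as np

# Toy 3-dimensional "embeddings" (hypothetical numbers, not from any trained model).
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.75, 0.70, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with related meanings should sit close together in the space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high, near 1.0
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # noticeably lower
```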

In this first part, we will explore the foundational techniques in NLP that paved the way for modern word embeddings.

Bag of Words (BoW):

The BoW model is a simple and fundamental technique for representing text data. It treats a document as an unordered set of words, disregarding grammar, word order, and structure. The primary focus of BoW is on the frequency of words in the document.

Original Sentence: "The majestic Himalayas stand tall."

Bag of Words Representation: {The: 1, majestic: 1, Himalayas: 1, stand: 1, tall: 1}

The result is a set of words with corresponding frequencies representing the document. The Bag of Words model does not consider the order of the words or the grammatical structure of the sentence; it treats each document as an unordered collection of words. While this approach simplifies the representation, some of the contextual information in the original text is lost. Despite its simplicity, BoW is widely used in various NLP applications, such as text classification, sentiment analysis, and information retrieval.
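
A quick way to see this in code is a hand-rolled word counter; the tokenization below (lowercasing and keeping only alphabetic runs) is just one simple preprocessing choice among many.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep only alphabetic tokens, and count word frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(bag_of_words("The majestic Himalayas stand tall."))
# Counter({'the': 1, 'majestic': 1, 'himalayas': 1, 'stand': 1, 'tall': 1})
```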

Stemming and Lemmatization:

Stemming

Stemming involves removing suffixes from words to obtain their root or base form, known as the stem. It is a heuristic approach that can produce stems that are not actual words, since it simply chops the last few characters off a word regardless of meaning, but it is computationally less intensive.

Example:
Original:
finally, final, finalize
Stemmed: fina

Lemmatization

Lemmatization, however, involves reducing words to their base or dictionary form, known as the lemma. Lemmatization considers the context of the word and aims to produce valid words.

Example:
Original:
finally, final, finalize
Lemmatized: final

Both are techniques for reducing words to their base or root form, thereby consolidating different variations of a word. They help simplify word representation and are often employed as preprocessing steps in text analysis tasks.
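
If you want to experiment, NLTK provides both a Porter stemmer and a WordNet lemmatizer. This is a minimal sketch, assuming NLTK and its WordNet data are installed; the exact outputs depend on the algorithm and the part-of-speech tag used, so they may differ slightly from the hand-written example above.

```python
# Assumes: pip install nltk, then nltk.download("wordnet") for the lemmatizer data.
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["finally", "final", "finalize"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically; the result is not guaranteed to be a real word.
print([stemmer.stem(word) for word in words])

# Lemmatization maps words to dictionary forms; the pos argument ("a" = adjective) guides it.
print([lemmatizer.lemmatize(word, pos="a") for word in words])
```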

Term Frequency-Inverse Document Frequency (TF-IDF):

TF measures how often a term (word) appears in a document. It is calculated as the number of times the term appears in the document divided by the total number of terms in that document. IDF assesses the importance of a term across the entire corpus and is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The TF-IDF score for a term in a document is the product of its TF and IDF scores, so it assigns higher weights to terms that are frequent within a document but relatively rare across the entire corpus.

Consider a simple corpus with two documents:

Original Corpus:

  • Document 1: "The pigeon in the hat."
  • Document 2: "The hat is too big for the pigeon."
  • Document 3: "The big pigeon."

Term Frequency (TF):

  • TF("hat", Document 1) = 1/6
  • TF("hat", Document 2) = 1/9

Inverse Document Frequency (IDF):

  • IDF("hat") = log(3/2)

TF-IDF Score:

  • TF-IDF("hat", Document 1) = (1/6) * log(3/2)
  • TF-IDF("hat", Document 2) = (1/9) * log(3/2)

The TF-IDF scores reflect the importance of the word hat in each document relative to its occurrence in the entire corpus.
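
Here is a minimal sketch that reproduces these scores in plain Python, using the formulas above (raw count divided by document length for TF, natural logarithm for IDF); note that libraries such as scikit-learn apply slightly different smoothing, so their numbers will not match exactly.

```python
import math

# The three documents from the example, tokenized by simple whitespace splitting.
corpus = [
    "the pigeon in the hat",
    "the hat is too big for the pigeon",
    "the big pigeon",
]
documents = [doc.split() for doc in corpus]

def tf(term, document):
    """Term frequency: occurrences of the term divided by the total number of terms."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)

for i, document in enumerate(documents, start=1):
    score = tf("hat", document) * idf("hat", documents)
    print(f"TF-IDF('hat', Document {i}) = {score:.4f}")
```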

Histogram to Vectors:

Consider the word frequency histograms for two documents:

  • Document 1: {"pigeon": 1, "hat": 1, "in": 1, "the": 2}
  • Document 2: {"the": 2, "hat": 1, "is": 1, "too": 1, "big": 1, "for": 1, "pigeon": 1}

Now, let's convert these histograms into vectors, using the shared vocabulary order [pigeon, hat, in, the, is, too, big, for]:

  • Vector for Document 1: [1, 1, 1, 2, 0, 0, 0, 0]
  • Vector for Document 2: [1, 1, 0, 2, 1, 1, 1, 1]

Each element in the vector corresponds to the frequency of a specific word from the shared vocabulary. These vectors provide a concise numerical representation of the word distribution within each document, making it easier to perform mathematical operations or comparisons. This example demonstrates how converting histograms to vectors can enhance the representation of text data.
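
The conversion takes only a few lines: fix a shared vocabulary order, then read off each document's count for every word in that order, filling in zeros where a word is absent. A minimal sketch, assuming the same toy documents and vocabulary order as above:

```python
from collections import Counter

document_1 = "the pigeon in the hat".split()
document_2 = "the hat is too big for the pigeon".split()

histogram_1, histogram_2 = Counter(document_1), Counter(document_2)

# Shared vocabulary order (matching the vectors shown above).
vocabulary = ["pigeon", "hat", "in", "the", "is", "too", "big", "for"]

vector_1 = [histogram_1.get(word, 0) for word in vocabulary]
vector_2 = [histogram_2.get(word, 0) for word in vocabulary]

print(vector_1)  # [1, 1, 1, 2, 0, 0, 0, 0]
print(vector_2)  # [1, 1, 0, 2, 1, 1, 1, 1]
```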

In the next part, we'll continue to unravel the connections between words, exploring models like Continuous Bag of Words, Skip-gram, and transformers.

Thank you for reading this article. Please consider subscribing if you loved the article.