Introduction

Text summarization helps manage information overload by generating a concise version of extensive textual content, which makes it a valuable tool in today’s era of abundant information. There are generally two approaches to summarization: extractive and abstractive. In this two-part article, we will look at implementing both of these techniques in Python.

Extractive Summarization

Extractive summarization involves extracting important sentences from a longer text, generally based on the frequency of relevant words in each sentence. No new sentences are generated; instead, the most important sentences or phrases from the original text are selected to form a shorter summary.

Tokenization

Now, let’s dive into our summarization code. To summarize a text, we first create tokens from the long-form content. Tokenization is simply splitting text into smaller pieces called “tokens”. For instance:

Sentence: “Today, I will learn text summarization techniques in Python”.

Tokenization:

  • “Today”
  • ","
  • “I”
  • “will”
  • “learn”
  • “text”
  • “summarization”
  • “techniques”
  • “in”
  • “Python”
  • “.”

While a basic .split() could be used, we will use the spaCy library for more accurate tokenization. For instance, “you’re” does not contain whitespace but should be split into two tokens, “you” and “’re”. The library also handles commas, periods, hyphens, and quotes for us.
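Here is a minimal sketch of what spaCy’s tokenizer produces for such a case; it assumes spaCy and the en_core_web_sm model, which we install in the next step, are available:

import spacy

#Load the small English model and tokenize a short sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("You're learning text summarization in Python.")
print([token.text for token in doc])
#Typically prints: ['You', "'re", 'learning', 'text', 'summarization', 'in', 'Python', '.']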

First, we will create a virtual environment, which gives us a separate, stable, reproducible, and portable setup.

pip install virtualenv
virtualenv env
source env/bin/activate

Now, let’s install spaCy.

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

With the stage set, let's write our summarization code:

import spacy

def summarize(text):
  nlp = spacy.load("en_core_web_sm")  #Load the small English language model
  doc = nlp(text)  #Process the input text into a spaCy Doc

Initializing a spaCy language model ("en_core_web_sm") and processing a given text.

Counting Word Frequencies

The next step in our summarization process is to tally word frequencies in the document. We base this on the assumption that relevant words appear more often.

word_frequencies = {}
for token in doc:
  word = token.text.lower()  #Lowercase so "Python" and "python" are counted together
  if word not in word_frequencies:
    word_frequencies[word] = 1
  else:
    word_frequencies[word] += 1

Calculating word frequencies in a document.

In the above code snippet, we loop through each token in the document, recording how often each one appears. This tally is key to identifying the most significant words in the text.

However, it’s important to acknowledge a flaw in this logic. Stop words such as “the” or “is”, along with punctuation, would dominate the word frequencies. To mitigate this, we need to filter them out.

from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS

#Extract tokens excluding stop words and punctuation
tokens = [token.text.lower() for token in doc if token.text.lower() not in STOP_WORDS and token.text.lower() not in punctuation]

#Count words
word_frequencies = {}

for word in tokens:
  if word not in word_frequencies:
    word_frequencies[word] = 1
  else:
    word_frequencies[word] += 1

Tokenizing, excluding stop words, and counting frequencies.

In the above code, we create a filtered list of tokens by excluding common stop words and punctuation. This ensures that word frequencies are recorded only for meaningful content.

Normalizing Word Frequencies

Having counted the frequencies of the words in our text, our next step is to normalize them. Normalization refers to the process of adjusting or scaling values to a common scale; here, we divide each count by the highest count so that every frequency falls between 0 and 1.

#Normalize word frequencies
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
  word_frequencies[word] /= max_frequency
  

Normalizing word frequencies based on maximum frequency

In the above code snippet, we determine the highest frequency among all the words and normalize each word's frequency by dividing it by that maximum, so that the frequencies are relative and not skewed by the absolute number of occurrences.
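As a quick illustration with made-up counts, normalization maps the most frequent word to 1.0 and every other word to a fraction of it:

#Toy example of normalization (illustrative counts only)
word_frequencies = {"summarization": 4, "python": 2, "tokens": 1}
max_frequency = max(word_frequencies.values())  #4
for word in word_frequencies:
  word_frequencies[word] /= max_frequency
print(word_frequencies)  #{'summarization': 1.0, 'python': 0.5, 'tokens': 0.25}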

Scoring Sentences

With our word frequencies normalized, our next task is to score the sentences in our text. We assign each sentence a score by summing the normalized frequencies of the words it contains; this is where normalization ensures that the sentence_scores are not skewed.

#Extract sentences and calculate scores
sentence_scores = {}
for sentence in doc.sents:
  for word in sentence:
    if word.text.lower() in word_frequencies:
      if sentence not in sentence_scores:
        sentence_scores[sentence] = word_frequencies[word.text.lower()]
      else:
        sentence_scores[sentence] += word_frequencies[word.text.lower()]
        

Scoring sentences based on normalized word frequencies.

In this process, we iterate through each sentence in the document and accumulate scores based on the normalized frequencies of the words within those sentences. 
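As a tiny illustration with made-up normalized frequencies, a sentence's score is simply the sum of the frequencies of its known words:

#Toy example of sentence scoring (illustrative values only)
word_frequencies = {"summarization": 1.0, "python": 0.5}
sentence = "Python makes summarization easy"
score = sum(word_frequencies.get(word.lower(), 0) for word in sentence.split())
print(score)  #1.5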

Now that we’ve scored each sentence based on the word_frequencies, let's create our summary.

from heapq import nlargest  #nlargest comes from Python's heapq module

ratio = 0.1  #Fraction of sentences to include in the summary
select_length = int(len(list(doc.sents)) * ratio)  #doc.sents is a generator, so convert it to a list first
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sentence.text for sentence in summary]
return ' '.join(final_summary)

Selecting top sentences for extractive summarization

In the above snippet, the ratio indicates that we want to include 10% of the sentences in the final summary. The nlargest function is part of Python's heapq module and retrieves the n largest elements from an iterable; passing sentence_scores.get as the key makes it return the n highest-scoring sentences.
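As a standalone illustration with made-up scores, nlargest picks the top-scoring keys of a dictionary when its .get method is passed as the key function:

from heapq import nlargest

#Toy sentence scores (illustrative values only)
scores = {"sentence A": 0.9, "sentence B": 2.4, "sentence C": 1.1}
print(nlargest(2, scores, key=scores.get))  #['sentence B', 'sentence C']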

Here’s the full code:

from string import punctuation
from heapq import nlargest

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

def summarize(text):
  nlp = spacy.load("en_core_web_sm")
  doc = nlp(text)
  #Extract tokens, excluding stop words and punctuation
  tokens = [token.text.lower() for token in doc if token.text.lower() not in STOP_WORDS and token.text.lower() not in punctuation]
  #Count word frequencies
  word_frequencies = {}
  for word in tokens:
    if word not in word_frequencies:
      word_frequencies[word] = 1
    else:
      word_frequencies[word] += 1
  #Normalize word frequencies
  max_frequency = max(word_frequencies.values())
  for word in word_frequencies.keys():
    word_frequencies[word] /= max_frequency
  #Extract sentences and calculate scores
  sentence_scores = {}
  for sentence in doc.sents:
    for word in sentence:
      if word.text.lower() in word_frequencies:
        if sentence not in sentence_scores:
          sentence_scores[sentence] = word_frequencies[word.text.lower()]
        else:
          sentence_scores[sentence] += word_frequencies[word.text.lower()]
  #Select the top-scoring sentences for the summary
  ratio = 0.1  #Fraction of sentences to include in the summary
  select_length = int(len(list(doc.sents)) * ratio)
  summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
  final_summary = [sentence.text for sentence in summary]
  return ' '.join(final_summary)

The full code to perform extractive text summarization based on word frequencies and sentence scores.

Now, let's put our summarization code to the test. Here’s another article by me on Understanding the NodeJS Architecture, which we will summarize using our code.
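Below is a minimal sketch of how the function might be called, assuming the article's text has been saved to a local file (article.txt is a hypothetical filename):

#Read the article text from a local file and print its summary
with open("article.txt", encoding="utf-8") as f:
  article_text = f.read()

print(summarize(article_text))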

Here’s the generated summary of the article:

Single-Threaded Event Loop An event loop is an infinite single-threaded loop that keeps running as long as our NodeJS application runs, managing asynchronous operations and ensuring that our program remains responsive by efficiently handling multiple tasks without blocking the execution of other code. Conclusion To wrap up, we’ve taken a deep dive into NodeJS’ internal architecture, understanding how the Chrome V8 engine, libuv library, event loop and thread pool collaborate to craft a single-threaded, non-blocking, event-driven environment for creating awesome projects. When the event loop encounters one of these heavy-duty tasks, NodeJS intelligently offloads it to the thread pool, freeing up the main thread to continue with other essential tasks. However, instead of mindlessly repeating it like a mantra, let’s take a closer look at the terms — single-threaded, non-blocking, and event-driven — and delve into the magic that makes NodeJS so fascinating.

The summarization process has generated quite a good overview of the article. However, I encourage you to read the original content to judge its accuracy for yourself.

Conclusion

To conclude, we’ve successfully implemented extractive summarization without relying on machine learning models. This has also helped us understand text summarization in general and prepared us for the next part of this article, where we will focus on using pre-trained ML models to generate abstractive summaries and on the key concepts behind how abstractive summarization works.

Thank you for reading this article. Please leave a comment below if you have any queries or suggestions.