Introduction
Text summarization helps manage information overload by condensing extensive textual content into a concise version, making it an increasingly important tool in today's era of abundant information. There are generally two approaches to summarization: extractive and abstractive. In this two-part article, we will implement both techniques in Python.
Extractive Summarization
Extractive summarization is an approach that extracts important sentences from a long text, generally based on the frequency of relevant words in each sentence. No new sentences are generated; instead, the most important sentences or phrases are used verbatim to form a summary that is much shorter than the original text.
Tokenization
Now, let’s dive into our summarization code. To summarize a text, we first create tokens from the long-form content. Tokenization is simply the splitting of text into smaller pieces called “tokens”. For instance:
Sentence: “Today, I will learn text summarization techniques in Python”.
Tokenization:
- “Today”
- ","
- “I”
- “will”
- “learn”
- “text”
- “summarization”
- “techniques”
- “in”
- “Python”
- “.”
While a basic .split() could be used, we will use the spaCy library for more accurate tokenization. For instance, “you’re” contains no white space but should be split into two tokens, “you” and “’re”. The library will also handle commas, periods, hyphens, and quotes for us.
First, we will create a virtual environment, which gives us a separate, stable, reproducible, and portable setup.
pip install virtualenv
virtualenv env
source env/bin/activate
Now, let’s install spaCy.
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
With the stage set, let's write our summarization code:
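Here is a minimal sketch of the setup, assuming the long-form content lives in a string variable named text (a name chosen for illustration):

import spacy

# Load the small English pipeline we downloaded earlier
nlp = spacy.load("en_core_web_sm")

# `text` holds the long-form content we want to summarize
doc = nlp(text)

# spaCy splits the text into tokens for us
print([token.text for token in doc])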
Counting Word Frequencies
The next step in our summarization process is to tally word frequencies in the document. We are working on the assumption that relevant words appear frequently.
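A simple tally might look like this (a sketch; the word_frequencies name is our own):

word_frequencies = {}
for token in doc:
    word = token.text.lower()
    # Record how often each token appears
    if word not in word_frequencies:
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1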
In the above code snippet, we loop through each token in the document, recording how often each one appears. This tally is key to identifying the most significant words in the text.
However, it’s important to acknowledge a flaw in this logic: stop words and punctuation, such as ‘the’ or ‘is’, can skew the word frequencies. To mitigate this, we have to filter those tokens out.
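One way to filter them is with spaCy’s built-in English stop-word list and Python’s string.punctuation (a sketch under those assumptions):

from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS

word_frequencies = {}
for token in doc:
    word = token.text.lower()
    # Skip stop words and punctuation so only meaningful words are counted
    if word in STOP_WORDS or word in punctuation:
        continue
    if word not in word_frequencies:
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1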
In the above code, we create a filtered tally of tokens by excluding common stop words and punctuation. This ensures that word frequencies are recorded only for meaningful text content.
Normalizing Word Frequencies
Having counted the frequencies of the words in our text, our next step involves normalizing the word frequencies. Normalization refers to the process of adjusting or scaling values to a common scale.
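Dividing every count by the largest count is one straightforward way to do this (a sketch continuing the variables above):

# The most frequent word sets the scale
max_frequency = max(word_frequencies.values())

# After this loop, every frequency lies between 0 and 1
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency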
In the above code snippet, we determine the highest frequency among all the words and normalize each word's frequency by dividing it by that maximum, so the frequencies are relative rather than skewed by absolute occurrence counts.
Scoring Sentences
With our word frequencies normalized, our next task is to extract the important sentences from the text. We will assign each sentence a score based on the words it contains; the normalized frequencies ensure that the sentence_scores are not skewed.
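A scoring loop along these lines would do it (a sketch; sentence_scores is our own name):

sentence_scores = {}
for sentence in doc.sents:
    for token in sentence:
        word = token.text.lower()
        # Only words that survived the stop-word filter contribute
        if word in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word]
            else:
                sentence_scores[sentence] += word_frequencies[word]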
In this process, we iterate through each sentence in the document and accumulate scores based on the normalized frequencies of the words within those sentences.
Now that we’ve scored each sentence using the word_frequencies, let's create our summary.
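One way to select the top-scoring sentences (a sketch; select_length is our own name):

from heapq import nlargest

# Keep roughly 10% of the sentences
select_length = int(len(list(doc.sents)) * 0.1)

# Pick the highest-scoring sentences and stitch them together
summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary = " ".join(sentence.text for sentence in summary_sentences)
print(summary)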
In the above snippet, the ratio indicates that we want to include 10% of the sentences in the final summary. The nlargest function, part of Python's heapq module, retrieves the n largest elements from an iterable.
Here’s the full code:
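Putting the pieces together, a consolidated sketch might look like this (the summarize function and its ratio parameter are names of our own choosing):

from heapq import nlargest
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS


def summarize(text, ratio=0.1):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    # Count frequencies of meaningful words only
    word_frequencies = {}
    for token in doc:
        word = token.text.lower()
        if word in STOP_WORDS or word in punctuation:
            continue
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Normalize counts to the 0-1 range
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_frequency

    # Score each sentence by the normalized frequencies of its words
    sentence_scores = {}
    for sentence in doc.sents:
        for token in sentence:
            word = token.text.lower()
            if word in word_frequencies:
                sentence_scores[sentence] = (
                    sentence_scores.get(sentence, 0) + word_frequencies[word]
                )

    # Keep the top `ratio` share of sentences
    select_length = max(1, int(len(list(doc.sents)) * ratio))
    summary_sentences = nlargest(
        select_length, sentence_scores, key=sentence_scores.get
    )
    return " ".join(sentence.text for sentence in summary_sentences)


if __name__ == "__main__":
    article = "..."  # paste the long-form text to summarize here
    print(summarize(article))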
Now, let's put our summarization code to the test. Here’s another article of mine, Understanding the NodeJS Architecture, which we will summarize using our code.
Here’s the generated summary of the article:
Single-Threaded Event Loop An event loop is an infinite single-threaded loop that keeps running as long as our NodeJS application runs, managing asynchronous operations and ensuring that our program remains responsive by efficiently handling multiple tasks without blocking the execution of other code. Conclusion To wrap up, we’ve taken a deep dive into NodeJS’ internal architecture, understanding how the Chrome V8 engine, libuv library, event loop and thread pool collaborate to craft a single-threaded, non-blocking, event-driven environment for creating awesome projects. When the event loop encounters one of these heavy-duty tasks, NodeJS intelligently offloads it to the thread pool, freeing up the main thread to continue with other essential tasks. However, instead of mindlessly repeating it like a mantra, let’s take a closer look at the terms — single-threaded, non-blocking, and event-driven — and delve into the magic that makes NodeJS so fascinating.
The summarization process has generated quite a good overview of the article. However, I encourage you to read the original content and judge the summary's accuracy for yourself.
Conclusion
To conclude, we’ve successfully built extractive summarization without relying on machine learning models. This has also deepened our understanding of text summarization in general and prepared us for the next part of this article, where we will use pre-trained ML models to generate abstractive summaries and explore the key concepts behind how abstractive summarization works.
Thank you for reading this article. Please leave a comment below if you have any queries or suggestions.