How we built Nuggets - a browser extension to summarise text from any webpage
We only learned about this awesome hackathon on the 29th, and we knew it would be a struggle to get anything live by the 2nd of Jan. Somehow we ended up hacking it all together, so here we go: we (Tejas, Digvijay, and I) have made a Web App and a Browser Extension that summarises text, to help you extract value from the content you consume, easily and quickly.
Also, this goes without saying, but this is our submission for Hashnode's Christmas Hackathon.
We present to you, Nuggets.
Here's a quick video demo for you -
The entire project has got three main parts to it -
- The Frontend (web app and browser extension)
- The Backend
- The Text Summarization Algorithm
So without any further ado, let's get STRAIGGGHT into it!
1. The Frontend
The frontend was pretty straightforward: we used Next.js with the Chakra UI library to speed up development. Authentication wasn't handled on our own server; instead, we went with Firebase for social authentication and used the user's email to identify summary documents in the database.
Building the Chrome extension required a bit of research. Writing the logic in the background script let us get the currently open tab's URL, title and icon URL, which makes it easier for users to recognise the content they saved and are referring to.
2. The Backend
The crux of the server's job is to call the text summarisation service to analyse and summarise the text, and then store the result in a persistent DB for future reading.
- Database: We went ahead with a simple RDBMS solution; you can argue all you want about how NoSQL is more appropriate for the application given all of the textual content, but as I said before, time was of the essence so we stuck to what we were most familiar with.
- Task management: You ask why? Well, the text summarization algorithm we used (which we will get to shortly) is very compute intensive, which immediately raises a concern: what do we do when we have multiple summarization requests? Won't the HTTP connections be kept open till the summaries are obtained? That is extremely bad API and server design, which is why we made the summarization requests asynchronous! That is, once you submit a summarization request, you get a tracking id which you can use with an API to check the status of your request. Once the task is completed by a worker, you get to view the summarized text! So, what task management tool did we use? Celery, with Redis as the broker. Like I said, ease over everything.
- View all summaries: You've taken so much trouble to select your content and invoke summarization with our extension, it'd be criminal if we weren't saving them for you to read later on. Of course, at the same time, you might not want to save each of them, and hence we've set up an hourly cron (with Celery Beat!) that takes out the weeds: it deletes all summaries which haven't been saved by the user!
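To make the submit-then-poll flow a bit more concrete, here's a rough sketch of the idea. It is not our Celery setup: to keep it self-contained, it swaps in Python's built-in ThreadPoolExecutor for the Celery worker and a plain dict for the Redis-backed task store, and the names submit_summary, get_status and fake_summarise are made up purely for illustration.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Celery worker pool (assumption: not the real setup).
executor = ThreadPoolExecutor(max_workers=2)
# Stand-in for the broker/result store: tracking id -> pending result.
tasks = {}


def fake_summarise(text):
    """Placeholder for the compute-heavy summariser: keeps the first sentence."""
    return text.split(".")[0].strip() + "."


def submit_summary(text):
    """Queue the work and hand back a tracking id straight away."""
    tracking_id = str(uuid.uuid4())
    tasks[tracking_id] = executor.submit(fake_summarise, text)
    return tracking_id


def get_status(tracking_id):
    """What a status endpoint could return for a given tracking id."""
    future = tasks.get(tracking_id)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "summary": future.result()}
```

The important bit is that submit_summary returns immediately, so no HTTP connection has to stay open while the summary is being computed; the client just keeps calling get_status with its tracking id until the status flips to "done".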
3. The Summarization Algorithm
Yippee, we're in NLP land now! In this section, I'll take you through how we implemented the algorithm that does the text summarization.
We've used the TF-IDF algorithm, which stands for "Term Frequency - Inverse Document Frequency." A TF-IDF score is basically the product of two different scores, TF and IDF.
Let's look at them one-by-one first -
- Term Frequency (TF) - TF is a score that indicates how often a term appears in a given document (for us, a document is a single sentence).
- Inverse Document Frequency (IDF) - IDF indicates how rare or unique a term is across all the documents.
Now, for each word, multiplying the TF and IDF scores gives you the final TF-IDF score of that word.
To find the TF-IDF score of a sentence, you simply sum up the TF-IDF score of each word in the sentence.
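If you'd like to see the arithmetic, here's a tiny hand-rolled sketch using the textbook formulas (TF as a word's share of the sentence, IDF as log(N / number of sentences containing the word)). The toy sentences and the function names are made up for illustration. Note that scikit-learn's TfidfVectorizer, which we use further down, applies a smoothed IDF and normalises each row by default, so its numbers won't match these exactly.

```python
import math
from collections import Counter

# Toy corpus: each "document" is one already-tokenised sentence.
sentences = [
    ["cats", "chase", "mice"],
    ["cats", "sleep"],
    ["mice", "eat", "cheese"],
]


def tf(term, sentence):
    """Term frequency: the share of the sentence's words that are `term`."""
    return Counter(sentence)[term] / len(sentence)


def idf(term, corpus):
    """Inverse document frequency: log(N / sentences containing `term`).

    Only call this for terms that occur somewhere in the corpus.
    """
    containing = sum(1 for sentence in corpus if term in sentence)
    return math.log(len(corpus) / containing)


def sentence_score(sentence, corpus):
    """Sum of TF * IDF over each distinct word in the sentence."""
    return sum(tf(t, sentence) * idf(t, corpus) for t in set(sentence))


scores = [sentence_score(s, sentences) for s in sentences]
```

Here "chase" appears in only one sentence while "cats" appears in two, so "chase" gets the higher IDF and contributes more to its sentence's score.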
Once you've obtained the score for each sentence, you can make an assumption: the higher the score, the more important the sentence.
So, for summarization, you can take a couple of different approaches -
1. Pick top X%
Once you have the TF-IDF score for all the sentences, you can pretty easily select, let's say, the top 10% of sentences and return them as the summary. The higher the score, the better the sentence is for the summary.
2. Pick sentences with a score above the average
The alternative approach is to find the average TF-IDF score of the sentences and then pick the sentences that have a score higher than the average.
You can also tweak this threshold by multiplying the average by a factor.
Let's say you wanted more sentences in your summary: multiply the average by 0.75 to lower the selection criteria.
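Both approaches fit in a few lines of plain Python. This is just a sketch to show the idea: the function names and the scores are made up, and rounding the top X% up to at least one sentence is my own assumption.

```python
import math


def pick_top_fraction(scores, fraction=0.1):
    """Indices of the highest-scoring sentences, kept in document order."""
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def pick_above_average(scores, factor=1.0):
    """Indices of sentences scoring at least factor * average."""
    threshold = factor * sum(scores) / len(scores)
    return [i for i, score in enumerate(scores) if score >= threshold]


scores = [0.2, 0.9, 0.4, 0.7, 0.3]  # made-up sentence scores

top_half = pick_top_fraction(scores, fraction=0.5)
above_avg = pick_above_average(scores)
more_lenient = pick_above_average(scores, factor=0.75)
```

With these scores the average is 0.5, so only two sentences clear it; dropping the factor to 0.75 lowers the bar to 0.375 and lets a third sentence into the summary.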
The awesome part is, Python has a super-rich ecosystem for NLP and Machine Learning in general. You don't need to write the algorithm from scratch; you just need to understand how it's used and what you can do with it.
For the TF-IDF algorithm, we'll use scikit-learn, a Python library for machine learning.
from sklearn.feature_extraction.text import TfidfVectorizer
model = TfidfVectorizer()
res = model.fit_transform(sents).toarray()
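res is a 2-D NumPy array that holds the TF-IDF score for each word, with one row per sentence and one column per word in the vocabulary.
You can do some NumPy magic to average the scores of each sentence and turn those averages into a threshold value -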
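import numpy as np
def get_threshold_value(res):
    # Average TF-IDF score of each sentence (row); NaN averages become 0
    avg_of_sents = np.nan_to_num(np.apply_along_axis(get_average_of_sent, 1, res))
    # The threshold is the mean of the per-sentence averages
    sum_of_averages = avg_of_sents.sum()
    return sum_of_averages / avg_of_sents.shape[0]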
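The get_average_of_sent helper isn't shown in this post. Here's one plausible version, and to be clear this is a guess for illustration rather than the exact code from the project: it averages only the non-zero entries of a row, so that words missing from a sentence don't drag its score down. A row of all zeros (say, a sentence made up only of stop words) then has nothing to average and gives NaN, which is the kind of case np.nan_to_num is there to clean up. The toy matrix is made up too.

```python
import numpy as np


def get_average_of_sent(sent):
    """Hypothetical helper: mean of the non-zero TF-IDF values in one row."""
    nonzero = sent[sent != 0]
    # No scored words at all: return NaN and let np.nan_to_num turn it into 0.
    return nonzero.mean() if nonzero.size else np.nan


# Made-up TF-IDF matrix: 3 sentences (rows) x 4 vocabulary words (columns).
toy = np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # sentence with no scored words
    [0.2, 0.4, 0.0, 0.6],
])

averages = np.nan_to_num(np.apply_along_axis(get_average_of_sent, 1, toy))
threshold = averages.sum() / averages.shape[0]
```

Whichever way you define the helper, the threshold is just the mean of the per-sentence averages.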
And then, some more NumPy magic to find sentences that are over the threshold value -
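def find_sentences(res, factor=1):
    # Scale the threshold: a factor below 1 lets more sentences through
    threshold = get_threshold_value(res) * factor
    filter_arr = []
    for sentence in res:
        # True if this sentence's average score clears the threshold
        filter_arr.append(bool(get_average_of_sent(sentence) >= threshold))
    return filter_arr

# sentences must be a NumPy array of the original sentences; THRESHOLD is the factor
summary = sentences[find_sentences(res, THRESHOLD)].tolist()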
Et voila, here's your summary.
To keep this section concise, I've skipped the very first step - preparing the data for the model. In any ML pipeline, you first have to clean the data before giving it to the model. Our pipeline has several such steps too: breaking the text down into sentences (tokenizing), removing stop words (words that don't add much meaning - "and", "the", etc.), removing punctuation, and so on. You could use spaCy for all of these tasks and much more.
This is a basic method of summarizing text; there's a whole array of techniques that we could explore for better results. I'll write about them as and when I learn about them, so keep an eye on this blog if you're interested 👀
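To give a feel for what those cleaning steps look like, here's a deliberately naive sketch using only the standard library. It is not spaCy and not our actual pipeline: the regex sentence splitter, the tiny stop-word list and the function names are all simplifications for illustration.

```python
import re
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_sentence(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    no_punct = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in STOP_WORDS)


text = "The cat sat on the mat. It is a sunny day! Is the dog in the garden?"
sentences = split_sentences(text)
cleaned = [clean_sentence(s) for s in sentences]
```

You'd score the cleaned sentences but build the summary from the original ones, so the reader still gets proper, readable text.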
We'd love for you to try out our project, Nuggets.
That's it from our side; we hope you liked our project. Wish us luck for the hackathon! Also, we're happy to answer any questions you have in the comments below.