BoW vs. TF-IDF
We humans read words effortlessly, but computers only understand numbers. So how do we bridge that gap? That’s where methods like Bag-of-Words (BoW) and TF-IDF come in.
What problem are we solving?
When we read a piece of text, the vocabulary seems practically endless. Raw text is messy and unstructured, and made up of a huge number of words, so we need a way to represent it in numerical form that is structured and easier to process. This conversion is known as vectorization, and BoW and TF-IDF are two classic ways to do it.
Bag-of-Words (BoW)
Bag-of-Words (BoW) is one of the most fundamental ways of converting text to numbers. In simple terms, it counts how many times each unique word occurs in a document. It is popular for its simplicity, intuitiveness, and effectiveness. Let’s break it down step by step and see how it works.
Suppose you have the following sentences:
- “How are you?”
- “I am very good.”
- “What a lovely day.”
Step 1
Make a set of unique words (the vocabulary):
vocab = (how, are, you, i, am, very, good, what, a, lovely, day)
len(vocab) = 11
Step 2
Initialize a zero vector of length len(vocab) for each sentence
v1 = [0,0,0,0,0,0,0,0,0,0,0]
v2 = [0,0,0,0,0,0,0,0,0,0,0]
v3 = [0,0,0,0,0,0,0,0,0,0,0]
Step 3
For each sentence, count how many times each vocabulary word occurs and replace the corresponding 0 with that count.
v1 = [1,1,1,0,0,0,0,0,0,0,0]
v2 = [0,0,0,1,1,1,1,0,0,0,0]
v3 = [0,0,0,0,0,0,0,1,1,1,1]
Now you have successfully converted the sentence to numerical form using BoW. Notice how each sentence is now just a row of numbers; no meaning, no grammar, just counts.
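If it helps to see the same three steps as code, here is a minimal sketch in plain Python; the tokenize helper is just an illustrative stand-in for whatever preprocessing you actually use:

```python
import re

sentences = ["How are you?", "I am very good.", "What a lovely day."]

def tokenize(text):
    # Naive tokenizer: lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

# Step 1: build the vocabulary, preserving first-seen order
vocab = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocab:
            vocab.append(word)

# Steps 2 and 3: start from a zero vector and fill in the counts
vectors = []
for sentence in sentences:
    counts = [0] * len(vocab)
    for word in tokenize(sentence):
        counts[vocab.index(word)] += 1
    vectors.append(counts)

print(vocab)    # ['how', 'are', 'you', 'i', 'am', 'very', 'good', 'what', 'a', 'lovely', 'day']
for v in vectors:
    print(v)    # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] and so on
```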
But, easy as it looks, there are many drawbacks to BoW.
Drawbacks:
- All words are given the same importance.
- Filler words such as “a”, “am”, and “the” inflate the vocabulary, so the vectors become large and dominated by words that carry little meaning.
- Rare but informative words get no extra weight.
That’s where TF-IDF comes in: it adds some ‘common sense’ to word importance.
Term Frequency - Inverse Document Frequency (TF-IDF)
This technique of vectorization measures the importance of different words in a document. To understand it, let’s break down each term from the name itself.
TF = word frequency in one document
DF = how many documents contain the word
IDF = how rare a word is across all documents
TF-IDF = TF × IDF
Term Frequency (TF)
t -> term/word
d -> document/page
N -> total number of documents/pages
tf(t, d) = (count of t in d) / (total number of words in d)
This reflects how frequently a word appears within a single document.
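For example, the sentence “I am very good.” has 4 words and “good” appears once, so tf(good, d) = 1/4 = 0.25.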
Document Frequency (DF)
df(t) = number of documents in which t occurs
This reflects the number of documents/pages in which the term/word appears.
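Treating each of the three example sentences as a document, “good” appears in only one of them, so df(good) = 1.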
Inverse Document Frequency (IDF)
The main drawback of term frequency is that it treats all words as equally important, even though very common words carry little information.
The most frequent words, such as “a”, “an”, and “the”, should be weighed down, while rare words should be scaled up.
idf(t) = N / df(t)
Problems with the above formula:
- If N becomes very large, N/df explodes, so we dampen it with a logarithm.
- If df = 0, we end up dividing by zero, so 1 is added to the denominator.
The updated formula for idf becomes: idf(t) = log(N / (df(t) + 1))
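With the three example sentences as documents (N = 3), the smoothed idf of “good” is log(3 / (1 + 1)) = log(1.5) ≈ 0.41 using the natural log.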
TF-IDF
Incorporating all of the above, we get: tf-idf(t, d) = tf(t, d) × log(N / (df(t) + 1))
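As a sanity check, here is a small sketch that implements exactly these formulas in Python (natural log, df + 1 smoothing). Note that libraries such as scikit-learn use slightly different smoothing and normalization, so their numbers will not match this sketch exactly.

```python
import math
import re

documents = ["How are you?", "I am very good.", "What a lovely day."]

def tokenize(text):
    # Naive tokenizer: lowercase, keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocab = sorted({word for doc in tokenized for word in doc})
N = len(documents)

# df(t): number of documents that contain the term t
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # tf(t, d)
    idf = math.log(N / (df[term] + 1))              # log(N / (df + 1)), natural log
    return tf * idf

# One row per document, one column per vocabulary word
matrix = [[round(tf_idf(t, doc), 3) for t in vocab] for doc in tokenized]
for row in matrix:
    print(row)
```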
Here’s a side-by-side view to make the differences clear.
| BoW | TF-IDF |
|---|---|
| Treats all words equally | Adjusts word importance based on word frequency and rarity across documents |
| Stopwords clutter the representation and add noise | Dilutes the importance of stopwords |
| Document length affects word frequency | Normalizes the effect of document length |
| Simple to implement, but creates a high-dimensional sparse vector | More complex, provides a more informative representation |
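In practice you would rarely implement either vectorizer by hand. scikit-learn ships CountVectorizer (BoW) and TfidfVectorizer (TF-IDF); here is a quick sketch, keeping in mind that TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its values differ a little from the hand-derived formula above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ["How are you?", "I am very good.", "What a lovely day."]

# token_pattern keeps one-letter words like "i" and "a",
# which scikit-learn drops by default
bow = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

print(bow.fit_transform(documents).toarray())              # raw counts, as in BoW
print(tfidf.fit_transform(documents).toarray().round(3))   # weighted, normalized values
print(bow.get_feature_names_out())                         # learned vocabulary (scikit-learn >= 1.0)
```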
BoW and TF-IDF may look old-school compared to embeddings like BERT, but they’re still powerful starting points. Personally, I find that learning these simple methods makes the modern ones much less of a black box.
