Measuring text weight with TF-IDF in Python: plain code and scikit-learn

['python'], ['data science'], ['tf-idf'], ['text mining']

When dealing with text data, we want to measure how important a word is to a document within a full text collection. The most intuitive solution is to count how often the word appears: the higher the count, the more important the word. But a raw word count favors long documents/articles; after all, a longer document contains more words.

We need another solution, one that can appropriately measure the importance of a word in the overall context. TF-IDF is one effective answer, and it also functions as a backbone of modern search engines like Google.

The core idea of TF-IDF is to measure not only a word's frequency, but also how informative the word is in the overall context.

For example, words like "is", "the", and "and" appear in almost all documents. TF-IDF lowers the score of those common words and raises the score of the words that really matter.

The TF-IDF Formula

Imagine we have a text database in Python, which includes three documents:

text_db = ['problem of evil',
           'evil queen',
           'horizon problem']

We can use this formula to calculate a word's TF-IDF value in a certain document.

$$ TF_{w,d} \times \log\left(\frac{D_{all}}{D_w}\right) $$

$TF_{w,d}$ represents the Term Frequency of the word in a given document:

$$ TF_{w,d} = \frac{\text{number of times the keyword appears in document } d}{\text{total word count of document } d} $$

while the log part, $ \log(\frac{D_{all}}{D_w}) $, represents the Inverse Document Frequency. The "inverse" here means this part inverts the document frequency, so that frequently used words receive low values.

  • $D_{all}$ is the total number of documents.
  • $D_w$ is the number of documents that contain the keyword.
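The textbook formula above can be sketched directly in plain Python (a minimal sketch; the function name and the whitespace tokenization via `str.split` are my own assumptions):

```python
import math

def tf_idf(word, doc, docs):
    """Textbook TF-IDF: term frequency times inverse document frequency."""
    words = doc.split()
    tf = words.count(word) / len(words)         # TF_{w,d}
    d_all = len(docs)                           # total number of documents
    d_w = sum(word in d.split() for d in docs)  # documents containing the word
    return tf * math.log(d_all / d_w)

docs = ['problem of evil', 'evil queen', 'horizon problem']
print(tf_idf('evil', docs[0], docs))  # 1/3 * log(3/2) ≈ 0.1352
```

Note that a word appearing in every document gets $\log(D_{all}/D_{all}) = 0$, which is exactly how TF-IDF suppresses ubiquitous words.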

Calculate TF-IDF value of keyword 'evil' manually

Say we want to get the TF-IDF value of the keyword evil in document No. 1 ("problem of evil").

It should be easy to see that evil shows up 1 time out of 3 words in total; there are 3 documents in the database, and 2 of them include the keyword evil. So,

  • $TF_{evil,d_1}$ = 1/3
  • $D_{all}$ = 3
  • $D_{evil}$ = 2

Together, we get the result:

$$ TF_{evil,d_1} \times \log\left(\frac{D_{all}}{D_{evil}}\right) = 1/3 \times \log\left(\frac{3}{2}\right) $$

In Python

import math
tf_1_evil       = 1/3   # TF of 'evil' in document 1
D_all           = 3     # total number of documents
D_evil          = 2     # documents that contain 'evil'
tf_idf_evil     = tf_1_evil * math.log(D_all/D_evil)

Print the result:

print(round(tf_idf_evil, 6))
# 0.135155

Calculate TF-IDF by scikit-learn

Scikit-learn provides a convenient way to calculate the TF-IDF matrix quickly.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

text_db = ['problem of evil',
           'evil queen',
           'horizon problem']
vec = TfidfVectorizer()
tf_idf = vec.fit_transform(text_db)
print(pd.DataFrame(tf_idf.toarray(), columns=vec.get_feature_names_out()))

The result:

       evil   horizon        of   problem     queen
0  0.517856  0.000000  0.680919  0.517856  0.000000
1  0.605349  0.000000  0.000000  0.000000  0.795961
2  0.000000  0.795961  0.000000  0.605349  0.000000
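Notice that every row of this matrix has Euclidean length 1: scikit-learn normalizes each document vector. We can verify this with numpy (a quick sanity check; `np.linalg.norm` with `axis=1` computes the L2 norm of each row):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_db = ['problem of evil', 'evil queen', 'horizon problem']
matrix = TfidfVectorizer().fit_transform(text_db).toarray()

# each document row is a unit vector under the L2 norm
print(np.linalg.norm(matrix, axis=1))  # [1. 1. 1.]
```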

Wait, you may ask: Andrew, are you kidding me? The evil TF-IDF value for document 1 (shown as index 0) is 0.517856, not 1/3 × log(3/2). What is wrong here?

The differences in scikit-learn's TfidfVectorizer implementation

There are two differences in the TfidfVectorizer implementation of scikit-learn, which make the result differ from the formula above, the one that appears in most textbooks and that your professor taught you.

First, sklearn uses a different version of the IDF formula: it adds 1 to both the numerator and the denominator (to avoid a divide-by-zero scenario) and adds 1 to the log result. TF remains the same. $ \log(\frac{D_{all}+1}{D_w+1})+1 $ Second, sklearn normalizes the TF-IDF result by the Euclidean norm at the document level.

$$ v_{i,norm} = \frac{v_i}{\sqrt{v_{w_1,d_j}^2+v_{w_2,d_j}^2+\dots+v_{w_n,d_j}^2}} $$

In the case of calculating the evil value in the first document ('problem of evil'), the formula is:

$$ v_{evil,norm} = \frac{v_{evil}}{\sqrt{v_{evil,d_1}^2+v_{of,d_1}^2+v_{problem,d_1}^2}} $$

Now, let's reshape the Python code to align with the two changes above:

import math
tf_1_problem    = 1/3
tf_1_of         = 1/3
tf_1_evil       = 1/3
D_all           = 3
d_problem       = 2   # documents containing 'problem'
d_of            = 1   # documents containing 'of'
d_evil          = 2   # documents containing 'evil'
# sklearn-style smoothed IDF: log((D_all+1)/(D_w+1)) + 1
tf_idf_problem  = tf_1_problem * (math.log((D_all+1)/(d_problem+1))+1)
tf_idf_of       = tf_1_of * (math.log((D_all+1)/(d_of+1))+1)
tf_idf_evil     = tf_1_evil * (math.log((D_all+1)/(d_evil+1))+1)
# Euclidean norm over all words in document 1
denominator = math.sqrt(tf_idf_problem**2 + tf_idf_of**2 + tf_idf_evil**2)
result = tf_idf_evil/denominator
print("evil result:", round(result, 6))

The TF-IDF value for evil is exactly the same as the one produced by sklearn.

evil result: 0.517856
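The same recipe extends to every word and every document. As a final sanity check, here is a sketch that rebuilds the full matrix in plain Python and can be compared against the TfidfVectorizer output above (variable names are my own; the vocabulary is sorted alphabetically, which is also how sklearn orders its columns):

```python
import math

text_db = ['problem of evil', 'evil queen', 'horizon problem']
docs = [d.split() for d in text_db]
vocab = sorted({w for d in docs for w in d})

matrix = []
for d in docs:
    # smoothed TF-IDF, sklearn style: tf * (log((N+1)/(df+1)) + 1)
    row = [d.count(w) / len(d) *
           (math.log((len(docs) + 1) / (sum(w in doc for doc in docs) + 1)) + 1)
           for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))  # Euclidean norm per document
    matrix.append([v / norm for v in row])

print(vocab)
print([round(v, 6) for v in matrix[0]])  # matches the sklearn row for document 0
```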

If this is helpful to you, please also help me destroy the clap button. Feel free to comment and correct me if you see anything incorrect. Thanks for reading.

Links and books

Karen Spärck Jones first brought out the idea behind TF-IDF in 1972.

Jake touched on TF-IDF briefly in the Feature Engineering chapter. He doesn't drill down too much into TF-IDF usage, but he provides the best Python code for calculating TF-IDF values using scikit-learn. The sample text database with 3 documents used in this article is from his book.

  • 数学之美 (The Beauty of Mathematics) by Wu Jun

This book is written in Chinese. Dr. Wu Jun is a former Google researcher and former VP of Tencent. The book makes a great introduction to the TF-IDF algorithm.

Thanks to Sivakar. This article showed the differences between the TF-IDF implementation in scikit-learn and the traditional textbook formula.