# Measure Text Weight Using TF-IDF in Plain Python and scikit-learn

Tags: python, data science, tf-idf, text mining

When dealing with text data, we often want to measure how important a word is to a document within a full text collection. The most intuitive solution would be to count how many times the word appears: the higher the count, the more important the word. But simply counting words favors long documents. After all, a longer document naturally contains more words.
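To see this length bias concretely, here is a quick illustration (the two sample strings are made up for demonstration):

```python
short_doc = "evil queen"
long_doc = "the evil queen and the evil king ruled the evil empire " * 10

# Raw counts favor the longer document, regardless of how relevant it is
print(short_doc.split().count("evil"))  # 1
print(long_doc.split().count("evil"))   # 30
```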

We need another solution, one that appropriately measures the importance of a word in the overall context. **TF-IDF** is one of the most effective, and it also serves as a backbone of modern search engines like Google.

The core idea of TF-IDF is to measure not only how frequently a word occurs, but also how much the word matters in the overall context.

For example, words like "is", "the", and "and" appear in almost all documents. TF-IDF lowers the rating of those common words and raises the rating of the words that really matter.

## The TF-IDF Formula

Imagine we have a *huge* text database in Python, which includes three documents:

```
text_db = ['problem of evil',
           'evil queen',
           'horizon problem']
```

We can use this formula to calculate a word's TF-IDF value in a certain document.

$
TFIDF_{w,d} = TF_{w,d} \times log(\frac{D_{all}}{D_w})
$

$TF_{w,d}$ is the **T**erm **F**requency of the **w**ord in a certain **d**ocument: the number of times the word appears in the document, divided by the document's total word count.

The log part:
$
log(\frac{D_{all}}{D_w})
$
represents the **I**nverse **D**ocument **F**requency. The **inverse** here indicates that this part inverts the document frequency, so that frequently used words get low values.

$D_{all}$ is the total number of documents. $D_w$ is the number of documents that include the keyword.
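The formula above can be turned into a small helper function (a minimal sketch; the function name and the whitespace-based word splitting are my own simplifications):

```python
import math

def tf_idf(word, document, corpus):
    # Term Frequency: occurrences of the word / total words in the document
    words = document.split()
    tf = words.count(word) / len(words)
    # Inverse Document Frequency: log(total docs / docs containing the word)
    d_all = len(corpus)
    d_w = sum(word in doc.split() for doc in corpus)
    return tf * math.log(d_all / d_w)

text_db = ['problem of evil', 'evil queen', 'horizon problem']
print(round(tf_idf('evil', text_db[0], text_db), 3))  # 0.135
```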

## Calculate TF-IDF value of keyword 'evil' manually

Say we want to get the TF-IDF value for the keyword **evil** in document No. 1 ("problem of evil").

It should be easy to see that **evil** shows up 1 time and there are 3 words in total; there are 3 documents in the database, and 2 of them include the **evil** keyword.
So,

$TF_{evil,d1} = 1/3$, $D_{all} = 3$, $D_{evil} = 2$

Together, we get the result:

$
TFIDF_{evil,d1} = \frac{1}{3} \times log(\frac{3}{2}) \approx 0.135
$

In Python:

```
import math
tf_1_evil = 1/3
D_all = 3
D_evil = 2
tf_idf_evil = tf_1_evil * math.log(D_all/D_evil)
print(tf_idf_evil)
```

Printing the result:

```
0.13515503603605478
```

## Calculate TF-IDF with scikit-learn

Scikit-learn provides a convenient way to calculate the whole TF-IDF matrix quickly.

```
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

text_db = ['problem of evil',
           'evil queen',
           'horizon problem']
vec = TfidfVectorizer()
tf_idf = vec.fit_transform(text_db)
print(pd.DataFrame(tf_idf.toarray(), columns=vec.get_feature_names_out()))
```

The result:

```
       evil   horizon        of   problem     queen
0  0.517856  0.000000  0.680919  0.517856  0.000000
1  0.605349  0.000000  0.000000  0.000000  0.795961
2  0.000000  0.795961  0.000000  0.605349  0.000000
```

Wait, you may ask: Andrew, are you kidding me? The **evil** TF-IDF value for document 1 (index 0 in the output) is `0.517856`, not the `0.135` we calculated by hand. What is wrong here?

## How the scikit-learn `TfidfVectorizer` implementation differs

There are two differences in scikit-learn's implementation of `TfidfVectorizer` that make its result differ from the formula above, the one that appears in most textbooks and that your professor taught you.

First, sklearn uses a different version of the **IDF** formula: it adds **1** to both the numerator and the denominator, to avoid the divide-by-zero scenario, and then adds **1** to the logarithm's result. **TF** remains the same.
$
log(\frac{D_{all}+1}{D_w+1}) + 1
$
Second, sklearn normalizes the TF-IDF values of each document with the Euclidean (L2) norm.

In the case of calculating the **evil** value in the first document ('problem of evil'), the formula is:

$
TFIDF_{evil,d1} = \frac{TF_{evil,d1} \times (log(\frac{D_{all}+1}{D_{evil}+1}) + 1)}{\sqrt{\sum_{w \in d1} (TF_{w,d1} \times (log(\frac{D_{all}+1}{D_w+1}) + 1))^2}}
$

Now, let's reshape the Python code to align with the two changes above:

```
import math

tf_1_problem = 1/3   # term frequency of each word in 'problem of evil'
tf_1_of = 1/3
tf_1_evil = 1/3
D_all = 3            # total number of documents
d_problem = 2        # number of documents containing each word
d_of = 1
d_evil = 2

# sklearn-style smoothed IDF: log((D_all+1)/(D_w+1)) + 1
tf_idf_problem = tf_1_problem * (math.log((D_all+1)/(d_problem+1))+1)
tf_idf_of = tf_1_of * (math.log((D_all+1)/(d_of+1))+1)
tf_idf_evil = tf_1_evil * (math.log((D_all+1)/(d_evil+1))+1)

# Euclidean (L2) norm over all words in the document
denominator = math.sqrt(tf_idf_problem**2 + tf_idf_of**2 + tf_idf_evil**2)
result = tf_idf_evil/denominator
print("evil result:", round(result, 6))
```

The TF-IDF value for evil is **exactly the same** as the one produced by sklearn.

```
evil result: 0.517856
```
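Going one step further, the same logic can replicate the entire sklearn matrix from scratch (a sketch; the helper name `sklearn_style_tfidf` is my own):

```python
import math
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_db = ['problem of evil', 'evil queen', 'horizon problem']

def sklearn_style_tfidf(corpus):
    # Alphabetical vocabulary, matching TfidfVectorizer's feature ordering
    vocab = sorted({w for doc in corpus for w in doc.split()})
    n = len(corpus)
    df = {w: sum(w in doc.split() for doc in corpus) for w in vocab}
    rows = []
    for doc in corpus:
        words = doc.split()
        # TF times smoothed IDF: log((n+1)/(df+1)) + 1
        row = [words.count(w) / len(words) * (math.log((n + 1) / (df[w] + 1)) + 1)
               for w in vocab]
        # L2-normalize the document's vector
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return np.array(rows)

manual = sklearn_style_tfidf(text_db)
sklearn_result = TfidfVectorizer().fit_transform(text_db).toarray()
print(np.allclose(manual, sklearn_result))  # True
```

Note that sklearn internally uses raw term counts rather than frequencies, but since each document's vector is L2-normalized afterward, dividing by the document length makes no difference to the final values.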

If this is helpful to you, please also help me destroy the clap button. Feel free to comment and correct me if you see anything incorrect. Thanks for reading.

## Links and books

- *A statistical interpretation of term specificity and its application in retrieval* by Karen Spärck Jones

Spärck Jones first brought out the idea behind TF-IDF in 1972.

- Python Data Science Handbook - Feature Engineering by Jake VanderPlas

https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html

Jake touches on TF-IDF briefly in the Feature Engineering chapter. He doesn't drill down into TF-IDF usage too much, but he provides excellent Python code for calculating TF-IDF values with scikit-learn. The sample text database with 3 documents used in this article comes from this book.

- 数学之美 (*The Beauty of Mathematics*) by Wu Jun https://book.douban.com/subject/10750155/

This book is written in Chinese. Dr. Wu Jun is a former Google researcher and former VP of Tencent. The book makes a great introduction to the TF-IDF algorithm.

- “Sklearn’s TF-IDF” vs “Standard TF-IDF” by Sivakar Sivarajah

https://towardsdatascience.com/how-sklearns-tf-idf-is-different-from-the-standard-tf-idf-275fa582e73d

Thanks to Sivakar. This article shows how the TF-IDF implementation in scikit-learn differs from the traditional textbook version.