TF-IDF: Term frequency-inverse document frequency


TF-IDF, or term frequency-inverse document frequency, is a statistical measure that evaluates how relevant a word is to a document, relative to a corpus of documents.

It is often used in Natural Language Processing and Information Retrieval.

What is TF-IDF?

To understand TF-IDF (term frequency-inverse document frequency) and how it evaluates word relevancy, we need to understand its individual parts.




Term Frequency

Term frequency is the number of times a keyword appears in a document, divided by the total number of words in that document.
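As a minimal sketch of that definition, the snippet below counts a word and divides by the document length. The function name `term_frequency` and the naive whitespace tokenizer are assumptions made for illustration; real tokenizers also handle punctuation.

```python
from collections import Counter

def term_frequency(term, document):
    """Share of the words in `document` that are `term` (case-insensitive)."""
    words = document.lower().split()  # naive whitespace tokenizer (assumption)
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)

doc = "the cat sat on the mat"
print(term_frequency("the", doc))  # 2 occurrences out of 6 words
print(term_frequency("cat", doc))  # 1 occurrence out of 6 words
```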

Document Frequency

Document frequency is the number of documents in which the keyword appears.
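Document frequency can be sketched the same way. The helper `document_frequency` and the three-sentence corpus are hypothetical and only there to make the count visible.

```python
def document_frequency(term, documents):
    """Number of documents that contain `term` at least once."""
    term = term.lower()
    return sum(1 for doc in documents if term in doc.lower().split())

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dimensionality reduction simplifies the data",
]
print(document_frequency("the", corpus))             # present in all 3 documents
print(document_frequency("cat", corpus))             # present in 2 documents
print(document_frequency("dimensionality", corpus))  # present in 1 document
```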

Inverse Document Frequency

Inverse document frequency (IDF) is a dampening factor that reduces the importance of terms that are common to many documents.

Let’s look at the importance of certain terms for an article on unsupervised machine learning.

Words like “about” or “but” will occur a lot in the article. Does that mean the article is about these function words? Surely not. This is what IDF is used for.

When a keyword appears in a large number of documents (e.g. “a”, “the”, “they”), it is considered less important to a specific document.

When a keyword rarely occurs across the corpus (e.g. a term like “dimensionality reduction”), it is considered more relevant to the documents in which it appears.
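To see the dampening effect, here is a small sketch that assumes the plain, unsmoothed form of IDF: the natural log of the number of documents divided by the document frequency. The function name, the toy corpus and the choice to return 0 for unseen terms are all assumptions.

```python
import math

def inverse_document_frequency(term, documents):
    """log(N / df): close to 0 for ubiquitous terms, larger for rare ones."""
    term = term.lower()
    df = sum(1 for doc in documents if term in doc.lower().split())
    if df == 0:
        return 0.0  # assumption: unseen terms get 0 instead of dividing by zero
    return math.log(len(documents) / df)

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dimensionality reduction simplifies the data",
]
# "the" is in every document, so its weight collapses to log(1) = 0
print(inverse_document_frequency("the", corpus))
# "dimensionality" is in a single document, so it gets the largest weight, log(3)
print(inverse_document_frequency("dimensionality", corpus))
```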

How is TF-IDF Calculated?

TF-IDF is essentially the multiplication of the Term Frequency (TF) and the inverse document frequency (IDF).
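Written out, one common textbook form of that product looks like the following. The symbols are notation chosen here, not taken from the article: f_{t,d} is the count of term t in document d, N is the number of documents, and n_t is the number of documents containing t.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```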

In the example below:

  • we have 3 documents
  • only 2 contain a certain keyword.
  • the keyword appears 4 times in a 100-word document.

TF-IDF will be equal to the keyword count (4) divided by the number of words in the document (100), multiplied by the IDF dampening factor. That factor is the logarithm of the number of documents (3) divided by the number of documents containing the keyword (2).
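The arithmetic for this example can be checked in a few lines. The article does not say which logarithm base is used, so this sketch assumes the natural log; the variable names are illustrative.

```python
import math

keyword_count = 4            # times the keyword appears in the document
words_in_document = 100      # total words in the document
total_documents = 3          # documents in the corpus
documents_with_keyword = 2   # documents that contain the keyword

tf = keyword_count / words_in_document
idf = math.log(total_documents / documents_with_keyword)  # natural log (assumption)
tf_idf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.4f}, TF-IDF = {tf_idf:.4f}")
# TF = 0.04, IDF = 0.4055, TF-IDF = 0.0162
```

Note that scikit-learn's TfidfVectorizer will not reproduce this exact number with its default settings: it uses raw counts for TF, a smoothed IDF, and L2-normalizes each document's row.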

TF-IDF with Python

Here we will use TfidfVectorizer to apply TF-IDF to Wikipedia articles.

The first step is to query the Wikipedia API.

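import requests

# Wikipedia API endpoint (the same for every request)
url = 'https://en.wikipedia.org/w/api.php'

articles = []
wiki_pages = [
    'Machine learning',
    'Artificial intelligence',
    'Neural networks'
    ]

for subject in wiki_pages:
    params = {
        'action': 'query',
        'format': 'json',
        'titles': subject,
        'prop': 'extracts',
        'exintro': True,      # only the introduction of the page
        'explaintext': True,  # plain text instead of HTML
        'redirects': 1,       # follow redirects to the target page
    }

    response = requests.get(url, params=params)
    response.raise_for_status()  # fail early on HTTP errors
    data = response.json()

    # Results are keyed by page id; take the single page returned
    page = next(iter(data['query']['pages'].values()))
    articles.append(page['extract'])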

The next step is to build the dataframe from the list of articles stored in the articles variable.

There are two main ways to build a TF-IDF word frequency DataFrame with scikit-learn:

  • Using TfidfVectorizer
  • Using CountVectorizer followed by TfidfTransformer

Either of these solutions will create a sparse matrix that can be used to create the pandas DataFrame.
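As a quick sketch of the second route, the snippet below runs CountVectorizer followed by TfidfTransformer and compares the result with TfidfVectorizer. The three short documents are made up so the example runs without the Wikipedia API.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus (assumption), standing in for the Wikipedia extracts
docs = [
    "machine learning builds models from data",
    "artificial intelligence includes machine learning",
    "neural networks are models inspired by the brain",
]

# Route 2: raw term counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 1: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two sparse matrices should hold the same values
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step route is mostly useful when the raw counts are needed as well, for example to inspect them before weighting.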

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer
vec = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then transform the articles
tf_idf = vec.fit_transform(articles)

# Print the tf-idf sparse matrix
print(tf_idf)

Printing the sparse matrix lists only its non-zero entries. Each entry is a (row, column) tuple, where the row is the document index and the column is the term index, followed by the TF-IDF score of that term in that document.

From the sparse matrix, we build the word frequency dataframe.

df = pd.DataFrame(
    tf_idf.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=vec.get_feature_names_out(),
    index=wiki_pages
    )
df.head()

How to Use TF-IDF on real world data?


Conclusion

This concludes the article on how to estimate the relevancy of a word using the term frequency-inverse document frequency (TF-IDF).
