What is Google’s Lexicon Generator (and How it Works)

In this tutorial, you will learn everything about the Lexicon Generator used in information retrieval (e.g. Google Search Engine). I will explain what the Lexicon Generator is, how it works, and how Google may use it in their infrastructure to provide search results.

This tutorial is part of a series on learning information retrieval and learning SEO using Google patents, specifically related to the article on “Multi-stage query processing system and method for use with tokenspace repository“.

Multi-stage query processing system and method for use with tokenspace repository

What is Google’s Lexicon Generator?

Google’s lexicon generator, also known as the lexicon builder, is the software that generates the lexicon mappings encoding a set of parsed documents. Simply put, it creates the master and child inverted indexes to be added in Google’s Index.


Subscribe to my Newsletter


Source: Towards a distributed web search engine

The Lexicon generator is composed of three different builders:

  • Global-lexicon builder: Software that generates mappings of all unique tokens and their global token identifier in a set of document
  • Mini-Lexicon builder: Software that generates mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents.
  • Region-Lexicon builder: Software that generates mappings for encoding and decoding the document repository

Simply put, the global lexicon is a dictionary, and the mini-lexicon is a sub-dictionary mapped to the global dictionary.

How Google’s Lexicon Generator Work?

Google’s lexicon generator works by generating dictionaries from tokens found in documents inside the document repository.

First, it generates a general (e.g. global) dictionary of all the tokens.

Then, for efficiency, it splits the global dictionary (global lexicon) in sub-dictionaries (mini-lexicon), and create mappings (lexicon mappings) to map back the mini lexicon to the global lexicon.

After, it sends the mappings to encoding systems to compress the data further and store the encode tokens into the tokenspace repository.

An alternative technique may be to separate the global lexicon into region-lexicons instead of mini-lexicons.

The mini-lexicon can be used to generate snippets related to the query. For example, if tokens are grouped by their location on the page, then they can be used to find the locations where to start generating the snippet from.

Global Lexicon Builder

The global-lexicon builder is the software that generates the global-lexicon. It does so by retrieving documents from the document repository and assigning unique global token identifiers to each unique tokens contained in the documents.

The builder may split documents into portions, and global lexicons generated for each portions.

The result that the Global-lexicon builder tries to produce is a sorted list of unique tokens assigned to Global unique token IDs.

Example of Global-lexicon builder
Example of Global-lexicon builder

Document Identifier Reordering

In a process called document reordering, or document ID reassignment, Google will sort documents before building the lexicon so that documents with similar words are closer to each other.

They can be sorted by:

  • Language
  • Domain name
  • URL Path

This way, documents with a similar language are grouped with each other, documents from each website grouped together and document from same website and similar branches of the website are also grouped to each other.

The idea is that by sorting pages from the same language and same website, you can end-up with areas where consecutive document IDs are textually similar to each other.

This process helps to decrease the index size, increase the speed of conjunctive queries and facilitate compression.

For more information, read Document Reordering for Faster Intersection or Inverted Index Compression and Query Processing with Optimized Document Ordering.

Clustering Similar Documents

Similarly, documents may also be clustered by terms, words or phrases to group them by related concepts.

Again, the idea is to assign nearby document IDs to documents that are similar to each other (e.g. present in the same cluster).

Storing Data

The global-lexicon builder will store data such as:

  • Unique tokens
  • Unique token IDs
  • Number of occurrences of each token in the set of documents
  • Language associated with the token

Sorting Tokens

The global-lexicon builder with then sort the list of unique tokens using the frequency of occurrence of the tokens within the set of documents. Tokens may also be grouped and sorted to save bandwidth.

Special Tokens

Some tokens will occur more frequently than the average token. This is the case of punctuation, or HTML tags for example. These special tokens will all be grouped together in the global lexicon using the same prefix.

Partitioning the Index

To improve the efficiency of storage and retrieval, the document repository can be divided in partitions. In general, document-based index partitioning will be used for simplicity and cost.

By logically partitioning similar documents, a global lexicon will be generated for each partition.

document-based index partitioning
Source: Ricardo Baeza-Yates & B. Barla Cambazoglu, Yahoo Labs

Mini-Lexicon Builder

The mini-lexicon builder is the software that generates the mini-lexicon used for storing ranges of positions in documents and allow storage to occupy less space.

This is used to reduce the cost of storage be reusing IDs from one sub-directory to the other.

Mini-lexicon builder
Mini-lexicon builder

The mini-lexicon allows Google to store each fixed length sub-dictionaries (mini-lexicon) instead of the variable length dictionaries (global-lexicon). They are also smaller in bytes (1 byte) than the Global-lexicon (4 bytes), thus reducing the number of bytes per token.

Min-lexicon are generally reserved for the most popular tokens in the set of documents.

The mini-lexicon maps back to the Global Lexicon.

Region-Lexicon Builder

The region lexicon builder is the software that receives the inverted index (e.g. tokenspace repository) and split it into regions, each region targeting a set of similar tokens.

It enables faster posting list intersection using skip pointers. It is a technique to store lexicons by region and use skip pointers to avoid processing parts of the posting list that will not show in the search results.

Which Patents Mentions the Lexicon Generator?

Other Possible Names for the Lexicon Generator?

  • Lexicon Builder

Google Parent Infrastructure Involved

Where does the Query Processing System falls into?

Google Children Infrastructure Involved

The lexicon generator can be split-up into two different lexicon builders.

  • Lexicon Generator
    • Global-Lexicon builder
    • Mini-Lexicon builder

Definitions

Patent termDefinition
Lexicon GeneratorSoftware that generates the lexicon mappings encoding a set of parsed documents
Global-Lexicon BuilderSoftware that generates the global-lexicon
Mini-Lexicon BuilderSoftware that generates the mini-lexicon
Region-lexicon builderSoftware that generates the region-lexicon
Global LexiconData Store for the mappings of all unique tokens and their global token identifier in a set of document
Mini-Lexicon Data store of sequences of mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents.
Region LexiconData store of sequences of mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents.
Tokenspace repositoryTokenized collection of documents
TokenAny object found in a document (terms, phrases, punctuations, HTML tags).

Enjoyed This Post?