In this tutorial, you will learn everything about the Lexicon Generator used in information retrieval (e.g. Google Search Engine). I will explain what the Lexicon Generator is, how it works, and how Google may use it in their infrastructure to provide search results.
This tutorial is part of a series on learning information retrieval and learning SEO using Google patents, specifically related to the article on “Multi-stage query processing system and method for use with tokenspace repository“.
What is Google’s Lexicon Generator?
Google’s lexicon generator, also known as the lexicon builder, is the software that generates the lexicon mappings encoding a set of parsed documents. Simply put, it creates the master and child inverted indexes to be added in Google’s Index.
The Lexicon generator is composed of three different builders:
- Global-lexicon builder: Software that generates mappings of all unique tokens and their global token identifier in a set of document
- Mini-Lexicon builder: Software that generates mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents.
- Region-Lexicon builder: Software that generates mappings for encoding and decoding the document repository
Simply put, the global lexicon is a dictionary, and the mini-lexicon is a sub-dictionary mapped to the global dictionary.
How Google’s Lexicon Generator Work?
Google’s lexicon generator works by generating dictionaries from tokens found in documents inside the document repository.
First, it generates a general (e.g. global) dictionary of all the tokens.
Then, for efficiency, it splits the global dictionary (global lexicon) in sub-dictionaries (mini-lexicon), and create mappings (lexicon mappings) to map back the mini lexicon to the global lexicon.
After, it sends the mappings to encoding systems to compress the data further and store the encode tokens into the tokenspace repository.
An alternative technique may be to separate the global lexicon into region-lexicons instead of mini-lexicons.
The mini-lexicon can be used to generate snippets related to the query. For example, if tokens are grouped by their location on the page, then they can be used to find the locations where to start generating the snippet from.
Global Lexicon Builder
The global-lexicon builder is the software that generates the global-lexicon. It does so by retrieving documents from the document repository and assigning unique global token identifiers to each unique tokens contained in the documents.
The builder may split documents into portions, and global lexicons generated for each portions.
The result that the Global-lexicon builder tries to produce is a sorted list of unique tokens assigned to Global unique token IDs.
Document Identifier Reordering
In a process called document reordering, or document ID reassignment, Google will sort documents before building the lexicon so that documents with similar words are closer to each other.
They can be sorted by:
- Language
- Domain name
- URL Path
This way, documents with a similar language are grouped with each other, documents from each website grouped together and document from same website and similar branches of the website are also grouped to each other.
The idea is that by sorting pages from the same language and same website, you can end-up with areas where consecutive document IDs are textually similar to each other.
This process helps to decrease the index size, increase the speed of conjunctive queries and facilitate compression.
For more information, read Document Reordering for Faster Intersection or Inverted Index Compression and Query Processing with Optimized Document Ordering.
Clustering Similar Documents
Similarly, documents may also be clustered by terms, words or phrases to group them by related concepts.
Again, the idea is to assign nearby document IDs to documents that are similar to each other (e.g. present in the same cluster).
Storing Data
The global-lexicon builder will store data such as:
- Unique tokens
- Unique token IDs
- Number of occurrences of each token in the set of documents
- Language associated with the token
Sorting Tokens
The global-lexicon builder with then sort the list of unique tokens using the frequency of occurrence of the tokens within the set of documents. Tokens may also be grouped and sorted to save bandwidth.
Special Tokens
Some tokens will occur more frequently than the average token. This is the case of punctuation, or HTML tags for example. These special tokens will all be grouped together in the global lexicon using the same prefix.
Partitioning the Index
To improve the efficiency of storage and retrieval, the document repository can be divided in partitions. In general, document-based index partitioning will be used for simplicity and cost.
By logically partitioning similar documents, a global lexicon will be generated for each partition.
Mini-Lexicon Builder
The mini-lexicon builder is the software that generates the mini-lexicon used for storing ranges of positions in documents and allow storage to occupy less space.
This is used to reduce the cost of storage be reusing IDs from one sub-directory to the other.
The mini-lexicon allows Google to store each fixed length sub-dictionaries (mini-lexicon) instead of the variable length dictionaries (global-lexicon). They are also smaller in bytes (1 byte) than the Global-lexicon (4 bytes), thus reducing the number of bytes per token.
Min-lexicon are generally reserved for the most popular tokens in the set of documents.
The mini-lexicon maps back to the Global Lexicon.
Region-Lexicon Builder
The region lexicon builder is the software that receives the inverted index (e.g. tokenspace repository) and split it into regions, each region targeting a set of similar tokens.
It enables faster posting list intersection using skip pointers. It is a technique to store lexicons by region and use skip pointers to avoid processing parts of the posting list that will not show in the search results.
Which Patents Mentions the Lexicon Generator?
- Systems and methods for generating statistics from search engine query logs
- Multi-stage query processing system and method for use with tokenspace repository
- Generating content snippets using a tokenspace repository
Other Possible Names for the Lexicon Generator?
- Lexicon Builder
Google Parent Infrastructure Involved
Where does the Query Processing System falls into?
- Information Retrieval System
- Document Processing System
- Lexicon Generator
- Document Processing System
Google Children Infrastructure Involved
The lexicon generator can be split-up into two different lexicon builders.
- Lexicon Generator
- Global-Lexicon builder
- Mini-Lexicon builder
Definitions
Patent term | Definition |
---|---|
Lexicon Generator | Software that generates the lexicon mappings encoding a set of parsed documents |
Global-Lexicon Builder | Software that generates the global-lexicon |
Mini-Lexicon Builder | Software that generates the mini-lexicon |
Region-lexicon builder | Software that generates the region-lexicon |
Global Lexicon | Data Store for the mappings of all unique tokens and their global token identifier in a set of document |
Mini-Lexicon | Data store of sequences of mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents. |
Region Lexicon | Data store of sequences of mappings of unique tokens and their global token identifier used for encoding and decoding specific range of positions in documents. |
Tokenspace repository | Tokenized collection of documents |
Token | Any object found in a document (terms, phrases, punctuations, HTML tags). |
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.