What is Google’s Document Processing System (and How it Works)

In this tutorial, you will learn everything about the document processing system used in information retrieval (e.g. Google Search Engine). I will explain what the Document Processing System is, how it works, and how Google may use it in their infrastructure to provide search results.

This tutorial is part of a series on learning information retrieval and SEO using Google patents. It relates specifically to the article on “Multi-stage query processing system and method for use with tokenspace repository”.

Example document processing system at Google

What is Google’s Document Processing System?

At Google, a document processing system is any computer system used to receive and process documents.


The document processing system is the system that prepares the inverted index from documents, pages or sites retrieved from the document corpus.

Document Processing System in the Information Retrieval System

In Google’s information retrieval system, the document processing system includes the systems used to crawl, parse, cluster and store documents, or portions of documents, in a document repository so that they can be processed by the query processing system.

It is likely located inside the Indexing system.

The document processing system here

  • builds the document inverted index and the tokenspace repository.
  • can be connected to various networks, such as the Internet and Intranets.

Various Configurations of Document Processing Systems

Document processing systems may be implemented in different ways in Google infrastructure. For instance, depending on the task, a document processing system may or may not include crawling, clustering or indexing modules.

Different Document Processing Systems at Google

There is more than one document processing system, because document processing systems are used for various additional tasks, such as identifying semantics from image documents, generating classification data from a document, or extracting facts from documents. Each document processing system thus has its own configuration.

Document Processing System in the Web Crawler System

Document processing in the web crawler system is done by the content processing servers.

For the rest of this tutorial, we will focus on the document processing system used inside the information retrieval system as described in the “Multi-stage query processing system and method for use with tokenspace repository” patent.

How Does Google’s Document Processing System Work?

The document processing system used in information retrieval works by crawling documents, then parsing the documents into tokens and finally storing tokens into lexicons by assigning token IDs to each unique token.

Parsing the Documents

The goal of the document processing system is to turn each document of a document corpus into a list of tokens and then use linguistic preprocessing to normalize them into indexing terms (e.g. tokens).

A token is any object found in a document. Tokens can be terms, phrases, punctuation marks or even HTML tags.

After parsing, a set of documents is represented as a sequence of tokens, along with their position in the document.

In “Indexing the World Wide Web”, Google engineers mention some of the linguistic text processing that is generally done on documents, such as:

  • Stripping HTML tags
  • Tokenization
  • Stopword removal
  • Case Conversion
  • Stemming
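The steps listed above can be sketched as a small pipeline. This is a minimal illustration, not Google's implementation: the function names (`strip_html`, `naive_stem`, `preprocess`), the tiny stopword list and the crude suffix-stripping stemmer are all assumptions made for the example.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(html):
    """Run the listed steps in order and return the normalized tokens."""
    text = strip_html(html)                               # stripping HTML tags
    tokens = re.findall(r"[A-Za-z0-9]+", text)            # tokenization
    tokens = [t.lower() for t in tokens]                  # case conversion
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [naive_stem(t) for t in tokens]                # stemming

print(preprocess("<p>The crawler is Parsing documents</p>"))
```

Note that this sketch discards HTML tags and punctuation, whereas a system that treats them as tokens would keep them.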

Other patents hint that these two additional document processing steps may be added to the funnel.

  • Generation of descriptive information (see: document locator)
  • Attribute labelling

At this point, the tokenization may be done with MapReduce, a programming model for distributing work across many machines.
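To give an idea of what a MapReduce-style job looks like, here is a single-process sketch that mimics the map, shuffle and reduce phases to collect token occurrences. The function names and the whitespace tokenizer are assumptions for illustration; a real job would run these phases in parallel on many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (token, (doc_id, position)) pair per token."""
    for position, token in enumerate(text.lower().split()):
        yield token, (doc_id, position)

def shuffle(pairs):
    """Group emitted values by token, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(token, occurrences):
    """Sort the occurrences of one token into a posting list."""
    return token, sorted(occurrences)

def build_index(docs):
    """docs maps a document ID to its text."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(tok, occ) for tok, occ in shuffle(pairs).items())

index = build_index({1: "google indexes documents",
                     2: "documents contain tokens"})
print(index["documents"])
```

Each token ends up with a sorted list of (document, position) pairs, which is the basic shape of an inverted index.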

Each token is assigned a token ID, and its position within the document is recorded.
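A minimal sketch of that assignment, assuming IDs are simply handed out in order of first appearance (the patent may use a different scheme, and `assign_token_ids` is a hypothetical name):

```python
def assign_token_ids(documents):
    """documents maps a doc ID to its list of tokens.

    Returns the lexicon (token -> token ID) and, for each document,
    a list of (token ID, position) pairs.
    """
    lexicon = {}
    encoded = {}
    for doc_id, tokens in documents.items():
        entries = []
        for position, token in enumerate(tokens):
            # Reuse the existing ID, or hand out the next free one.
            token_id = lexicon.setdefault(token, len(lexicon))
            entries.append((token_id, position))
        encoded[doc_id] = entries
    return lexicon, encoded

lexicon, encoded = assign_token_ids({"d1": ["search", "engine", "search"]})
print(lexicon)
print(encoded["d1"])
```

The repeated token "search" gets the same ID both times, while its two positions stay distinct.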

Building the Lexicon with the Lexicon Generator

The lexicon generator, also known as the lexicon builder, is the software that generates the lexicon mappings encoding a set of parsed documents.

Here the goal of the lexicon builder is to:

  1. prepare the main dictionary (lexicon) of tokens,
  2. sort it in a way that related tokens are closer to each other,
  3. split it in smaller sub-dictionaries (mini-lexicons),
  4. generate the mappings to be able to map back smaller dictionaries to the main dictionary
  5. send the lexicons to be encoded and reduce their sizes
  6. send the compressed data to the tokenspace repository.
Structure of Lexicon Mappings at Google: Systems and methods for generating statistics from search engine query logs

The lexicon generator includes a global-lexicon builder and a mini-lexicon builder. The global lexicon stores all the tokens, and the mini-lexicons store mappings in groups to save storage space.
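Here is a rough sketch of how a global lexicon and mini-lexicons could fit together. Several things are assumptions of mine rather than details from the patent: the global lexicon is ordered by token frequency, each mini-lexicon covers a fixed-size chunk of positions, and all function names are hypothetical.

```python
from collections import Counter

def build_global_lexicon(token_stream):
    """Map each unique token to a global ID, most frequent first (assumption)."""
    counts = Counter(token_stream)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: gid for gid, tok in enumerate(ordered)}

def build_mini_lexicons(token_stream, global_lexicon, chunk_size):
    """Give each chunk of positions its own local ID -> global ID mapping."""
    global_ids = [global_lexicon[tok] for tok in token_stream]
    mini_lexicons, encoded_chunks = [], []
    for start in range(0, len(global_ids), chunk_size):
        chunk = global_ids[start:start + chunk_size]
        local_to_global = sorted(set(chunk))   # list index = local ID
        global_to_local = {gid: lid for lid, gid in enumerate(local_to_global)}
        mini_lexicons.append(local_to_global)
        encoded_chunks.append([global_to_local[gid] for gid in chunk])
    return mini_lexicons, encoded_chunks

def decode(mini_lexicons, encoded_chunks, global_lexicon):
    """Map local IDs back to global IDs, then back to tokens."""
    id_to_token = {gid: tok for tok, gid in global_lexicon.items()}
    return [id_to_token[mini[lid]]
            for mini, chunk in zip(mini_lexicons, encoded_chunks)
            for lid in chunk]

stream = ["the", "cat", "the", "dog", "the", "cat"]
glex = build_global_lexicon(stream)
minis, chunks = build_mini_lexicons(stream, glex, chunk_size=3)
print(decode(minis, chunks, glex) == stream)
```

The point of the mini-lexicons in this sketch is that local IDs within a chunk are small numbers, which take less space once encoded, while the mappings still allow a lossless trip back to the global lexicon.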

This information is sent to the query processing system that will decode the encoded tokens by matching them to the tokenspace repository.

Compressing the Lexicons with the Encoding System

The encoding system encodes documents into compressed representations that take fewer bytes than the unencoded data would.

The encoding and decoding systems are described in detail in the patent titled “System and method for encoding and decoding variable-length data”.
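To make the idea concrete, here is a generic sketch that combines delta encoding with a variable-length byte encoding. The exact byte layout used in the patent is not covered in this tutorial, so the format below (7 data bits per byte, with the high bit marking a continuation) is an assumption chosen because it is a common, well-known scheme.

```python
def delta_encode(sorted_values):
    """Replace each value with its gap from the previous value."""
    previous, gaps = 0, []
    for value in sorted_values:
        gaps.append(value - previous)
        previous = value
    return gaps

def delta_decode(gaps):
    """Rebuild the original values by summing the gaps."""
    total, values = 0, []
    for gap in gaps:
        total += gap
        values.append(total)
    return values

def varint_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; high bit = more bytes follow."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def varint_decode(data):
    """Inverse of varint_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers

positions = [3, 7, 300, 305, 70000]
encoded = varint_encode(delta_encode(positions))
print(len(encoded), "bytes")
print(delta_decode(varint_decode(encoded)) == positions)
```

Delta encoding turns large, increasing positions into small gaps, and the variable-length encoder then spends fewer bytes on small numbers than a fixed-width integer would.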

Storing the Compressed Inverted Index in the Tokenspace Repository

Which Patents Mention the Document Processing System?

Google Parent Infrastructure Involved

Where does the Document Processing System fit in?

  • Information Retrieval System
    • Document Processing System
    • Web Crawling System
      • Document Processing System

Google Children Infrastructure Involved

Here are some of the components used inside or along with the following:

  • Document Processing System
    • Document repositories
    • Lexicon Generator
    • Encoding System
      • Pre-processor
      • Delta encoder
      • Variable-length data encoder
    • Decoding System
      • Delta decoder
      • Variable-length data decoder
    • Tokenspace repository

Document Processing System is related to

The Document Processing System is often used in conjunction with the query processing system.

Definitions

Patent term | Definition
Document Processing System | System that prepares the inverted index from documents, pages or sites retrieved from the document corpus. It processes the contents using an automated or manual process.
Token | Any object found in a document (terms, phrases, punctuation marks, HTML tags).
Inverted Index | Data structure used by information retrieval systems that is composed of a vocabulary of terms, along with a list of term occurrences and locations in documents.
Tokenspace Repository | Tokenized collection of documents.
Lexicon | Other name for a dictionary of terms, or vocabulary.
Dictionary | Data structure that stores terms, or vocabulary.
Vocabulary | Set of terms stored in a dictionary.
Lexicon Generator | Software that generates the lexicon mappings encoding a set of parsed documents.
Lexicon Mappings | Data store for the mappings of all the unique tokens and token IDs.
Global-Lexicon | Data store for the mappings of all unique tokens and their global token identifiers in a set of documents.
Mini-Lexicon | Data store of sequences of mappings of unique tokens and their global token identifiers, used for encoding and decoding a specific range of positions in documents.
Query Processing System | Any computing or processing system that can take action responsive to a query.