Hashing is the process of transforming a key or a string of characters into another value.
The conversion is done using a hashing algorithm (function).
A checksum is a value computed from data with a hashing function, typically used to detect changes or errors in that data.
In this post, we will learn how hashing and checksums work and how Google uses checksums in search.
What is Hashing?
Hashing uses a function or algorithm to convert object data into a fixed-size value, typically an integer.
Hashing is used to store, process and retrieve data more efficiently.
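As a minimal sketch of this idea, the Python snippet below maps strings of very different lengths to integers of a bounded size. The helper name `hash_to_int` and the choice of SHA-256 are assumptions made for illustration; many other hash functions would work.

```python
import hashlib

def hash_to_int(text: str) -> int:
    """Map a string to an integer using SHA-256 (an assumed choice)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # always 32 bytes
    return int.from_bytes(digest, "big")

short_value = hash_to_int("cat")
long_value = hash_to_int("cat " * 10_000)

# Both values fit in 256 bits, no matter how long the input was.
print(short_value.bit_length() <= 256, long_value.bit_length() <= 256)  # True True
# The same input always produces the same value.
print(hash_to_int("cat") == short_value)  # True
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because the built-in is salted per process for strings, so its values are not repeatable between runs.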
What is a Checksum?
A checksum is a number that is calculated based on the data held in a file.
A checksum can represent a file (or other data structures) as a fixed length value based on the sum of the contents of the file (such as the sum of the bytes in the file).
How Do Checksums Work?
In their simplest form, checksums work by calculating the sum of the bytes in a file.
A checksum may be calculated from the complete contents of a file. It can also be calculated from a portion, a modified version or a normalized version of a file.
For a text file, for instance, all the letters of the document are encoded as bytes (groups of eight 0s and 1s).
Then, all the bytes on the page are summed and the result is reduced to a fixed-length value.
Finally, two documents can be compared by comparing their respective checksums: identical documents always produce identical checksums.
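The steps above can be sketched in a few lines of Python. This is a toy byte-sum checksum, not what any production system uses; the function name `byte_sum_checksum` and the 16-bit width are assumptions chosen for the example.

```python
def byte_sum_checksum(data: bytes, bits: int = 16) -> int:
    """Sum all bytes and keep only the lowest `bits` bits (fixed length)."""
    return sum(data) % (1 << bits)

# Encode the text of each document to bytes first.
doc_a = "Hashing turns data into a fixed-size value.".encode("utf-8")
doc_b = "Hashing turns data into a fixed-size value.".encode("utf-8")
doc_c = "Hashing turns data into a fixed-size value!".encode("utf-8")

print(byte_sum_checksum(doc_a) == byte_sum_checksum(doc_b))  # True: same contents
print(byte_sum_checksum(doc_a) == byte_sum_checksum(doc_c))  # False: one byte differs
```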
Difference Between a Checksum and a Hash
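Checksums and hashes are often used interchangeably, but they have slight differences.
In a nutshell, a checksum is a hash, but a hash isn’t necessarily a checksum. A checksum is meant to detect changes or errors in data, while hashes are also used for other purposes such as fast lookups and cryptography.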
Hashing Applications
- Cryptography (for example, password storage and digital signatures)
- Storage
- Performance
Why Use Checksums?
Why use checksums to compare data instead of a byte-by-byte comparison?
The answer: because a checksum is much smaller (for example, 256 bits).
A byte-by-byte comparison requires having a full copy of both files, which can be very large (gigabytes).
A checksum is small enough to be treated as file metadata.
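To make the size difference concrete, here is a hedged sketch that computes a 256-bit checksum for a file of any size by reading it in chunks. The helper name `file_checksum`, the chunk size and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
import os
import tempfile

def file_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a throwaway 1 MB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1_000_000)
    demo_path = tmp.name

checksum = file_checksum(demo_path)
print(os.path.getsize(demo_path))  # 1000000 bytes of content
print(len(checksum) * 4)           # 256 bits of checksum
os.remove(demo_path)
```

Because the file is read in chunks, memory use stays small even for very large files, and only the short digest needs to be stored alongside the file.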
How can Checksums be Used?
Checksums can be used in many ways:
- in search engines to check for duplicate documents,
- in engineering to check for corrupted files,
- in cryptography to verify that data was not altered in transit.
Example of How Google Uses Checksums
Why is hashing important in search engines?
If you listen to “Search Off the Record”, you may have heard Gary Illyes talk about hashing and its role in search.
Hashing helps reduce the memory required to process large sets of text data by converting text into hashes.
- It is easier to compare a short string than a 20,000-word article.
At Google, they use different hashing algorithms to hash the main content (MC) and compare the centrepiece content of each page to identify the canonical. They compare checksums to identify:
- Duplicate content
- Whether a file has changed since the last time the crawler visited the site
- News recommendations
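As an illustration of hashing a normalized version of the content to spot duplicates, consider the sketch below. It is not Google's actual pipeline: the `normalize` rules (lowercasing and collapsing whitespace), the function names and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace runs (assumed rules)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def content_checksum(text: str) -> str:
    """Checksum of the normalized text, as a hex string."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

page_a = "Hashing  helps reduce\nmemory usage."
page_b = "hashing helps reduce memory usage."
page_c = "Hashing helps reduce disk usage."

# Formatting differences disappear after normalization.
print(content_checksum(page_a) == content_checksum(page_b))  # True
# A real change in wording produces a different checksum.
print(content_checksum(page_a) == content_checksum(page_c))  # False
```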
Comparing Files
A checksum can be used to evaluate the redundancy between documents.
As we’ve discussed, a file checksum is a number calculated from the data. Two files with the same contents will have the same checksum.
With similarity hashes such as simhash, very similar files will have closer hash values than different files. Cryptographic hashes do not have this property: a one-byte change produces a completely different value.
When two different files have the same checksum, it is called a collision. Collisions should be avoided as much as possible.
So, in order to identify duplicates, Google reduces content to hashes or checksums and compares the values.
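A collision is easy to produce with a naive checksum. The sketch below uses a toy byte-sum function (the name `byte_sum` and the 16-bit width are assumptions) and contrasts it with SHA-256 on the same inputs.

```python
import hashlib

def byte_sum(data: bytes) -> int:
    """Toy checksum: sum of the bytes, reduced to 16 bits."""
    return sum(data) % (1 << 16)

a, b = b"listen", b"silent"  # same bytes, different order

print(a == b)                      # False: the contents differ
print(byte_sum(a) == byte_sum(b))  # True: the toy checksum collides
print(hashlib.sha256(a).digest() == hashlib.sha256(b).digest())  # False
```

Summing ignores byte order, so any reordering of the same bytes collides. Stronger functions mix position into the result, which makes collisions far less likely.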
How Does Googlebot Use the Checksum?
Googlebot looks at the Last-Modified response header.
It sends an If-Modified-Since request header using the last crawled date of the document.
If the server sends a 304 Not Modified response to the crawler, the document will not be processed further.
However, if the Last-Modified header is not present, Googlebot will download the file, compute the checksum of the content and compare it to the checksum of the content from the last crawl.
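The decision described above can be modelled as a small function. This is a simplified sketch, not Googlebot's real code: the function names, the SHA-256 checksum and the idea of passing the status code and body as arguments are assumptions made to keep the example self-contained, with no network access.

```python
import hashlib
from typing import Optional

def checksum(body: bytes) -> str:
    """Checksum of a response body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body).hexdigest()

def should_reprocess(status: int, body: bytes,
                     previous_checksum: Optional[str]) -> bool:
    """Return True when the fetched document needs to be processed again."""
    if status == 304:
        # The conditional request was answered with "Not Modified".
        return False
    # No usable date signal: fall back to comparing content checksums.
    return checksum(body) != previous_checksum

old = checksum(b"<html>v1</html>")
print(should_reprocess(304, b"", old))                  # False: server says unchanged
print(should_reprocess(200, b"<html>v1</html>", old))   # False: same checksum
print(should_reprocess(200, b"<html>v2</html>", old))   # True: content changed
print(should_reprocess(200, b"<html>v1</html>", None))  # True: never crawled before
```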
Where is the Checksum Used?
Checksums are used in many places at Google:
- Googlebot
- Google Indexing System
- Google Search Appliance
Google reported using simhash for crawling, and MinHash and LSH (locality-sensitive hashing) for Google News personalization.
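To give a feel for how simhash behaves, here is a minimal unweighted sketch that treats each word as a feature. It is an illustration under stated assumptions, not Google's implementation: the 64-bit size, the word-level tokenization, the per-word hash built from SHA-256 and all function names are choices made for this example.

```python
import hashlib

BITS = 64  # fingerprint size chosen for this sketch

def word_hash(word: str) -> int:
    """Hash one word to a 64-bit integer (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(text: str) -> int:
    """Each word votes +1 or -1 on every bit; positive totals become 1-bits."""
    votes = [0] * BITS
    for word in text.lower().split():
        h = word_hash(word)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

original = "google uses hashing to detect duplicate pages on the web"
near_dup = "google uses hashing to detect duplicate pages on the internet"
unrelated = "the recipe needs two eggs and a cup of flour"

print(hamming_distance(simhash(original), simhash(near_dup)))
print(hamming_distance(simhash(original), simhash(unrelated)))
```

Documents that share most of their words cast mostly the same votes, so their fingerprints usually differ in only a few bits, while unrelated documents tend to differ in roughly half of them. A near-duplicate detector would then flag pairs whose Hamming distance falls below a chosen threshold.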
Which Google Patents Mention Checksums?
- Updating search engine document index based on calculated age of changed portions in a document
- Predictive-based clustering with representative redirect targets
- Document Near-Duplicate Detector
- Scheduler for a search engine crawler
- Deduplication in Search Results
Other Names for Checksums
Although not exactly the same, checksums are often referred to as hashes.
Definitions
Term | Definition |
---|---|
checksum | Value computed from a data element and used to detect changes or errors in it |
hash function | Function that can be used to map data to a fixed-sized value |
hashing | Process of using a hashing algorithm to convert data to a fixed-sized value |
MD5 Algorithm | One of the most widely used hashing algorithms, producing a 128-bit hash value |
SHA (Secure Hash Algorithm) | Family of popular hashing algorithms, including SHA-1 and SHA-256 |
Conclusion
To conclude, hashing and checksums are widely used on the Internet to convert data of any size into a fixed-size value.
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.