What is Hashing (and how it works). Checksums, hash functions and more!

Hashing is the process of transforming a key or a string of characters into another value.

The conversion is done using a hashing algorithm (function).

A checksum is a value produced by a hashing function, typically used to check whether data has changed.



In this post, we will learn how hashing and checksums work and how Google uses checksums in search.

What is Hashing?

Hashing uses a function or algorithm to convert data of any size into a fixed-size value, often represented as an integer.

Hashing is used to store, process and retrieve data more efficiently.
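
As a minimal sketch of the idea, the Python snippet below hashes two strings of very different lengths. The helper name `hash_text` is hypothetical, and SHA-256 (from the standard `hashlib` module) is only one of many algorithms that could be used here.

```python
import hashlib


def hash_text(text: str) -> int:
    """Hash a string to an integer using SHA-256 (one possible algorithm)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # always 32 bytes
    return int.from_bytes(digest, "big")


short_value = hash_text("hello")
long_value = hash_text("hello " * 10_000)

# Both values fit in 256 bits, whatever the length of the input.
print(short_value.bit_length() <= 256, long_value.bit_length() <= 256)
# The same input always produces the same value.
print(hash_text("hello") == short_value)
```

Because the output size never grows with the input, the hash can stand in for the original data when storing or comparing it.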

What is a Checksum?

A checksum is a number that is calculated based on the data held in a file.

A checksum can represent a file (or other data structures) as a fixed length value based on the sum of the contents of the file (such as the sum of the bytes in the file).

How Do Checksums Work?

In their simplest form, checksums work by calculating the sum of the bytes in a file.

Checksums may be calculated based on the complete contents of a file. They can also be calculated based on a portion, a modified version or a normalized version of a file.

On text files, for instance, all the letters of the document are converted to bytes (sequences of 0s and 1s).

Then, all the bytes on the page are summed, and the result is reduced to a fixed-length value.

Finally, two documents can be compared through their respective checksums, for example by subtracting one from the other: a result of zero means the checksums match.
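
The steps above can be sketched in a few lines of Python. This is a toy sum-of-bytes checksum for illustration only; the function name `simple_checksum` and the 16-bit output size are assumptions, not a standard algorithm.

```python
def simple_checksum(data: bytes, bits: int = 16) -> int:
    """Sum all bytes and keep only the lowest `bits` bits (a fixed-length value)."""
    return sum(data) % (1 << bits)


# Text is first converted to bytes, then summed.
doc_a = "Hashing is fun".encode("utf-8")
doc_b = "Hashing is fun".encode("utf-8")
doc_c = "Hashing is fun!".encode("utf-8")

# A difference of zero means the two checksums match.
print(simple_checksum(doc_a) - simple_checksum(doc_b))
print(simple_checksum(doc_a) - simple_checksum(doc_c))
```

The first difference is zero because the contents are identical; the second is not, because one extra character changes the sum.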

Difference Between a Checksum and a Hash

Checksums and similarity hashes are often used interchangeably, but they have slight differences.

In a nutshell, a checksum is a hash, but a hash isn’t necessarily a checksum.

Hashing Applications

  • Cryptography
  • Storage
  • Performance

Why Use Checksums?

Why use checksums to compare data over byte-by-byte comparison?

The answer: because a checksum is much smaller than the file it represents (for example, 256 bits).

Byte-by-byte comparison requires having an entire copy of each file, which can be very large (gigabytes).

A checksum is small enough to be treated as file metadata.
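
To make the size difference concrete, here is a hedged sketch that computes a 256-bit checksum for a file of about 5 MB. The helper `file_checksum` is a hypothetical name, and SHA-256 is assumed as the algorithm because it produces 256-bit values.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    big_file = Path(tmp) / "big.bin"
    big_file.write_bytes(b"\x00" * 5_000_000)  # about 5 MB of data
    checksum = file_checksum(big_file)
    print(big_file.stat().st_size, "bytes in the file")
    print(len(bytes.fromhex(checksum)) * 8, "bits in the checksum")
```

Reading the file in chunks keeps memory usage low, so only the small checksum needs to be stored for later comparisons.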

How can Checksums be Used?

Checksums can be used in many ways:

  • in search engines, to check for duplicate documents;
  • in engineering, to check for corrupted files;
  • in cryptography, to verify that data was not altered in transit.

Example of How Google Uses Checksums

Why is hashing important in search engines?

If you listen to “Search Off the Record”, you may have heard Gary Illyes talk about hashing and its role in search.

Hashing helps reduce the memory required to process large sets of text data by converting text into hashes.

  • It is easier to compare two short strings than two 20,000-word articles.

At Google, they use different hashing algorithms to hash the main content (MC) and compare each centrepiece content to identify the canonical. They compare checksums to identify:

  • Duplicate content
  • If a file has changed since the last time the crawler visited the site
  • News recommendations

Comparing Files

A checksum can be used to evaluate the redundancy between documents.

As we’ve discussed, a file checksum is a number calculated from the data. Two files with the same contents will have the same checksum.

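With similarity hashes, very similar files will have closer values than different files. With cryptographic hash functions, by contrast, even a small change produces a completely different value.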
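
The idea of "closer" values can be sketched with a toy similarity hash in the style of SimHash. This is a simplified illustration, not Google's implementation: the 64-bit size, the word-level features and the use of SHA-256 to hash each word are all assumptions made for the example.

```python
import hashlib

BITS = 64  # fingerprint size chosen for this sketch


def simhash(text: str) -> int:
    """Compute a toy SimHash fingerprint over whitespace-separated words."""
    votes = [0] * BITS
    for word in text.lower().split():
        # Hash each word to a 64-bit integer (first 8 bytes of its SHA-256 digest).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (word_hash >> i) & 1 else -1
    # A fingerprint bit is set when the votes for that position are positive.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


base = "hashing helps search engines compare large documents quickly and cheaply"
near = "hashing helps search engines compare large documents quickly and reliably"
far = "the recipe calls for two eggs a cup of flour and some butter"

print("near-duplicate:", hamming_distance(simhash(base), simhash(near)))
print("unrelated:", hamming_distance(simhash(base), simhash(far)))
```

The near-duplicate pair should usually print a smaller distance than the unrelated pair, although the exact numbers depend on the hash function and the texts used.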

When two different files have the same checksum, it is called a collision. This should generally be avoided as much as possible.
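
A collision is easy to produce with a toy sum-of-bytes checksum, because the order of the bytes does not affect the sum. The function `byte_sum` below is a hypothetical helper written only to show the problem.

```python
def byte_sum(data: bytes) -> int:
    """Toy checksum: the sum of all bytes, kept to 16 bits."""
    return sum(data) % 65536


first = b"listen"
second = b"silent"  # same letters, different order

# Different contents, same checksum: a collision.
print(first != second and byte_sum(first) == byte_sum(second))
```

Stronger hash functions are designed to make such collisions far less likely.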

So, in order to identify duplicates, Google will reduce content into hashes or checksums and compare the values.
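
Here is a rough sketch of that approach: each page is reduced to a hash of a normalized version of its text, and pages sharing a hash are grouped together. The function names, the normalization rules and the example URLs are all hypothetical; Google's actual pipeline is not described in this level of detail.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Hash a normalized version of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict) -> list:
    """Group URLs whose normalized content hashes to the same value."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_hash(text)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


pages = {
    "https://example.com/a": "Hashing is fun.",
    "https://example.com/a?ref=home": "Hashing   is FUN.",
    "https://example.com/b": "Checksums detect changes.",
}
print(find_duplicates(pages))
```

Only the two `/a` URLs end up in the same group, since their text is identical once case and spacing are normalized.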

How Does Googlebot Use the Checksum?

Googlebot looks at the Last-Modified response header.

It sends an If-Modified-Since request header using the date the document was last crawled.

If the server sends a 304 Not Modified response to the crawler, the document will not be processed further.

However, if the Last-Modified header is not present, Googlebot will download the file, compute the checksum of the content and compare it to the checksum of the content from the last crawl.
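
The decision flow described above could look something like the sketch below. It is only an illustration with no network calls: the function `needs_processing`, its parameters and the choice of SHA-256 are assumptions, and the branch for a response that does carry a Last-Modified header is a guess, since that case is not covered above.

```python
import hashlib
from typing import Optional


def needs_processing(status_code: int, last_modified: Optional[str],
                     body: bytes, previous_checksum: Optional[str]) -> bool:
    """Decide whether a fetched document should be processed further."""
    if status_code == 304:
        # The server confirmed nothing changed since the last crawl.
        return False
    if last_modified is not None:
        # Assumption: a full response with a Last-Modified header is treated as changed.
        return True
    # No Last-Modified header: fall back to comparing content checksums.
    current_checksum = hashlib.sha256(body).hexdigest()
    return current_checksum != previous_checksum


old_checksum = hashlib.sha256(b"<html>same page</html>").hexdigest()
print(needs_processing(304, None, b"", old_checksum))
print(needs_processing(200, None, b"<html>same page</html>", old_checksum))
print(needs_processing(200, None, b"<html>new page</html>", old_checksum))
```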

Where is the Checksum Used?

Checksums are used in many places at Google:

  • Googlebot
  • Google Indexing System
  • Google Search Appliance

Google reported using SimHash for crawling, and MinHash and locality-sensitive hashing (LSH) for Google News personalization.
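
To give a feel for MinHash, here is a small sketch that estimates how similar two texts are from short signatures. It is a simplified illustration under stated assumptions: two-word shingles, 64 seeded hash functions built from SHA-256, and hypothetical function names. It does not include the LSH step, and it is not Google's implementation.

```python
import hashlib


def shingles(text: str, size: int = 2) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the smallest hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_1 = minhash_signature(shingles("google news recommends articles you may like"))
sig_2 = minhash_signature(shingles("google news recommends stories you may like"))
print(estimated_similarity(sig_1, sig_2))
```

The signatures have a fixed length regardless of document size, so they can be compared much faster than the full sets of shingles.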

Which Google Patents Mention Checksums?

Other Names for Checksums

Although not exactly the same, checksums are often referred to as hashes.

Definitions

Term | Definition
checksum | Value calculated from the contents of a data element, used to detect changes or errors
hash function | Function that can be used to map data to a fixed-size value
hashing | Process of using a hashing algorithm to convert data to a fixed-size value
MD5 algorithm | One of the most widely used hashing algorithms, producing a 128-bit hash value
SHA (Secure Hash Algorithm) | Family of popular hashing algorithms
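
The output sizes of these algorithms can be checked with Python's standard `hashlib` module. The helper name `digest_bits` is just a convenience written for this example.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return the output size, in bits, of a hashlib algorithm."""
    return hashlib.new(algorithm).digest_size * 8


for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, digest_bits(name))
```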

Conclusion

To conclude, hashing and checksums are widely used on the Internet to convert data to a fixed-size hash.
