What is a “Document” for Google?

In this post, I explain what Google means when it refers to a “document”.

What is a Document?

According to various patents, a “document” is any machine-readable and machine-storable work product [1].

More broadly, according to Google’s Gary Illyes, a document can be:


[…] any content that Google Search is able to index at the moment

Search Off the Record [2]

Examples of Documents

Various patents list examples of documents:

  • HTML web pages
  • Blog posts
  • Image files
  • Video files
  • PDFs
  • Spreadsheets
  • Google Docs
  • E-mails
  • Web sites
  • Files
  • Combinations of files
  • Newsgroup postings
  • Web advertisements
  • Yellow pages entries
  • Scanned books
  • Electronic versions of printed text
  • One or more files with embedded links to other files

Most Common Document

The most common type of document is a web page.

Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as Javascript, etc.). A “link”, as the term is used here, is to be broadly interpreted to include any reference to or from a document.

Documents Indexable by Google

If we rely on Gary Illyes’ definition, under which a document is any content that Google can index, then the list of file types that Google can index tells us what counts as a document:

  • Adobe Portable Document Format (.pdf)
  • Adobe PostScript (.ps)
  • Google Earth (.kml, .kmz)
  • GPS eXchange Format (.gpx)
  • Hancom Hanword (.hwp)
  • HTML (.htm, .html, other file extensions)
  • Microsoft Excel (.xls, .xlsx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Microsoft Word (.doc, .docx)
  • OpenOffice presentation (.odp)
  • OpenOffice spreadsheet (.ods)
  • OpenOffice text (.odt)
  • Rich Text Format (.rtf)
  • Scalable Vector Graphics (.svg)
  • TeX/LaTeX (.tex)
  • Text (.txt, .text, other file extensions), including source code in common programming languages:
    • Basic source code (.bas)
    • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
    • C# source code (.cs)
    • Java source code (.java)
    • Perl source code (.pl)
    • Python source code (.py)
  • Wireless Markup Language (.wml, .wap)
  • XML (.xml)
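
To make the list above easier to apply, here is a minimal Python sketch that looks up a URL's file extension in a table copied from that list. The names `INDEXABLE_FORMATS` and `listed_format` are hypothetical and chosen for illustration; this is not how Google detects file types, only a convenience for checking a URL against the list.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions copied from the list above. The table name is hypothetical.
INDEXABLE_FORMATS = {
    ".pdf": "Adobe Portable Document Format",
    ".ps": "Adobe PostScript",
    ".kml": "Google Earth", ".kmz": "Google Earth",
    ".gpx": "GPS eXchange Format",
    ".hwp": "Hancom Hanword",
    ".htm": "HTML", ".html": "HTML",
    ".xls": "Microsoft Excel", ".xlsx": "Microsoft Excel",
    ".ppt": "Microsoft PowerPoint", ".pptx": "Microsoft PowerPoint",
    ".doc": "Microsoft Word", ".docx": "Microsoft Word",
    ".odp": "OpenOffice presentation",
    ".ods": "OpenOffice spreadsheet",
    ".odt": "OpenOffice text",
    ".rtf": "Rich Text Format",
    ".svg": "Scalable Vector Graphics",
    ".tex": "TeX/LaTeX",
    ".txt": "Text", ".text": "Text",
    ".bas": "Basic source code",
    ".c": "C/C++ source code", ".cc": "C/C++ source code",
    ".cpp": "C/C++ source code", ".cxx": "C/C++ source code",
    ".h": "C/C++ source code", ".hpp": "C/C++ source code",
    ".cs": "C# source code",
    ".java": "Java source code",
    ".pl": "Perl source code",
    ".py": "Python source code",
    ".wml": "Wireless Markup Language", ".wap": "Wireless Markup Language",
    ".xml": "XML",
}


def listed_format(url: str) -> Optional[str]:
    """Return the format name for the URL's extension, or None if unlisted."""
    path = urlparse(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix.lower()  # e.g. ".PDF" becomes ".pdf"
    return INDEXABLE_FORMATS.get(suffix)


print(listed_format("https://example.com/files/report.PDF?download=1"))
print(listed_format("https://example.com/about"))
```

A `None` result only means the extension is not in the table. The list itself notes that HTML and text can be served under other file extensions, so a URL such as `/about` may still point to an indexable document.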

Which Patents Mention Documents?

Most patents related to search engines mention documents at some point.

Definitions

Patent term | Definition
Document | Any machine-readable and machine-storable work product
Link | Any reference to or from a document

Sources

Conclusion

We have now covered what Google means when it refers to a document.
