What is a “Document” for Google?

In this post, I will explain what Google means when it refers to a “document”.

What is a Document?

According to various patents, a “document” is any machine-readable and machine-storable work product [1].

More broadly, according to Google’s Gary Illyes, a document can be:

    […] any content that Google Search is able to index at the moment

    Search Off the Record [2]

    Examples of Documents

    Various patents list examples of documents:

    • HTML web pages
    • Blog posts
    • Image files
    • Video files
    • PDFs
    • Spreadsheets
    • Google Docs
    • E-mails
    • Web sites
    • Files
    • Combinations of files
    • Newsgroup postings
    • Web advertisements
    • Yellow pages entries
    • Scanned books
    • Electronic versions of printed text
    • One or more files with embedded links to other files

    Most Common Document

    The most common type of document is a web page.

     Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as Javascript, etc.). A “link” as the term is used here, is to be broadly interpreted to include any reference to or from a document.
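To make the description above concrete, here is a minimal Python sketch that separates a web page's textual information from its embedded information (meta information, images, hyperlinks) while skipping embedded instructions such as JavaScript. It only uses the standard library's `html.parser`. The names `EmbeddedInfoParser`, `parse_page`, and `SAMPLE_PAGE` are made up for this illustration, and nothing here is meant to reflect how Google actually processes pages.

```python
from html.parser import HTMLParser


class EmbeddedInfoParser(HTMLParser):
    """Hypothetical helper: splits a page into text and embedded information."""

    def __init__(self):
        super().__init__()
        self.text = []     # textual information
        self.meta = {}     # meta information (name -> content)
        self.images = []   # image sources
        self.links = []    # hyperlink targets (references to other documents)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "script":
            # Embedded instructions: remember to skip their content.
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Keep non-empty text that is not inside a <script> element.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


# Made-up sample page used only for this illustration.
SAMPLE_PAGE = """
<html><head>
<meta name="description" content="A sample page">
<script>console.log("hi");</script>
</head><body>
<p>Hello <a href="https://example.com/other">other page</a></p>
<img src="cat.png">
</body></html>
"""


def parse_page(html):
    """Parse an HTML string and return the populated parser."""
    parser = EmbeddedInfoParser()
    parser.feed(html)
    parser.close()
    return parser


if __name__ == "__main__":
    page = parse_page(SAMPLE_PAGE)
    print("text:  ", page.text)
    print("meta:  ", page.meta)
    print("images:", page.images)
    print("links: ", page.links)
```

Note that this sketch only collects outgoing `<a href>` links, which is narrower than the patent wording, where a link is any reference to or from a document.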

    Documents Indexable by Google

    If we rely on Gary Illyes’ definition, under which a document is any content that Google can index, then the list of file types indexable by Google tells us what can count as a document:

    • Adobe Portable Document Format (.pdf)
    • Adobe PostScript (.ps)
    • Google Earth (.kml, .kmz)
    • GPS eXchange Format (.gpx)
    • Hancom Hanword (.hwp)
    • HTML (.htm, .html, other file extensions)
    • Microsoft Excel (.xls, .xlsx)
    • Microsoft PowerPoint (.ppt, .pptx)
    • Microsoft Word (.doc, .docx)
    • OpenOffice presentation (.odp)
    • OpenOffice spreadsheet (.ods)
    • OpenOffice text (.odt)
    • Rich Text Format (.rtf)
    • Scalable Vector Graphics (.svg)
    • TeX/LaTeX (.tex)
    • Text (.txt, .text, other file extensions), including source code in common programming languages:
      • Basic source code (.bas)
      • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
      • C# source code (.cs)
      • Java source code (.java)
      • Perl source code (.pl)
      • Python source code (.py)
    • Wireless Markup Language (.wml, .wap)
    • XML (.xml)
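The list above can be turned into a simple lookup. The following Python sketch maps a URL's file extension to the format names listed above. The function name `indexable_format` and the example URLs are assumptions made for this illustration. A miss does not prove that a file is not indexable, since the list itself says HTML and text files can use other file extensions, and the sketch says nothing about how Google decides what to index.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the list above, mapped to their format names.
INDEXABLE_EXTENSIONS = {
    ".pdf": "Adobe Portable Document Format",
    ".ps": "Adobe PostScript",
    ".kml": "Google Earth",
    ".kmz": "Google Earth",
    ".gpx": "GPS eXchange Format",
    ".hwp": "Hancom Hanword",
    ".htm": "HTML",
    ".html": "HTML",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".odp": "OpenOffice presentation",
    ".ods": "OpenOffice spreadsheet",
    ".odt": "OpenOffice text",
    ".rtf": "Rich Text Format",
    ".svg": "Scalable Vector Graphics",
    ".tex": "TeX/LaTeX",
    ".txt": "Text",
    ".text": "Text",
    ".bas": "Basic source code",
    ".c": "C/C++ source code",
    ".cc": "C/C++ source code",
    ".cpp": "C/C++ source code",
    ".cxx": "C/C++ source code",
    ".h": "C/C++ source code",
    ".hpp": "C/C++ source code",
    ".cs": "C# source code",
    ".java": "Java source code",
    ".pl": "Perl source code",
    ".py": "Python source code",
    ".wml": "Wireless Markup Language",
    ".wap": "Wireless Markup Language",
    ".xml": "XML",
}


def indexable_format(url):
    """Return the listed format name for a URL's extension, or None.

    None only means the extension is not in the list above; HTML and
    text documents may be served under other extensions.
    """
    path = urlparse(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix.lower()
    return INDEXABLE_EXTENSIONS.get(suffix)


if __name__ == "__main__":
    for url in (
        "https://example.com/files/report.PDF",
        "https://example.com/main.py?raw=1",
        "https://example.com/",
    ):
        print(url, "->", indexable_format(url))
```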

    Which Patents Mention Documents?

    Most patents related to search engines mention documents at some point.

    Definitions

    Patent term | Definition
    Document | Any machine-readable and machine-storable work product
    Link | Any reference to or from a document

    Sources

    Conclusion

    We have now covered what Google means when it refers to a document.
