In this post, I explain what Google means when it refers to a “document”.
What is a Document?
According to various patents, a “document” is any machine-readable and machine-storable work product [1].
More broadly, according to Google’s Gary Illyes, a document can be:
[…] any content that Google Search is able to index at the moment
Search Off the Record [2]
Examples of Documents
The patents list several examples of documents:
- HTML web pages
- Blog posts
- Image files
- Video files
- Spreadsheets
- Google Docs
- E-mails
- Web sites
- Files
- Combinations of files
- Newsgroup postings
- Web advertisements
- Yellow pages entries
- Scanned books
- Electronic versions of printed text
- One or more files with embedded links to other files
Most Common Document
As the patents put it, a common document is a web page.
Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as JavaScript, etc.). A “link”, as the term is used here, is to be broadly interpreted to include any reference to or from a document.
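To make the description above concrete, here is a minimal sketch that splits a web page into its textual information, embedded information (meta information, images, hyperlinks) and embedded instructions (scripts). It uses only Python's standard `html.parser` module. The class name `EmbeddedInfoParser` and the sample page are assumptions made for illustration; this is not how Google parses documents.

```python
from html.parser import HTMLParser


class EmbeddedInfoParser(HTMLParser):
    """Hypothetical helper: separates a page's text from its embedded parts."""

    def __init__(self):
        super().__init__()
        self.text = []      # visible textual information
        self.meta = {}      # meta information: name -> content
        self.images = []    # image sources
        self.links = []     # hyperlinks (references to other documents)
        self.scripts = 0    # number of embedded instruction blocks
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "script":
            self.scripts += 1
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies are instructions, not text, so they are skipped here.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


# Illustrative page, invented for this example.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Example page</title>
    <meta name="description" content="A sample document">
    <script>console.log("hello");</script>
  </head>
  <body>
    <p>Some text with a <a href="https://example.com/other">link</a>.</p>
    <img src="/images/photo.png" alt="A photo">
  </body>
</html>
"""

parser = EmbeddedInfoParser()
parser.feed(SAMPLE_PAGE)
parser.close()

print(parser.text)
print(parser.meta)
print(parser.images)
print(parser.links)
print(parser.scripts)
```

In this sketch, each entry in `links` is a reference from the page to another document, which matches the broad definition of a “link” given above.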
Documents Indexable by Google
If we rely on Gary Illyes’ definition, which covers any content Google Search can index, then Google’s list of indexable file types gives us a concrete set of documents:
- Adobe Portable Document Format (.pdf)
- Adobe PostScript (.ps)
- Google Earth (.kml, .kmz)
- GPS eXchange Format (.gpx)
- Hancom Hanword (.hwp)
- HTML (.htm, .html, other file extensions)
- Microsoft Excel (.xls, .xlsx)
- Microsoft PowerPoint (.ppt, .pptx)
- Microsoft Word (.doc, .docx)
- OpenOffice presentation (.odp)
- OpenOffice spreadsheet (.ods)
- OpenOffice text (.odt)
- Rich Text Format (.rtf)
- Scalable Vector Graphics (.svg)
- TeX/LaTeX (.tex)
- Text (.txt, .text, other file extensions), including source code in common programming languages:
  - Basic source code (.bas)
  - C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
  - C# source code (.cs)
  - Java source code (.java)
  - Perl source code (.pl)
  - Python source code (.py)
- Wireless Markup Language (.wml, .wap)
- XML (.xml)
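The file types listed above can be turned into a quick lookup. The sketch below checks whether a URL's file extension appears in that list. The function name `has_listed_extension` and the example URLs are assumptions for illustration, and the extension set is just a copy of the list above. Because HTML and text files can use other extensions, `False` only means "not in the list", not "not indexable".

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the list above (a snapshot, not an authoritative source).
INDEXABLE_EXTENSIONS = {
    ".pdf", ".ps", ".kml", ".kmz", ".gpx", ".hwp", ".htm", ".html",
    ".xls", ".xlsx", ".ppt", ".pptx", ".doc", ".docx",
    ".odp", ".ods", ".odt", ".rtf", ".svg", ".tex", ".txt", ".text",
    ".bas", ".c", ".cc", ".cpp", ".cxx", ".h", ".hpp", ".cs", ".java",
    ".pl", ".py", ".wml", ".wap", ".xml",
}


def has_listed_extension(url: str) -> bool:
    """Return True if the URL path ends with an extension from the list."""
    path = urlparse(url).path                    # drops query string and fragment
    suffix = PurePosixPath(path).suffix.lower()  # e.g. ".PDF" -> ".pdf"
    return suffix in INDEXABLE_EXTENSIONS


# Hypothetical URLs used only to exercise the function.
for url in (
    "https://example.com/report.PDF?download=1",
    "https://example.com/archive.zip",
    "https://example.com/",
):
    print(url, has_listed_extension(url))
```

A page served without an extension, such as the last URL, returns `False` here even though it is most likely an HTML document, so the result is a rough filter at best.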
Which Patents Mention Documents?
Most patents related to search engines mention documents at some point.
Definitions
Patent term | Definition |
---|---|
Document | Any machine-readable and machine-storable work product |
Link | Any reference to or from a document |
Sources
- [1] Systems and methods for determining document freshness
- [2] Search Off the Record
- [3] Changing a rank of a document by applying a rank transition function
- [4] Updating search engine document index based on calculated age of changed portions in a document
Conclusion
We have now covered what a document is when Google refers to one.
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.