Web Scraping with XPath (with Python Example)

In this tutorial, you will learn what Xpath is and how to leverage Xpath in web scraping.

XPath allows you to locate exact elements within an HTML document. It is also supported by most web scraping tools, making it a super useful tool for web scraping.

What is Xpath

Xpath, or XML Path Language, is a query language that can be used to access different elements and attributes of an XML or an HTML document.

Join the Newsletter

    The Xpath notation is used to navigate nodes in a path like syntax and using conditions to extract specific information.

    Why Xpath is Useful in Web Scraping

    XPath is very useful in web scraping. Xpath allows you to:

    • locate the element you want to extract from a webpage,
    • identify and extract data from HTML and XML documents quickly.
    • automate the scraping of webpages.

    Simplest Way to Find the XPath in Chrome

    Chrome DevTools as an incredible feature that allows you to find the XPath of any DOM element without any prior knowledge.

    Open Chrome DevTool with Command + Shift + I, or by using right-click > inspect.

    Then right click on any element in the DOM and select Copy > Copy XPath.

    Basics of XPath Expressions

    Xpath expressions are strings used to describe the location of an element (node), or multiple elements, within an HTML (or XML) document.

    For example, the xpath below locates the h1, found within the HTML body element:

    //html/body/h1
    

    Basic Structure of the XPath Expression

    The basic structure of an XPath expression is similar to the structure used to navigate a URL.

    The XPath expression is represented by a series of steps. Each step is separated by forward slashes (/), which moves forward one generation.

    Each step contains any of these elements:

    • element name: html, body, div, etc.,
    • attribute name: id, class, href, etc.,
    • function call: text(), count(), etc.,
    • wildcard character: *.

    Brackets ([]) can be used after a tag name to define which sibling should be chosen.

    Take the Xpath expression below shows how to select the first div of the body element.

    //html/body/div[1]
    

    Unlike Python that uses zero-based indexing, the index in XPath starts at 1.

    • The double-slash “//” means to look at all the elements within the HTML code.
    • the html/body/div shows the path from the root to the tag we want to select

    XPath Wildcards

    XPath wildcards are special characters used to match one or multiple elements and attributes in markup documents. The two main wildcards used in XPath expressions are:

    • double slash (//)
    • the asterisk (*)

    Double-Slash (//)

    The double-slash is a wildcard that can be used to find all future generations of elements within the entire HTML. It is useful to find relative path.

    For example, this XPath would selects all the <p> tags within a div.

    //html/body/div/p
    

    Any <p> tag outside of a div, or inside a tag within a div would be excluded:

    You could work around that using the double-slash. The notation below would select any <p> tag within the HTML.

    //p
    

    Alternatively, you can restrict to a specific element. For instance, select all the <p> tags that are within a div:

    //html/body/div//p
    

    Asterisk (*)

    The asterisk is a wild card that can be used in XPath to match any element node HTML (or XML) document. For example, you can use the asterisk to select all the children of an element:

    //html/body/div/*
    

    Xpath Operators

    You can used logical and comparison operators in XPath expressions.

    • logical operators: and, or,
    • comparison operators: =, <, >

    Using these operators, you can be start creating more complex XPath expressions.

    Logical Operators

    Logical operators like “and” and “or” can be used to select elements that satisfy certain conditions.

    For example, if you have a div with 4 paragraphs and want to select the 1st and the last.

    <div>
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>
      <p>Paragraph 3</p>
      <p>Paragraph 4</p>
    </div>
    

    You could use the “and” logical operator in XPath to combine the expressions.

    The position() function call just checks the position of the selected node. We’ll learn more about function calls later in the tutorial.

    //div/p[position()=1 and position()=4]
    

    Comparison Operators

    Comparison operators like “=”, “>” and “<” can be used to select elements that satisfy certain conditions.

    Using the same structure of paragraph as above, you could select element for which the position is above 3:

    //div/p[position() > 3]
    

    Selecting Attributes in XPath

    To select the attribute of an HTML or XML tag, you can use the @ symbol in XPath.

    • @id: select tag id
    • @class: select class id
    • @href: select href attribute.
    • etc.

    This example shows how to use the square brackets notation to select the attributes of a tag.

    //html/body/div[@class='content']
    

    This expression selects all the <div> HTML tags that have a class attribute with “content” as its value.

    The “@class='content'” is the attribute value of the div tag. The @ represents “attribute”.

    Xpath Functions

    XPath has various functions that can be used to further enhance node navigation. Let’s check out some of the important function calls you can use in XPath

    • text()
    • contains(@attribute-name, “expression”)
    • count()
    • starts-with()
    • etc.

    Text()

    The text() function selects the text of an element. For example, the following XPath expression would return the entire HTML block.

    //h1
    

    Output:

    &lt;h1>My Title&lt;/h1>
    

    Using the text function allows you to select only the text of the HTML element.

    //h1/text()
    

    Output:

    My Title
    

    Contains()

    The contains() function checks if a the element contains a specified substring and uses this format:

    contains(@attribute-name, "expression")
    

    For example, to select all the elements that contain a string, e.g. all the links whose href attribute contains the string “contact”

    //a[contains(@href, 'contact')] 
    

    Count()

    The count() function allows you to count the nodes that match your XPath expression.

    For example, you can count the number of links on the page:

    count(//a)
    

    Starts-with()

    The starts-with() function checks if a string starts with something.

    For example, you can extract any <div> element for which the class starts with the string “this”.

     //div[starts-with(@class, 'this')]
    

    Python Scraping XPath with Scrapy

    Scrapy allows you to scrape content using XPath with the xpath() method from the Selector class.

    from scrapy import Selector
    html = '''<html>
        <head>
            <title>Title of your web page</title>
        </head>
        <body>
            <h1>Heading of the page</h1>
            <p id="first-paragraph" class="paragraph">Paragraph of text</p>
            <p class="paragraph">Paragraph of text 2</p>
            <div><p class="paragraph">Nested paragraph</p></div>
            <a href="/a-link">hyperlink</a>
        </body>
    </html>'''
    
    # Instantiate Selector
    sel = Selector(text=html)
    
    sel.xpath('//h1/text()').extract()
    
    ['Heading of the page']
    

    Top 50 Most Used XPath Notations in Web Scraping

    1. // – Selects nodes in the document from the current node that match the selection no matter where they are
    2. * – Selects all child nodes of the current node
    3. @ – Selects attributes of the current node
    4. .. – Selects the parent of the current node
    5. . – Selects the current node
    6. //tagname – Selects all nodes with the specified tagname
    7. //tagname[@attribute='value'] – Selects all nodes with the specified tagname and attribute value
    8. //tagname[contains(@attribute,'value')] – Selects all nodes with the specified tagname and attribute containing the specified value
    9. //tagname[@attribute1='value1' and @attribute2='value2'] – Selects all nodes with the specified tagname and multiple attributes
    10. //tagname[position()=1] – Selects the first occurrence of the specified tagname
    11. //tagname[last()] – Selects the last occurrence of the specified tagname
    12. //tagname[position()>1] – Selects all occurrences of the specified tagname except the first one
    13. //tagname[@attribute1='value1'][@attribute2='value2'] – Selects all nodes with the specified tagname and both attributes
    14. //tagname[@attribute1='value1' or @attribute2='value2'] – Selects all nodes with the specified tagname and either attribute
    15. //tagname[starts-with(@attribute,'value')] – Selects all nodes with the specified tagname and attribute starting with the specified value
    16. //tagname[ends-with(@attribute,'value')] – Selects all nodes with the specified tagname and attribute ending with the specified value
    17. //tagname[substring(@attribute,start,length)='value'] – Selects all nodes with the specified tagname and attribute substring matching the specified value
    18. //tagname/text() – Selects the text content of all nodes with the specified tagname
    19. //tagname[@attribute]/text() – Selects the text content of all nodes with the specified tagname and attribute
    20. //tagname[@attribute='value']/text() – Selects the text content of all nodes with the specified tagname and attribute value
    21. //tagname/following-sibling::siblingtagname – Selects all siblings after the current node with the specified sibling tagname
    22. //tagname/preceding-sibling::siblingtagname – Selects all siblings before the current node with the specified sibling tagname
    23. //tagname/child::childtagname – Selects all child nodes of the specified tagname with the specified child tagname
    24. //tagname/descendant::descendanttagname – Selects all descendant nodes of the specified tagname with the specified descendant tagname
    25. //tagname/ancestor::ancestortagname – Selects all ancestor nodes of the specified tagname with the specified ancestor tagname
    26. //tagname[count(child::*)=0] – Selects all nodes with the specified tagname that have no child nodes
    27. //tagname[count(child::*)>0] – Selects all nodes with the specified tagname that have at least one child node
    28. //tagname[count(attribute::*)=0] – Selects all nodes with the specified tagname that have no attributes
    29. //tagname[count(attribute::*)>0] – Selects all nodes with the specified tagname that have at least one attribute
    30. //tagname[not(tagname2)] – Selects all nodes with the specified tagname that do not have the specified tagname2 as a child node
    31. //tagname[not(attribute)] – Selects all nodes with the specified tagname that do not have any attributes
    32. //tagname[not(@attribute='value')] – Selects all nodes with the specified tagname that do not have the specified attribute value
    33. //tagname[position() mod 2 = 0] – Selects all even-indexed nodes with the specified tagname
    34. //tagname[position() mod 2 = 1] – Selects all odd-indexed nodes with the specified tagname
    35. //tagname[position() < n] – Selects the first n occurrences of nodes with the specified tagname
    36. //tagname[position() > n] – Selects all occurrences of nodes with the specified tagname after the first n occurrences
    37. //tagname[@attribute][1] – Selects the first occurrence of nodes with the specified tagname and attribute
    38. //tagname[@attribute][last()] – Selects the last occurrence of nodes with the specified tagname and attribute
    39. //tagname[@attribute][position()=n] – Selects the nth occurrence of nodes with the specified tagname and attribute
    40. //tagname[@attribute][position()=last()-n+1] – Selects the nth to last occurrence of nodes with the specified tagname and attribute
    41. //tagname[position()=1]/following::siblingtagname[1] – Selects the first occurrence of the specified sibling tagname following the first occurrence of the specified tagname
    42. //tagname[position()=1]/following::siblingtagname[position()<n+1] – Selects the first n occurrences of the specified sibling tagname following the first occurrence of the specified tagname
    43. //tagname[position()=1]/following::siblingtagname[position()>n-1] – Selects all occurrences of the specified sibling tagname after the first n occurrences following the first occurrence of the specified tagname
    44. //tagname[contains(text(),'value')] – Selects all nodes with the specified tagname containing the specified text value
    45. //tagname[starts-with(text(),'value')] – Selects all nodes with the specified tagname starting with the specified text value
    46. //tagname[ends-with(text(),'value')] – Selects all nodes with the specified tagname ending with the specified text value
    47. //tagname[matches(text(),'pattern')] – Selects all nodes with the specified tagname matching the specified regular expression pattern in the text content
    48. //tagname[contains(translate(@attribute,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'value')] – Selects all nodes with the specified tagname containing the specified case-insensitive attribute value
    49. //tagname[translate(@attribute,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='value'] – Selects all nodes with the specified tagname having the specified case-insensitive attribute value
    50. //tagname[translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='value'] – Selects all nodes with the specified tagname

    CSS Selector to XPath Conversion

    EquivalencyXPath NotationCSS Selector
    Select by element type//divdiv
    Select by class name//div[@class=”example”]div.example
    Select by ID//*[@id=”example”]#example
    Select by attribute//input[@name=”example”]input[name=”example”]
    Select by attribute value containing//input[contains(@class, “example”)]input[class*=”example”]
    Select by attribute value starting with//input[starts-with(@id, “example”)]input[id^=”example”]
    Select by attribute value ending with//a[ends-with(@href, “example”)]a[href$=”example”]
    Select by sibling//div/following-sibling::pdiv + p
    Select by descendant//div//pdiv p
    Select by first child//div/p[1]div > p:first-child
    Select by last child//div/p[last()]div > p:last-child

    Articles Related to Web Scraping

    Conclusion

    This is it we now have covered everything that you need to know about XPath in Web Scraping.

    Enjoyed This Post?