This tutorial shows how to do web scraping in Python with BeautifulSoup. You will use the requests library to fetch web pages and the BeautifulSoup library to parse their HTML.
In this tutorial, you will learn:
- The basics of web scraping and parsing
- How to install BeautifulSoup
- How to parse a local HTML file
- How to use Requests and BeautifulSoup to scrape and parse web pages
- Methods that you will need to find elements inside an HTML document
- Filters that you can use to improve matching inside the HTML
What is BeautifulSoup
BeautifulSoup is a parsing library in Python that is used to scrape information from HTML or XML. The BeautifulSoup parser provides Python idioms to search and modify the parse tree.
Simply put, BeautifulSoup is the library that allows you to format the HTML in a usable way and extract elements from it.
See the BeautifulSoup Documentation.
What is Web Scraping
Web scraping is the process of using a bot to extract data from a website and export it into a digestible format. A web scraper extracts the HTML code from a web page, which is then parsed to extract valuable information.
How is BeautifulSoup Useful in Web Scraping?
BeautifulSoup is a Python library that makes it simple to parse HTML or XML to extract valuable information from it.
What is Parsing in Web Scraping?
Parsing in web scraping is the process of transforming unstructured data into a structured format (e.g. parse tree) that is easier to read, use and extract data from.
Basically, parsing means splitting a document in usable chunks.
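To make the idea of splitting a document into usable chunks concrete, here is a minimal sketch that uses only Python's standard-library html.parser module (not BeautifulSoup) to break an HTML string into start tags, text and end tags. The ChunkCollector class and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser

class ChunkCollector(HTMLParser):
    """Hypothetical helper: records each piece of the document as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.chunks.append(('start', tag))

    def handle_data(self, data):
        self.chunks.append(('text', data))

    def handle_endtag(self, tag):
        self.chunks.append(('end', tag))

collector = ChunkCollector()
collector.feed('<p>I am learning <span>BeautifulSoup</span></p>')
collector.close()

# Print each chunk in document order
for chunk in collector.chunks:
    print(chunk)
```

BeautifulSoup goes one step further: it links chunks like these into a tree that you can search.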
Getting Started with BeautifulSoup
How to Install BeautifulSoup
Use pip to install BeautifulSoup in Python.
$ pip install beautifulsoup4
Simple Parsing with BeautifulSoup
Here is a simple example using the BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I am learning <span>BeautifulSoup</span></p>", 'html.parser')
soup.find('span')
<span>BeautifulSoup</span>
Web Scraping Best Practices
- Use toy websites to practice
- Use an API instead of web scraping when one is available
- Investigate if data is available elsewhere first (e.g. common crawl)
- Respect robots.txt
- Slow down your scraping speed
- Cache your requests
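Two of these practices, respecting robots.txt and slowing down, can be sketched with the standard library alone. The example below is only an illustration: the robots.txt content, the URLs and the my-tutorial-bot user agent are made up, and the rules are parsed from a string so that no request is sent.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from memory so that no request is made
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = 'my-tutorial-bot'  # hypothetical user agent name
urls = [
    'https://www.example.com/blog/',
    'https://www.example.com/private/report',
]

# Keep only the URLs that the rules allow for this user agent
allowed = [url for url in urls if parser.can_fetch(user_agent, url)]

# Use the declared crawl delay, or fall back to 1 second if there is none
delay = parser.crawl_delay(user_agent) or 1

for url in allowed:
    print('Would fetch:', url)
    time.sleep(delay)  # pause between requests
```

On a real site, you would point the parser at the site's own robots.txt with set_url() and read() instead of parsing a string.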
How to Parse HTML with BeautifulSoup
Follow these steps to parse HTML in BeautifulSoup:
- Install BeautifulSoup
Use pip to install BeautifulSoup
$ pip install beautifulsoup4
- Import the BeautifulSoup library in Python
To import BeautifulSoup in Python, import the BeautifulSoup class from the bs4 library.
from bs4 import BeautifulSoup
- Parse the HTML
To parse the HTML, create a BeautifulSoup object and pass the HTML to be parsed as the required first argument, followed by the name of the parser. The soup object will be a parsed version of the HTML.
soup = BeautifulSoup("<p>your HTML</p>", 'html.parser')
- Use BeautifulSoup’s object methods to pull information from the HTML
The BeautifulSoup library has many built-in methods to extract data from the HTML. Use methods like soup.find() or soup.find_all() to extract specific elements from the parsed HTML.
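The steps above can be sketched in a single short script. The sample markup is made up for the example, and 'html.parser' is the parser that ships with Python.

```python
# Step 1 (run once in a terminal): pip install beautifulsoup4
# Step 2: import the class
from bs4 import BeautifulSoup

# Step 3: parse the HTML (sample markup made up for this sketch)
sample_html = '<div><h1>Blog</h1><a href="/a">First</a><a href="/b">Second</a></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

# Step 4: pull information out of the parsed HTML
heading = soup.find('h1')      # first matching tag, or None if there is no match
links = soup.find_all('a')     # list-like result with every match

print(heading.text)
print([link['href'] for link in links])
```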
Parsing Your First HTML with BeautifulSoup
The first step of this BeautifulSoup tutorial is to parse this simple HTML.
Example HTML to be Parsed
html = '''
<html>
<head>
<title>Simple SEO Title</title>
<meta name="description" content="Meta Description with less than 300 characters.">
<meta name="robots" content="noindex, nofollow">
<link rel="alternate" href="https://www.example.com/en" hreflang="en-ca">
<link rel="canonical" href="https://www.example.com/fr">
</head>
<body>
<header>
<div class="nav">
<ul>
<li class="home"><a href="#">Home</a></li>
<li class="blog"><a class="active" href="#">Blog</a></li>
<li class="about"><a href="#">About</a></li>
<li class="contact"><a href="#">Contact</a></li>
</ul>
</div>
</header>
<div class="body">
<h1>Blog</h1>
<p>Lorem ipsum dolor <a href="#">Anchor Text Link</a> sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a laboriosam neque asperiores fugit sed aut optio earum!</p>
<h2>Subtitle</h2>
<p>Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsum vel laudantium a voluptas labore. Dolorum modi doloremque, dolore molestias quos nam a <a href="#" rel="nofollow">Nofollow link</a> laboriosam neque asperiores fugit sed aut optio earum!</p>
</div>
</body>
</html>'''
The html variable that we just created is similar to the output that we would get when scraping a web page. This is HTML, but stored as plain text.
In that form it is not very useful, because it is hard to search within it.
You could use regular expressions to parse the text content, but a better way is available: parsing with BeautifulSoup.
Parsing the HTML
To parse HTML with BeautifulSoup, call the BeautifulSoup constructor with the HTML to be parsed as the required argument and the name of the parser as an optional argument.
# Parsing an HTML string
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
This parses the HTML and returns a BeautifulSoup object.
We then retrieve any HTML element (the title tag in this case) from the BeautifulSoup object with the soup.find() method.
print(soup.find('title'))
<title>Simple SEO Title</title>
Parse a Local HTML File with BeautifulSoup
If you have an HTML file saved somewhere on your computer, you can also parse a local HTML File with BeautifulSoup.
# Parsing a local HTML file
from bs4 import BeautifulSoup
with open('/path/to/file.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup)
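If you do not have an HTML file at hand, the following sketch creates a temporary one and parses it. The file name and its content are made up, and passing encoding='utf-8' explicitly is a suggestion to avoid surprises with the platform's default encoding.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small HTML file to stand in for a page saved on disk
    file_path = Path(tmp_dir) / 'page.html'
    file_path.write_text(
        '<html><head><title>Café page</title></head></html>',
        encoding='utf-8',
    )

    # Open the file and hand the file object to BeautifulSoup
    with open(file_path, encoding='utf-8') as f:
        local_soup = BeautifulSoup(f, 'html.parser')

page_title = local_soup.find('title').text
print(page_title)
```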
Parsing a Web page with BeautifulSoup
To parse a web page with BeautifulSoup, fetch the HTML of the page using the Python requests library. There are several ways to get the HTML of a page: an HTTP request, a browser-based application, or a manual download from the web browser.
So far, we have only parsed HTML that we already had, which is not really web scraping… yet.
Now we will fetch a web page to extract its HTML and then parse the content with BeautifulSoup.
Practice Web Scraping on Crawler-test.com
For this tutorial, we will practice web scraping with BeautifulSoup on a toy website that was created for that purpose: crawler-test.com.
Extract Content From a Web Page with Requests
To extract content from a web page, make an HTTP request to a URL using the Python requests library.
# Making an HTTP Request
import requests
url = 'https://crawler-test.com/'
response = requests.get(url)
print('Status code: ', response.status_code)
print('Text: ', response.text[:50])
Parsing the Response with BeautifulSoup
To parse the response with BeautifulSoup, add the retrieved HTML as an argument of the BeautifulSoup constructor.
When fetching a web page with requests, a response object is returned. From that response, you can retrieve the HTML of the page, either as a Unicode string (response.text) or as bytes (response.content).
We have already seen how to parse this textual representation of the HTML with BeautifulSoup. All we need to do is pass response.text to the BeautifulSoup class.
from bs4 import BeautifulSoup
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract any HTML tag
soup.find('title')
<title>Crawler Test Site</title>
How to Extract HTML Tags with BeautifulSoup
Use BeautifulSoup’s find() and find_all() methods to extract HTML tags from the parsed HTML.
Some of the most common HTML tags that you will want to scrape are the title, the h1 and the links.
Find Elements by Tag Name
To find an HTML element by its tag name in BeautifulSoup, pass the tag name as an argument to the BeautifulSoup object’s find() or find_all() method.
Here is an example of how you can extract these tags with bs4.
# Extracting HTML tags
title = soup.find('title')
h1 = soup.find('h1')
links = soup.find_all('a', href=True)
# Print the outputs
print('Title: ', title)
print('h1: ', h1)
print('Example link: ', links[1]['href'])
Title: <title>Crawler Test Site</title>
h1: <h1>Crawler Test Site</h1>
Example link: /mobile/separate_desktop
Find Elements by ID
To find an HTML element by its ID in BeautifulSoup, pass the id name to the id parameter of the soup object’s find() method.
soup.find(id="header")
<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>
Find Elements by HTML Class Name
To find an HTML element by its class in BeautifulSoup, pass the tag name and a dictionary containing the class name to the soup object’s find_all() method.
soup.find_all('div', {'class': 'panel-header'})[0]
<div class="panel-header">
<h3>Mobile</h3>
</div>
You can then loop over the result to do whatever you want.
# Find Elements by Class
boxes = soup.find_all('div', {'class': 'panel'})
box_names = []
for box in boxes:
    title = box.find('h3')
    box_names.append(title.text)
box_names[:5]
['Mobile', 'Description Tags', 'Encoding', 'Titles', 'Robots Protocol']
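BeautifulSoup also accepts the class_ keyword argument as a shorter way to filter by class. The sketch below uses made-up markup to show that both forms return the same elements, and it skips boxes that have no h3 so that the loop does not fail on them.

```python
from bs4 import BeautifulSoup

# Made-up markup: three panels, one of them without a heading
panel_html = '''
<div class="panel"><div class="panel-header"><h3>Mobile</h3></div></div>
<div class="panel wide"><div class="panel-header"><h3>Titles</h3></div></div>
<div class="panel"><p>No heading here</p></div>
'''
panel_soup = BeautifulSoup(panel_html, 'html.parser')

with_dict = panel_soup.find_all('div', {'class': 'panel'})
with_keyword = panel_soup.find_all('div', class_='panel')  # same filter, shorter form

names = []
for box in with_keyword:
    heading = box.find('h3')
    if heading is not None:  # skip boxes without an h3
        names.append(heading.text)
print(names)
```

Note that the filter matches any element that has panel among its classes, which is why the "panel wide" box is included and the panel-header divs are not.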
Select Elements with CSS Selector
An alternative way to select elements by class name is to use the select() method on the BeautifulSoup object with a CSS selector.
# Soup select CSS selector
soup.select('div.panel-header')
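A detail that is easy to miss with CSS selectors is the space between the tag and the class. The following sketch, on made-up markup, compares the two spellings and also shows select_one(), which returns only the first match.

```python
from bs4 import BeautifulSoup

# Made-up markup: one div that has the class, one span that has it inside a div
css_html = '''
<div class="panel-header"><h3>Mobile</h3></div>
<div class="panel"><span class="panel-header">Nested</span></div>
'''
css_soup = BeautifulSoup(css_html, 'html.parser')

# No space: a div that itself has the class panel-header
direct = css_soup.select('div.panel-header')

# With a space: any element with the class that is nested inside a div
nested = css_soup.select('div .panel-header')

# select_one() returns the first match only (or None if nothing matches)
first_h3 = css_soup.select_one('div.panel-header > h3')

print([tag.name for tag in direct])
print([tag.name for tag in nested])
print(first_h3.text)
```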
How to Extract Text From HTML Elements
To extract text from an HTML element using BeautifulSoup, use the .text attribute on the element returned by find(). If the result is a list (e.g. found using find_all()), use a for loop to iterate over the elements and use the text attribute on each one.
logo = soup.find('a', {'id': 'logo'})
logo.text
'Crawler Test two point oh!'
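For the list case, a list comprehension is a compact alternative to the for loop. This sketch runs on made-up markup and uses get_text(strip=True), which removes the whitespace around each piece of text.

```python
from bs4 import BeautifulSoup

text_html = '''
<a id="logo">Crawler Test <span>two point oh!</span></a>
<ul><li> Home </li><li> Blog </li></ul>
'''
text_soup = BeautifulSoup(text_html, 'html.parser')

# Single element: .text joins the text of the tag and of its children
logo_text = text_soup.find('a', {'id': 'logo'}).text

# List of elements: read the text of each one
items = [li.get_text(strip=True) for li in text_soup.find_all('li')]

print(logo_text)
print(items)
```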
How to Extract Meta HTML Tags (with Attributes) in Bs4
To extract HTML tags using their attributes, pass a dictionary to the attrs parameter of the find() method.
Example attributes are the name attribute used in the meta description, or the href attribute used in a hyperlink.
Some HTML elements, such as meta tags, can only be told apart by their attributes.
Below, we will find the meta description and the meta robots tags using their name attribute.
# Parsing using HTML tag attributes
description = soup.find('meta', attrs={'name':'description'})
meta_robots = soup.find('meta', attrs={'name':'robots'})
print('description: ',description)
print('meta robots: ',meta_robots)
description: <meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>
meta robots: None
Here, the meta robots element was not available, so find() returned None.
Let’s see how to take care of unavailable tags.
Parsing Canonical with BeautifulSoup
Below, we will scrape the description, the meta robots and the canonical tags from the web page. If they are not available, we will return an empty string.
# Conditional parsing
canonical = soup.find('link', {'rel': 'canonical'})
# Extract the attribute if the tag is found
description = description['content'] if description else ''
meta_robots = meta_robots['content'] if meta_robots else ''
canonical = canonical['href'] if canonical else ''
# Print output
print('description: ', description)
print('meta_robots: ', meta_robots)
print('canonical: ', canonical)
description: Default description XIbwNE7SSUJciq0/Jyty
meta_robots:
canonical:
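When the same check is repeated for many tags, it can be moved into a small function. The get_attribute helper below is hypothetical (it is not part of BeautifulSoup) and the markup is made up; it returns an empty string when either the tag or the attribute is missing.

```python
from bs4 import BeautifulSoup

def get_attribute(soup, tag_name, attrs, attribute):
    """Hypothetical helper: return an attribute value, or '' if the tag or attribute is missing."""
    tag = soup.find(tag_name, attrs=attrs)
    if tag is None:
        return ''
    return tag.get(attribute, '')

head_html = '''
<head>
<meta name="description" content="A short description.">
<link rel="canonical" href="https://www.example.com/fr">
</head>
'''
head_soup = BeautifulSoup(head_html, 'html.parser')

description = get_attribute(head_soup, 'meta', {'name': 'description'}, 'content')
meta_robots = get_attribute(head_soup, 'meta', {'name': 'robots'}, 'content')
canonical = get_attribute(head_soup, 'link', {'rel': 'canonical'}, 'href')

print('description: ', description)
print('meta_robots: ', meta_robots)
print('canonical: ', canonical)
```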
How to Extract all Links on a Web Page
To extract all the links of a web page with BeautifulSoup, use the soup.find_all() method with the "a" tag as its argument. Set the href parameter to True to keep only the tags that have an href attribute.
soup.find_all('a', href=True)
Extracting the links of a web page is one of the most useful things to know in web scraping.
Below, we will see how to do that in practice.
In addition, we will learn how to use the urljoin function from urllib.parse to overcome one of the most common challenges of building a web crawler: taking care of relative and absolute URLs.
# Extract all links on the page
from urllib.parse import urlparse, urljoin
url = 'https://crawler-test.com/'
# Parse URL
parsed_url = urlparse(url)
domain = parsed_url.scheme + '://' + parsed_url.netloc
print('Domain root is: ', domain)
# Get href from all links
links = []
for link in soup.find_all('a', href=True):
    # join domain to path
    full_url = urljoin(domain, link['href'])
    links.append(full_url)
# Print output
print('Top 5 links are :\n', links[:5])
Domain root is: https://crawler-test.com
Top 5 links are :
['https://crawler-test.com/', 'https://crawler-test.com/mobile/separate_desktop', 'https://crawler-test.com/mobile/desktop_with_AMP_as_mobile', 'https://crawler-test.com/mobile/separate_desktop_with_different_h1', 'https://crawler-test.com/mobile/separate_desktop_with_different_title']
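To see how urljoin treats the different kinds of href values, here is a small standalone sketch. The page URL and the href values are made up, and the last step, which keeps only links on the same host, is an extra that a crawler often needs.

```python
from urllib.parse import urljoin, urlparse

page_url = 'https://crawler-test.com/mobile/index.html'
hrefs = [
    '/about',                        # root-relative path
    'details',                       # path relative to the current folder
    'https://www.example.com/page',  # already absolute
    '#top',                          # fragment on the current page
]

# Resolve every href against the URL of the page it was found on
resolved = [urljoin(page_url, href) for href in hrefs]

# Keep only the links that stay on the same host as the page
page_host = urlparse(page_url).netloc
internal = [link for link in resolved if urlparse(link).netloc == page_host]

for link in resolved:
    print(link)
```

Joining against the full page URL rather than the domain root matters for the second case, because a bare relative path is resolved from the folder of the current page.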
Extract Element on a Specific Part of the Page
To extract an element from a specific part of a page (e.g. using its id), assign the result of soup.find() to another variable and use one of the bs4 built-in methods on that object.
# Get div that contains a specific ID
subset = soup.find('div', {'id': 'idname'})
# Find all p inside that div
subset.find_all('p')
Example:
# Extract all links on the page
from urllib.parse import urljoin
domain = 'https://crawler-test.com/'
# Get div that contains a specific ID
menu = soup.find('div', {'id': 'header'})
# Find all links within the div
menu_links = menu.find_all('a', href=True)
# Print output
for link in menu_links[:5]:
    print(link['href'])
    print(urljoin(domain, link['href']))
/
https://crawler-test.com/
Remove Specific HTML Tags with BeautifulSoup
To remove a specific HTML tag from an HTML document with BeautifulSoup, use the decompose() method.
soup.tag_name.decompose()
Example
# Example HTML
wikipedia_text = '''
<p>In the United States, website owners can use three major <a href="/wiki/Cause_of_action" title="Cause of action">legal claims</a> to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the <a href="/wiki/Computer_Fraud_and_Abuse_Act" title="Computer Fraud and Abuse Act">Computer Fraud and Abuse Act</a> ("CFAA"), and (3) <a href="/wiki/Trespass_to_chattels" title="Trespass to chattels">trespass to chattel</a>.<sup id="cite_ref-6" class="reference"><a href="#cite_note-6">[6]</a></sup> However, the effectiveness of these claims relies upon meeting various criteria...</p>'''
# Parse HTML
wiki_soup = BeautifulSoup(wikipedia_text, 'html.parser')
# Get first paragraph
par = wiki_soup.find_all('p')[0]
# Get all links
par.find_all('a')
# Remove references tags from wikipedia
par.sup.decompose()
par.find_all('a')
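The effect of decompose() is easier to see on a shorter paragraph. In this sketch, with made-up markup, the link count and the text are checked after the sup tag is removed.

```python
from bs4 import BeautifulSoup

note_html = (
    '<p>Scraping has limits.'
    '<sup><a href="#cite_note-1">[1]</a></sup>'
    ' Check the rules first.</p>'
)
note_soup = BeautifulSoup(note_html, 'html.parser')
paragraph = note_soup.find('p')

links_before = len(paragraph.find_all('a'))
paragraph.sup.decompose()  # removes the <sup> tag and everything inside it
links_after = len(paragraph.find_all('a'))
clean_text = paragraph.get_text()

print(links_before, links_after)
print(clean_text)
```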
Remove All Script Tags with BeautifulSoup
from bs4 import BeautifulSoup
html_content = """
<html>
<head>
<script type="text/javascript">alert('Hello!');</script>
</head>
<body>
<h1>Hello, World!</h1>
<script type="text/javascript">console.log('Hello, World!');</script>
</body>
</html>
"""
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Find all <script> tags and remove them
for script in soup.find_all('script'):
    script.decompose()
Find Elements by Text Content
To find elements in the HTML using textual content, add the text to be matched as the value of the string parameter inside the find_all() method.
# Getting Elements using Text
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all('a', string="Description Tag Too Long")
[<a href="/description_tags/description_over_max">Description Tag Too Long</a>]
The problem with this is that the string has to be an exact match. Thus, matching with a partial string returns nothing:
# Getting Elements using Partial String
soup.find_all('a',string="Description")
[]
The workaround to this string matching issue is to pass a function or a regular expression to the string parameter. We will cover these two situations next.
Apply a Function to a BeautifulSoup Method
To apply a function inside of a BeautifulSoup method, pass the function to the string parameter of the find_all() method.
# Apply function to BeautifulSoup
def find_a_string(value):
    return lambda text: value in text
soup.find_all(string=find_a_string('Description Tag'))
['Description Tags',
'Description Tag With Whitespace',
'Description Tag Missing',
'Description Tag Missing With Meta Nosnippet',
'Description Tag Duplicate',
'Description Tag Duplicate',
'Noindex and Description Tag Duplicate',
'Noindex and Description Tag Duplicate',
'Description Tag Too Long']
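A function can also be passed as the first argument of find_all(), in which case it receives each tag rather than each string. This sketch uses made-up markup and a hypothetical filter named is_description_link to return the matching tags instead of their text.

```python
from bs4 import BeautifulSoup

filter_html = '''
<a href="/description_tags/too_long">Description Tag Too Long</a>
<a href="/titles/missing">Title Missing</a>
<a>Description without a link target</a>
'''
filter_soup = BeautifulSoup(filter_html, 'html.parser')

def is_description_link(tag):
    """Hypothetical filter: <a> tags with an href whose text mentions 'Description'."""
    return (
        tag.name == 'a'
        and tag.has_attr('href')
        and 'Description' in tag.get_text()
    )

matches = filter_soup.find_all(is_description_link)
hrefs = [tag['href'] for tag in matches]
print(hrefs)
```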
Parse HTML Page with Regular Expressions in BeautifulSoup
To use regular expressions to parse an HTML page with BeautifulSoup, import the re module, and assign a re.compile() object to the string parameter of the find_all() method.
from bs4 import BeautifulSoup
import re
# Parse using regex
soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all(string=re.compile('Description Tag'))
['Description Tags',
'Description Tag With Whitespace',
'Description Tag Missing',
'Description Tag Missing With Meta Nosnippet',
'Description Tag Duplicate',
'Description Tag Duplicate',
'Noindex and Description Tag Duplicate',
'Noindex and Description Tag Duplicate',
'Description Tag Too Long']
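Compiled regular expressions are not limited to the string parameter; they can also filter attribute values such as href. The sketch below uses made-up markup to show both uses, including a case-insensitive match.

```python
import re

from bs4 import BeautifulSoup

regex_html = '''
<a href="/mobile/separate_desktop">Separate desktop</a>
<a href="/description_tags/description_over_max">Description Tag Too Long</a>
<a href="/mobile/responsive">Responsive</a>
'''
regex_soup = BeautifulSoup(regex_html, 'html.parser')

# Links whose href starts with /mobile/
mobile_links = regex_soup.find_all('a', href=re.compile(r'^/mobile/'))
mobile_hrefs = [link['href'] for link in mobile_links]

# Strings that contain "description", whatever the casing
found = regex_soup.find_all(string=re.compile('description', re.IGNORECASE))
found_strings = [str(text) for text in found]

print(mobile_hrefs)
print(found_strings)
```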
How to Use XPath with BeautifulSoup (lxml)
To use XPath to extract elements from an HTML document, you need to install the lxml Python library, as BeautifulSoup does not support XPath expressions.
from lxml import html
# Parse HTML with XPath
content = html.fromstring(response.content)
panels = content.xpath('//*[@class="panel-header"]')
# get text from tags
[panel.find('h3').text for panel in panels][:5]
Find Parents, Children and Siblings of an HTML element
BeautifulSoup returns a parse tree, and you can move through each parent, child and sibling in the tree to find the elements that you want.
Find Parent(s) of an HTML Element
To find the parent of an HTML element, use the find_parent() method, which returns the direct parent of the element in the HTML tree.
a_child = soup.find_all('a')[0]
a_child.find_parent()
<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>
To find all the parents of an HTML element, use the find_parents() method on the element.
a_child.find_parents()
Find Child(ren) of an HTML Element
To find a single child of an HTML element, use the findChild() method (an older alias of find()), which returns the first tag found inside the element.
a_child = soup.find_all('a')[0]
a_child.findChild()
<span class="neon-effect">two point oh!</span>
To find all the children of an HTML element, use the findChildren() method (an older alias of find_all()) on the element, or loop over its children attribute.
a_child.findChildren()
#or
list(a_child.children)
To find all the descendants of an HTML element, use its descendants attribute.
list(a_child.descendants)
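The difference between children and descendants is easier to see on a tiny tree. In this sketch, with made-up markup, children stops at the direct children while descendants walks through every nested tag and string.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

tree_soup = BeautifulSoup('<div><p>One <b>bold</b></p><p>Two</p></div>', 'html.parser')
container = tree_soup.find('div')

# Direct children only: the two <p> tags
children = [child.name for child in container.children]

# Every nested node, tags and strings alike, in document order
descendants = [
    node.name if isinstance(node, Tag) else str(node)
    for node in container.descendants
]

print(children)
print(descendants)
```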
Find Sibling(s) of an HTML Element
To find the next sibling after an HTML element, use the find_next_sibling() method, which returns the next sibling in the HTML tree.
a_child = soup.find_all('a')[0]
a_child.find_next_sibling()
<div style="position:absolute;right:520px;top:-4px;"></div>
# All the following siblings, returned as a list
a_child.find_next_siblings()
To see where this comes from, take the parent element:
a_parent = soup.find('div',{'id':'header'})
a_parent
You get the HTML of the parent element, in which the next sibling of the <a> tag is the div shown in the previous result.
<div id="header">
<a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a>
<div style="position:absolute;right:520px;top:-4px;"></div>
</div>
Similarly, you can find the previous sibling(s) too.
a_child.find_previous_sibling()
a_child.find_previous_siblings()
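One thing to keep in mind when moving between siblings is that the whitespace between tags is also a node in the tree. The sketch below, on made-up markup, contrasts the next_sibling attribute, which can return such a string, with the find_next_sibling() method, which only returns tags.

```python
from bs4 import BeautifulSoup

list_html = '<ul>\n<li>Home</li>\n<li>Blog</li>\n<li>About</li>\n</ul>'
list_soup = BeautifulSoup(list_html, 'html.parser')
blog = list_soup.find('li', string='Blog')

raw_next = blog.next_sibling                        # the newline between the tags
next_tag = blog.find_next_sibling('li')             # skips strings, returns a tag
previous_tags = blog.find_previous_siblings('li')   # list of earlier <li> tags

print(repr(str(raw_next)))
print(next_tag.text)
print([tag.text for tag in previous_tags])
```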
Fix Broken HTML with BeautifulSoup
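With BeautifulSoup, you can take broken HTML and complete the missing parts. The parser repairs the markup while it builds the tree, and the prettify() method on the BeautifulSoup object prints the repaired result. The exact repair depends on the parser that is used: the html and body tags added in the output below are what the lxml parser produces.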
# Fix Broken HTML with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Incomplete Broken <span>HTML<</p>")
print(soup.prettify())
<html>
<body>
<p>
Incomplete Broken
<span>
HTML
</span>
</p>
</body>
</html>
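The same repair can be observed with the built-in html.parser, which closes the open tags but does not wrap the result in html and body tags. This is a minimal sketch on a made-up fragment.

```python
from bs4 import BeautifulSoup

# Neither the <span> nor the <p> is closed in the input
broken_soup = BeautifulSoup('<p>Unclosed <span>tag', 'html.parser')

# Converting the soup back to a string closes the open tags
repaired = str(broken_soup)
print(repaired)
```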
How to Use urllib and BeautifulSoup to Scrape a Web Page in Python
Urllib can be used in combination with Bs4 as an alternative to the Python requests library to retrieve information from the web in Python.
To scrape a web page with urllib and BeautifulSoup, use the urlopen() method from urllib.request and pass the decoded response to the BeautifulSoup class.
# Retrieve Webpage with Urllib BeautifulSoup
from bs4 import BeautifulSoup
from urllib.request import urlopen
r = urlopen('https://www.crawler-test.com')
html = r.read().decode("utf8")
soup = BeautifulSoup(html, 'html.parser')
soup.find('title')
How to use BeautifulSoup with Requests-HTML
BeautifulSoup can parse any HTML that you give it, so it can also parse a Requests-HTML response. To parse the HTML of the Requests-HTML object with BeautifulSoup, pass the response.html.raw_html attribute to the BeautifulSoup object.
# requests-html beautifulsoup
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = 'https://crawler-test.com/'
session = HTMLSession()
r = session.get(url)
soup = BeautifulSoup(r.html.raw_html, features='lxml')
soup.find('h1')
Write the Output of BeautifulSoup to HTML file in Python
Use the open() function to write the output of BeautifulSoup to an HTML file in Python. In addition, use the prettify() method on the BeautifulSoup object to get a cleanly indented output.
# write the output to html file with BeautifulSoup
with open('filename.html', 'w', encoding='utf-8') as f:
    pretty_soup = soup.prettify()
    f.write(pretty_soup)
BeautifulSoup Methods
When listing BeautifulSoup methods you will discover that method names are written in two different casings: camelCase and snake_case. The camelCase names come from earlier versions of BeautifulSoup and are kept for backward compatibility, while the snake_case names are the ones used in the current version (bs4).
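As a quick check of the two casings, the sketch below calls find_all() and its older spelling findAll() on made-up markup. Depending on the installed bs4 version, the camelCase call may emit a deprecation warning, so new code is better off with the snake_case name.

```python
from bs4 import BeautifulSoup

alias_soup = BeautifulSoup('<p><a href="/a">A</a><a href="/b">B</a></p>', 'html.parser')

snake_case = alias_soup.find_all('a')
camel_case = alias_soup.findAll('a')  # older spelling of the same method

print(list(snake_case) == list(camel_case))
```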
Table List of BeautifulSoup Methods
BeautifulSoup Method | Description |
---|---|
append() | Appends the given PageElement to the contents of this one. |
childGenerator() | Deprecated generator. |
clear() | Wipe out all children of this PageElement by calling extract() on them. |
currentTag() | A data structure representing a parsed HTML or XML document. |
decode() | Returns a string or Unicode representation of the parse tree as an HTML or XML document. |
decode_contents() | Renders the contents of this tag as a Unicode string. |
decompose() | Recursively destroys this PageElement and its children. |
encode() | Render a bytestring representation of this PageElement and its contents. |
encode_contents() | Renders the contents of this PageElement as a bytestring. |
endData() | Method called by the TreeBuilder when the end of a data segment occurs. |
extend() | Appends the given PageElements to this one’s contents. |
extract() | Destructively rips this element out of the tree. |
fetchNextSiblings() | Find all siblings of this PageElement that match the given criteria and appear later in the document. |
fetchParents() | Find all parents of this PageElement that match the given criteria. |
fetchPrevious() | Look backwards in the document from this PageElement and find all PageElements that match the given criteria. |
fetchPreviousSiblings() | Returns all siblings to this PageElement that match the given criteria and appear earlier in the document. |
find() | Look in the children of this PageElement and find the first PageElement that matches the given criteria. |
findAll() | Look in the children of this PageElement and find all PageElements that match the given criteria. |
findAllNext() | Find all PageElements that match the given criteria and appear later in the document than this PageElement. |
findAllPrevious() | Look backwards in the document from this PageElement and find all PageElements that match the given criteria. |
findChild() | Look in the children of this PageElement and find the first PageElement that matches the given criteria. |
findChildren() | Look in the children of this PageElement and find all PageElements that match the given criteria. |
findNext() | Find the first PageElement that matches the given criteria and appears later in the document than this PageElement. |
findNextSibling() | Find the closest sibling to this PageElement that matches the given criteria and appears later in the document. |
findNextSiblings() | Find all siblings of this PageElement that match the given criteria and appear later in the document. |
findParent() | Find the closest parent of this PageElement that matches the given criteria. |
findParents() | Find all parents of this PageElement that match the given criteria. |
findPrevious() | Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria. |
findPreviousSibling() | Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document. |
findPreviousSiblings() | Returns all siblings to this PageElement that match the given criteria and appear earlier in the document. |
find_all() | Look in the children of this PageElement and find all PageElements that match the given criteria. |
find_all_next() | Find all PageElements that match the given criteria and appear later in the document than this PageElement. |
find_all_previous() | Look backwards in the document from this PageElement and find all PageElements that match the given criteria. |
find_next() | Find the first PageElement that matches the given criteria and appears later in the document than this PageElement. |
find_next_sibling() | Find the closest sibling to this PageElement that matches the given criteria and appears later in the document. |
find_next_siblings() | Find all siblings of this PageElement that match the given criteria and appear later in the document. |
find_parent() | Find the closest parent of this PageElement that matches the given criteria. |
find_parents() | Find all parents of this PageElement that match the given criteria. |
find_previous() | Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria. |
find_previous_sibling() | Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document. |
find_previous_siblings() | Returns all siblings to this PageElement that match the given criteria and appear earlier in the document. |
format_string() | Format the given string using the given formatter. |
formatter_for_name() | Look up or create a Formatter for the given identifier, if necessary. |
get() | Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute. |
getText() | Get all child strings of this PageElement, concatenated using the given separator. |
get_attribute_list() | The same as get(), but always returns a list. |
get_text() | Get all child strings of this PageElement, concatenated using the given separator. |
handle_data() | Called by the tree builder when a chunk of textual data is encountered. |
handle_endtag() | Called by the tree builder when an ending tag is encountered. |
handle_starttag() | Called by the tree builder when a new tag is encountered. |
has_attr() | Does this PageElement have an attribute with the given name? |
has_key() | Deprecated method. This was kind of misleading because has_key() (attributes) was different from __in__ (contents). |
index() | Find the index of a child by identity, not value. |
insert() | Insert a new PageElement in the list of this PageElement’s children. |
insert_after() | This method is part of the PageElement API, but `BeautifulSoup` doesn’t implement it because there is nothing before or after it in the parse tree. |
insert_before() | This method is part of the PageElement API, but `BeautifulSoup` doesn’t implement it because there is nothing before or after it in the parse tree. |
new_string() | Create a new NavigableString associated with this BeautifulSoup object. |
new_tag() | Create a new Tag associated with this BeautifulSoup object. |
nextGenerator() | Generator to find the next element |
nextSiblingGenerator() | Generator to find the next sibling |
object_was_parsed() | Method called by the TreeBuilder to integrate an object into the parse tree. |
parentGenerator() | Generator to find the parent |
parserClass() | A data structure representing a parsed HTML or XML document. |
parser_class() | A data structure representing a parsed HTML or XML document. |
popTag() | Internal method called by _popToTag when a tag is closed. |
prettify() | Pretty-print this PageElement as a string. |
previousGenerator() | Generator to find the previous element |
previousSiblingGenerator() | Generator to find the previous Sibling |
pushTag() | Internal method called by handle_starttag when a tag is opened. |
recursiveChildGenerator() | Deprecated generator. |
renderContents() | Deprecated method for BS3 compatibility. |
replaceWith() | Replace this PageElement with one or more PageElements, keeping the rest of the tree the same. |
replaceWithChildren() | Replace this PageElement with its contents. |
replace_with() | Replace this PageElement with one or more PageElements, keeping the rest of the tree the same. |
replace_with_children() | Replace this PageElement with its contents. |
reset() | Reset this object to a state as though it had never parsed any markup. |
select() | Perform a CSS selection operation on the current element. |
select_one() | Perform a CSS selection operation on the current element. |
setup() | Sets up the initial relations between this element and other elements. |
smooth() | Smooth out this element’s children by consolidating consecutive strings. |
string_container() | |
unwrap() | Replace this PageElement with its contents. |
wrap() | Wrap this PageElement inside another one. |
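Several methods in this table modify the tree rather than search it. As a brief illustration on made-up markup, this sketch combines unwrap(), new_tag() and append().

```python
from bs4 import BeautifulSoup

edit_soup = BeautifulSoup('<p>I am <b>bold</b> text</p>', 'html.parser')

# unwrap(): keep the text of the <b> tag but drop the tag itself
edit_soup.b.unwrap()

# new_tag() and append(): create a link and add it at the end of the <p>
link = edit_soup.new_tag('a', href='https://www.example.com/')
link.string = 'a link'
edit_soup.p.append(link)

result = str(edit_soup)
print(result)
```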
Conclusion
We have covered the main things that you need to know about web scraping with BeautifulSoup. Good luck!
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.