Python Requests Library: Full Tutorial

The Python requests library is one of the most widely used libraries for making HTTP requests in Python.

In this tutorial, you will learn how to:

  • Understand the structure of a request
  • Make GET and POST requests
  • Read and extract elements of the HTML of a web page
  • Improve your requests

Install Packages

For this guide, you will need Python installed, along with the following packages. The urllib module used later in this tutorial is part of Python's standard library, so it does not need to be installed separately.


$ pip install requests
$ pip install beautifulsoup4

Requests Methods

  • get: Request data
  • post: Send data
  • put: Replace data
  • patch: Make partial changes to the data
  • delete: Delete data
  • head: Similar to a get request, but without the response body
  • request: Create a request by specifying the HTTP method to use
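Most of this tutorial focuses on get and post. As a quick point of reference, here is a minimal sketch of the other methods, using httpbin.org (also used later in this tutorial) as a test server; the /put, /patch, /delete and /get paths are httpbin endpoints, not part of the requests library.

import requests

base = 'https://httpbin.org'

r_put = requests.put(base + '/put', data={'name': 'Jean-Christophe'})          # replace data
r_patch = requests.patch(base + '/patch', data={'website': 'jcchouinard.com'}) # partial update
r_delete = requests.delete(base + '/delete')                                   # delete data
r_head = requests.head(base)                                                   # like get, but no body

# requests.request() is the generic form: the HTTP method is passed as a string
r_generic = requests.request('GET', base + '/get')

print(r_put.status_code, r_patch.status_code, r_delete.status_code, r_generic.status_code)
print('HEAD response has no body:', len(r_head.content) == 0)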

Get Requests

import requests

url = 'https://crawler-test.com/'
response = requests.get(url)

print('URL: ', response.url)
print('Status code: ', response.status_code)
print('HTTP header: ', response.headers)

Output:

URL:  https://crawler-test.com/
Status code:  200
HTTP header:  {'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Sun, 03 Oct 2021 23:41:59 GMT', 'Server': 'nginx/1.10.3', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8098', 'Connection': 'keep-alive'}

Post Requests

import requests

payload = {
    'name':'Jean-Christophe',
    'last_name':'Chouinard',
    'website':'https://www.jcchouinard.com/'
    }

url = 'https://httpbin.org/post'
response = requests.post(url, data=payload)

response.json()

Output:

{'args': {},
 'data': '',
 'files': {},
 'form': {'last_name': 'Chouinard',
  'name': 'Jean-Christophe',
  'website': 'https://www.jcchouinard.com/'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '85',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.24.0',
  'X-Amzn-Trace-Id': 'Root=1-615a4271-417e9fff3c75f47f3af9fde2'},
 'json': None,
 'origin': '149.167.130.162',
 'url': 'https://httpbin.org/post'}

Response Methods and Attributes

The response object contains the server’s response to the HTTP request.


You can investigate the details of the Response object by using help().

import requests

url = 'https://crawler-test.com/'
response = requests.get(url)

help(response)

In this tutorial, we will look at the following:

  • text (data descriptor): content of the response, in Unicode
  • content (data descriptor): content of the response, in bytes
  • url (attribute): URL of the request
  • status_code (attribute): status code returned by the server
  • headers (attribute): HTTP headers returned by the server
  • history (attribute): list of Response objects holding the history of the request (redirects)
  • links (attribute): the parsed header links of the response, if any
  • json (method): returns the JSON-encoded content of the response, if any

Access the Response Methods and Attributes

The response from the request is an object whose methods and attributes you can access.

You can access the attributes using the object.attribute notation and the methods using the object.method() notation.

import requests

url = 'http://archive.org/wayback/available?url=jcchouinard.com'
response = requests.get(url)

response.text # access response data attributes and descriptors
response.json() # access response methods

Output:

{'url': 'jcchouinard.com',
 'archived_snapshots': {'closest': {'status': '200',
   'available': True,
   'url': 'http://web.archive.org/web/20210930032915/https://www.jcchouinard.com/',
   'timestamp': '20210930032915'}}}

Process the Response

Show Status Code

import requests

url = 'https://crawler-test.com/'
r = requests.get(url)

r.status_code
# 200

Get HTML of the page

import requests

url = 'https://crawler-test.com/'
r = requests.get(url)

r.text # get content as a string
r.content # get content as bytes

Show HTTP header

import requests

url = 'https://crawler-test.com/'
r = requests.get(url)
r.headers

Output:

{'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Tue, 05 Oct 2021 04:23:27 GMT', 'Server': 'nginx/1.10.3', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8099', 'Connection': 'keep-alive'}

Show redirections

import requests

url = 'https://crawler-test.com/redirects/redirect_chain_allowed'
r = requests.get(url)

for redirect in r.history:
    print(redirect.url, redirect.status_code)
print(r.url, r.status_code)

Output:

https://crawler-test.com/redirects/redirect_chain_allowed 301
https://crawler-test.com/redirects/redirect_chain_disallowed 301
https://crawler-test.com/redirects/redirect_target 200

Parse the HTML with Requests and BeautifulSoup

Parsing with BeautifulSoup

from bs4 import BeautifulSoup
import requests

# Make the request
url = 'https://crawler-test.com/'
r = requests.get(url)

r.text[:500]

You can see that the raw text is a string that is hard to interpret.

'<!DOCTYPE html>\n<html>\n  <head>\n    <title>Crawler Test Site</title>\n    \n      <meta content="en" HTTP-EQUIV="content-language"/>\n         \n    <link type="text/css" href="/css/app.css" rel="stylesheet"/>\n    <link type="image/x-icon" href="/favicon.ico?r=1.6" rel="icon"/>\n    <script type="text/javascript" src="/bower_components/jquery/jquery.min.js"></script>\n    \n      <meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>\n    \n\n    \n        <link rel="alternate" media'
# Parse the HTML
soup = BeautifulSoup(r.text, 'html.parser')
soup

Output:

<!DOCTYPE html>

<html>
<head>
<title>Crawler Test Site</title>
<meta content="en" http-equiv="content-language"/>
<link href="/css/app.css" rel="stylesheet" type="text/css"/>
...
</html>

The output is easier to interpret now that it has been parsed with BeautifulSoup.


You can extract a tag using the find() or find_all() methods.

soup.find('title')

Output:

<title>Crawler Test Site</title>

soup.find_all('meta')

Output:

[<meta content="en" http-equiv="content-language"/>,
 <meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>,
 <meta content="nositelinkssearchbox" name="google"/>,
 <meta content="0H-EBys8zSFUxmeV9xynoMCMePTzkUEL_lXrm9C4a8A" name="google-site-verification"/>]

You can also select a tag by filtering on its attributes.

soup.find('meta', attrs={'name':'description'})

Output:

<meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>

Getting main SEO tags from a webpage

from bs4 import BeautifulSoup
import requests

# Make the request
url = 'https://crawler-test.com/'
r = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(r.text, 'html.parser')

# Get the HTML tags
title = soup.find('title')
h1 = soup.find('h1')
description = soup.find('meta', attrs={'name':'description'})
meta_robots =  soup.find('meta', attrs={'name':'robots'})
canonical = soup.find('link', {'rel': 'canonical'})

# Get the text from the HTML tags
title = title.get_text() if title else ''
h1 = h1.get_text() if h1 else ''
description = description['content'] if description else ''
meta_robots =  meta_robots['content'] if meta_robots else ''
canonical = canonical['href'] if canonical else ''

# Print the tags
print('Title: ', title)
print('h1: ', h1)
print('description: ', description)
print('meta_robots: ', meta_robots)
print('canonical: ', canonical)

Output:

Title:  Crawler Test Site
h1:  Crawler Test Site
description:  Default description XIbwNE7SSUJciq0/Jyty
meta_robots:  
canonical:  

Extracting all the links on a page

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

url = 'https://crawler-test.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    full_url = urljoin(url, link['href']) # join domain to path
    links.append(full_url)

# Show 5 links
links[:5]

Output:

['https://crawler-test.com/',
 'https://crawler-test.com/mobile/separate_desktop',
 'https://crawler-test.com/mobile/desktop_with_AMP_as_mobile',
 'https://crawler-test.com/mobile/separate_desktop_with_different_h1',
 'https://crawler-test.com/mobile/separate_desktop_with_different_title']

Improve the Request

Handle Errors

import requests

url = 'bad url'

try:
    r = requests.get(url)
except Exception as e:
    print(f'There was an error: {e}')

Output:

There was an error: Invalid URL 'bad url': No schema supplied. Perhaps you meant http://bad url?
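Note that requests only raises an exception for problems like an invalid URL or a connection failure; a 404 or 500 response is returned normally with its status code. If you also want HTTP error codes to raise an exception, you can call raise_for_status() on the response. Here is a minimal sketch, using the httpbin.org /status/404 endpoint (an httpbin convention, not part of requests) to simulate an error response.

import requests

url = 'https://httpbin.org/status/404' # endpoint that always returns a 404

try:
    r = requests.get(url)
    r.raise_for_status() # raises requests.exceptions.HTTPError for 4xx and 5xx responses
except requests.exceptions.HTTPError as e:
    print(f'HTTP error: {e}')
except requests.exceptions.RequestException as e:
    print(f'There was an error: {e}')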

Change User-Agent

import requests 

url = 'https://www.reddit.com/r/python/top.json?limit=1&t=day'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}

r = requests.get(url, headers=headers)

Add Timeout to request

import requests

url = 'http://httpbin.org/basic-auth/user/pass'

try:
    r = requests.get(url, timeout=0.1)
except Exception as e:
    print(e)

r.status_code

Output:

HTTPConnectionPool(host='httpbin.org', port=80): Max retries exceeded with url: /basic-auth/user/pass (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb03a7fa290>, 'Connection to httpbin.org timed out. (connect timeout=0.1)'))
401

Note that when the timeout triggers, the exception is raised and no response object is assigned, so r.status_code only works if a previous run of the request completed within the timeout; the 401 above comes from such a run, since the endpoint requires authentication.
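If you want to handle timeouts separately from other errors, you can catch requests.exceptions.Timeout. Here is a minimal sketch that uses the httpbin.org /delay endpoint (an httpbin convention) to force a slow response.

import requests

url = 'https://httpbin.org/delay/3' # endpoint that waits 3 seconds before answering

try:
    r = requests.get(url, timeout=1) # give up if the server takes more than 1 second
    print(r.status_code)
except requests.exceptions.Timeout:
    print('The request timed out')
except requests.exceptions.RequestException as e:
    print(f'There was an error: {e}')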

Use Proxies

import requests 

url = 'https://crawler-test.com/'

proxies = {
    'http': 'http://128.199.237.57:8080',
    'https': 'http://128.199.237.57:8080' # the https key is needed for https:// URLs like the one above
}

r = requests.get(url, proxies=proxies)

Add Headers to Requests

import requests 

url = 'http://httpbin.org/headers'

access_token = {
    'Authorization': 'Bearer {access_token}' # replace {access_token} with your actual token
    }

r = requests.get(url, headers=access_token)
r.json()

Output:

{'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Authorization': 'Bearer {access_token}',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.24.0',
  'X-Amzn-Trace-Id': 'Root=1-615aa0b3-5c680dcb575d50f22e9565eb'}}

Requests Session

The session object is useful when you need to make requests with parameters that persist through all the requests in a single session.

import requests

session = requests.Session()

url = 'https://httpbin.org/headers'

access_token = {
    'Authorization': 'Bearer {access_token}'
    }

session.headers.update(access_token)

r1 = session.get(url)
r2 = session.get(url)

print('r1: ', r1.json()['headers']['Authorization'])
print('r2: ', r2.json()['headers']['Authorization'])

Output:

r1:  Bearer {access_token}
r2:  Bearer {access_token}


Conclusion

This concludes our introduction to the Python requests library. The library is useful not only for web scraping, but also for web development and any other work that involves APIs.

If you are looking for an alternative to requests, you may be interested in the requests-html library, which provides built-in HTML parsing options.
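As a minimal sketch of what that looks like, assuming requests-html has been installed with pip install requests-html:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://crawler-test.com/')

# Parse the HTML without BeautifulSoup
print(r.html.find('title', first=True).text)

# Absolute URLs of the links on the page
print(list(r.html.absolute_links)[:5])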