XML sitemaps help Google discover the URLs of your website. Some free tools let you create XML sitemaps, but they are usually limited to around 500 URLs. Paid tools can generate extra-large sitemaps, but Python lets you do the same for free.
In this post, I will show you how to create a sitemap.xml file using Python and split it into files with fewer than 50,000 URLs each, the maximum allowed by the sitemap protocol.
I will also Gzip the XML sitemaps, since files with 50,000 URLs can be heavy to serve and Google can process Gzip-compressed sitemaps.
Best Option to Build an XML Sitemap in Python
The best way to build an XML Sitemap in Python is to use the Pandas to_xml() method.
import pandas as pd

# List of URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Create a DataFrame with a "loc" column, the tag the sitemap protocol requires
df = pd.DataFrame(urls, columns=["loc"])

# Convert the DataFrame to XML, dropping the index
# and declaring the sitemap namespace on the urlset root
xml_data = df.to_xml(
    root_name="urlset",
    row_name="url",
    index=False,
    namespaces={"": "http://www.sitemaps.org/schemas/sitemap/0.9"},
    xml_declaration=True
)

# Print the output
print(xml_data)

# Save the XML data to a file
with open("sitemap.xml", "w") as file:
    file.write(xml_data)
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
  </url>
  <url>
    <loc>https://example.com/page3</loc>
  </url>
</urlset>
Thanks to Erik Heiken for bringing awareness to building a sitemap with the pandas to_xml() method.
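If your site has more than 50,000 URLs, you can combine to_xml() with simple chunking and write Gzip-compressed files directly. Here is a minimal sketch, assuming your URLs are already in a DataFrame named df with a single loc column, as above; pandas infers Gzip compression from the .xml.gz extension.

# Split the DataFrame into chunks of at most 50,000 URLs
# and write each chunk as a Gzip-compressed sitemap
n = 50000
for i in range(0, len(df), n):
    chunk = df.iloc[i:i+n]
    chunk.to_xml(
        f"sitemap_{i // n}.xml.gz",  # compression inferred from the .gz extension
        root_name="urlset",
        row_name="url",
        index=False,
        namespaces={"": "http://www.sitemaps.org/schemas/sitemap/0.9"},
        xml_declaration=True
    )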
Generate XML Sitemap with jinja2
The code below may not be the best solution for the job anymore.
A special thanks to Hamlet Batista for showing me how to do this.
Read: Reorganizing XML Sitemaps with Python for Fun & Profit
You can download the full code from my GitHub repository.
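Note that the script assumes list_of_urls.csv contains a single column of URLs with a header row, something like the sample below. The column name itself does not matter, since the template reads rows by position.

loc
https://example.com/page1
https://example.com/page2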
import pandas as pd
import datetime
from jinja2 import Template
import gzip

# Import the list of URLs
list_of_urls = pd.read_csv('list_of_urls.csv')

# Set the maximum number of URLs per sitemap (the protocol allows up to 50,000)
n = 50000

# Split the DataFrame into chunks with the maximum number of rows specified
# (.copy() prevents a pandas SettingWithCopyWarning when we add columns below)
chunks = [list_of_urls[i:i+n].copy() for i in range(0, list_of_urls.shape[0], n)]

# Create a sitemap template to populate
sitemap_template = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for page in pages %}
  <url>
    <loc>{{page[1]|safe}}</loc>
    <lastmod>{{page[2]}}</lastmod>
    <changefreq>{{page[3]}}</changefreq>
    <priority>{{page[4]}}</priority>
  </url>
{% endfor %}
</urlset>'''

template = Template(sitemap_template)

# Get today's date to add as lastmod
lastmod_date = datetime.datetime.now().strftime('%Y-%m-%d')

# Fill the sitemap template and write one Gzipped file per chunk
for i, chunk in enumerate(chunks):
    chunk['lastmod'] = lastmod_date    # add lastmod date
    chunk['changefreq'] = 'daily'      # add changefreq
    chunk['priority'] = '1.0'          # add priority
    # Render each row into the sitemap template
    # (itertuples yields (index, url, lastmod, changefreq, priority))
    sitemap_output = template.render(pages=chunk.itertuples())
    # Create a filename for each sitemap: sitemap_0.xml.gz, sitemap_1.xml.gz, etc.
    filename = 'sitemap_' + str(i) + '.xml.gz'
    # Write the Gzip-compressed file to your working folder
    with gzip.open(filename, 'wt', encoding='utf-8') as f:
        f.write(sitemap_output)
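To sanity-check the result, you can read one of the compressed files back; a quick check, assuming the first file was named sitemap_0.xml.gz by the script above:

import gzip

# Read a generated sitemap back to verify its contents
with gzip.open('sitemap_0.xml.gz', 'rt', encoding='utf-8') as f:
    print(f.read()[:300])  # print the first few hundred characters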
Other Technical SEO Guides With Python
- Find Rendering Problems On Large Scale Using Python + Screaming Frog
- Recrawl URLs Extracted with Screaming Frog (using Python)
- Find Keyword Cannibalization Using Google Search Console and Python
- Get BERT Score for SEO
- Web Scraping With Python and Requests-HTML
- Randomize User-Agent With Python and BeautifulSoup
- Create a Simple XML Sitemap With Python
- Web Scraping with Scrapy and Python
That's it! You have now created XML sitemaps, split into groups of fewer than 50,000 URLs, using Python.