Python Pickle

The Python pickle module serializes (pickles) Python objects to bytes and deserializes (unpickles) those bytes back into objects.

It is very useful because it preserves the data types along with the data.

Pickle can be useful to:


  • Read CSV files faster
  • Store the results of a crawl
  • Store trained machine learning models

What is Pickle

Pickle files are a serialized file format native to Python. They are useful for storing data whose types (int, str, float, …) are not obvious from the stored values alone.

Serializing is the act of converting objects into a sequence of bytes (a byte stream).
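As a small illustration of that round trip, the sketch below pickles a dictionary to bytes and restores it. The record and its values are made up for this example.

```python
import pickle
from datetime import date

# Hypothetical record mixing several Python types
record = {
    "id": 42,                         # int
    "score": 0.87,                    # float
    "tags": ("seo", "python"),        # tuple
    "seen": {200, 301},               # set
    "crawled_on": date(2024, 1, 15),  # date
}

# Serialize: the object becomes a bytes object (the byte stream)
payload = pickle.dumps(record)
print(type(payload))

# Deserialize: the bytes become an equal object with the same types
restored = pickle.loads(payload)
print(restored == record)
print(type(restored["tags"]), type(restored["crawled_on"]))
```

The tuple, the set and the date come back as a tuple, a set and a date, which is what "keeping the data type" means in practice.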

Example Use Case of Pickle

Pickle can be used when you make a web crawler.

When you crawl a website, you request a web page and receive a status code, the full HTML of the page, the HTTP header, etc.

When you want to save your crawl in a CSV, you need to extract the data that you want from the HTML (title, href elements, h1, etc.). Then, you’d store each element in a new column and save the CSV file. If you forget something, it is gone and you need to recrawl your site.

By using Pickle, you can simply store the URL and the entire response as bytes. Then you can reprocess that whenever you want.
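A minimal sketch of that workflow is shown below. To stay runnable without a network connection, it uses a hand-written dictionary as a stand-in for the response; the URL, the headers and the HTML are assumptions made for the example, and a real crawler would fill them from its HTTP responses.

```python
import os
import pickle
import re
import tempfile

# Stand-in for what a crawler gets back (hypothetical values)
crawl = {
    "https://example.com/": {
        "status_code": 200,
        "headers": {"Content-Type": "text/html; charset=utf-8"},
        "html": b"<html><head><title>Example</title></head>"
                b"<body><h1>Hello</h1></body></html>",
    }
}

path = os.path.join(tempfile.mkdtemp(), "crawl.pkl")

# Store the URL and the entire response
with open(path, "wb") as f:
    pickle.dump(crawl, f)

# Later: reload your own file and extract what you need, without recrawling
with open(path, "rb") as f:
    stored = pickle.load(f)

titles = {}
for url, response in stored.items():
    html = response["html"].decode("utf-8")
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    titles[url] = match.group(1) if match else None

print(titles)
```

If you later decide that you also need the h1 or the headers, you can run another extraction on the stored data instead of requesting the pages again.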

Beware

Pickle is unsafe, as it can allow code execution.

There is NO safe way to unpickle untrusted data. Do NOT unpickle a file that is not your own.

A safer alternative is to store the data in JSON files instead. Use Pickle only with files produced by your own code, and never load a pickle whose origin you don’t know.
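To make the warning concrete, here is a deliberately harmless sketch of why loading a pickle is more than reading data: the pickle itself names a callable, and pickle.loads() calls it. The Sneaky class is hypothetical and only calls the built-in int; a malicious file could name any other callable instead. The second half shows the JSON route for plain data.

```python
import json
import pickle


class Sneaky:
    """Hypothetical class: its pickle tells the loader which callable to run."""

    def __reduce__(self):
        # Harmless here, int("42"), but this could be any callable with any arguments
        return (int, ("42",))


payload = pickle.dumps(Sneaky())
result = pickle.loads(payload)  # calls int("42") instead of rebuilding a Sneaky
print(type(result), result)

# JSON only describes data (dict, list, str, numbers, bool, None);
# loading it does not call anything named in the file
page = {"url": "https://example.com/", "status_code": 200, "links": ["/a", "/b"]}
text = json.dumps(page)
print(json.loads(text) == page)
```

The trade-off is that JSON only covers basic types, so objects such as sets or dates need to be converted before saving.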

Save to pickle

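import pickle

# object_name is any picklable Python object (dict, list, DataFrame, ...)
with open('filename.pkl', 'wb') as f:  # 'wb': pickle writes bytes
    pickle.dump(object_name, f)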

Read Pickle

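# Only load pickle files that you created yourself
with open('filename.pkl', 'rb') as f:  # 'rb': pickle reads bytes
    unpickled_file = pickle.load(f)

Reading data back from a pickle is usually faster than parsing the CSV file again. The following script times both:

import pandas as pd
import pickle
import time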

filename = 'largefile.csv'

start = time.time()
data = pd.read_csv(filename)
end = time.time()
print(f'Reading CSV took: {end - start} seconds')

with open('pickled_data.pkl', 'wb') as f:
    pickle.dump(data, f)

start = time.time()
with open('pickled_data.pkl', 'rb') as f:
    unpickled_file = pickle.load(f)
end = time.time()
print(f'Reading Pickle took: {end - start} seconds')

Read Pickle File to Pandas DataFrame

import pandas as pd

df = pd.read_pickle("./pickled_data.pkl")

You can compare pandas.read_pickle() with a plain pickle.load():

import pickle
import time

pickled = './pickled_data.pkl'  # path to the pickle file created earlier

start = time.time()
data = pd.read_pickle(pickled)
end = time.time()
print(f'Reading Pickle with Pandas took: {end - start} seconds')

start = time.time()
with open(pickled, 'rb') as f:
    unpickled_file = pickle.load(f)
end = time.time()
print(f'Reading Pickle with open took: {end - start} seconds')

# Output of one run (timings will vary):
# Reading Pickle with Pandas took: 8.135507822036743 seconds
# Reading Pickle with open took: 5.7462098598480225 seconds

Save Trained Machine Learning Model

import pickle

# save the trained model (model is assumed to be an already trained model object)
filename = 'trained_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load the trained model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

Compress Pickle Files

When working with large datasets, your pickled files can take up a lot of disk space, so you may want to compress them.

You can compress a Pickle file using bzip2 or gzip.

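  • bzip2 is slower, but creates smaller files
  • gzip is faster, but creates files roughly 2X larger than bzip2 (the exact ratio depends on the data)

import bz2
import pickle

# data is the object to store; BZ2File compresses the bytes as they are written
with bz2.BZ2File('filename', 'w') as f:
    pickle.dump(data, f)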
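To check the trade-off on your own data, a sketch like the one below writes the same object uncompressed, with bzip2 and with gzip, prints the file sizes and reads the compressed files back. The sample data and file names are made up for the example, and the sizes you get will depend on what you store.

```python
import bz2
import gzip
import os
import pickle
import tempfile

# Hypothetical, highly repetitive data; real datasets will compress differently
data = [{"url": f"https://example.com/page-{i}", "status_code": 200}
        for i in range(10_000)]

folder = tempfile.mkdtemp()
raw_path = os.path.join(folder, "data.pkl")
bz2_path = os.path.join(folder, "data.pkl.bz2")
gzip_path = os.path.join(folder, "data.pkl.gz")

# Write the same object three times: uncompressed, bzip2 and gzip
with open(raw_path, "wb") as f:
    pickle.dump(data, f)
with bz2.open(bz2_path, "wb") as f:
    pickle.dump(data, f)
with gzip.open(gzip_path, "wb") as f:
    pickle.dump(data, f)

for path in (raw_path, bz2_path, gzip_path):
    print(os.path.basename(path), os.path.getsize(path), "bytes")

# Reading back uses the same module that wrote the file
with bz2.open(bz2_path, "rb") as f:
    from_bz2 = pickle.load(f)
with gzip.open(gzip_path, "rb") as f:
    from_gzip = pickle.load(f)

print(from_bz2 == data and from_gzip == data)
```

Compression does not make a pickle any safer to load: the same rule about only opening your own files applies.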

Conclusion

This is the end of this tutorial on using Pickle with Python. Remember, use pickle safely. Even better, use a safer alternative like JSON.
