The Python pickle module is used to serialize (pickle) Python objects to bytes and deserialize those bytes back into objects.
It is very useful because it preserves the data types along with the data.
Pickle can be useful to:
- Read csv files faster
- Store results of a crawl
- Store machine learning trained models
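Because pickle stores the types along with the values, a round-trip returns the original Python objects rather than string representations:

```python
import pickle

# Pickle preserves Python types exactly: the round-trip returns
# the same int, float, and list objects, not strings.
data = {"count": 3, "ratio": 0.75, "tags": ["a", "b"]}
restored = pickle.loads(pickle.dumps(data))

print(type(restored["count"]).__name__)  # int
print(restored == data)  # True
```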
What is Pickle
Pickle files are a serialized file type native to Python, useful for storing data when the data types (int, str, float, …) are not obvious.
Serializing is the act of converting objects into a sequence of bytes (a bytestream).
Example Use Case of Pickle
Pickle can be used when you make a web crawler.
When you crawl a website, you request a web page and receive a status code, the full HTML of the page, the HTTP header, etc.
When you want to save your crawl in a CSV, you need to extract the data that you want from the HTML (title, href elements, h1, etc.). Then, you’d store each element in a new column and save the CSV file. If you forget something, it is gone and you need to recrawl your site.
By using Pickle, you can simply store the URL and the entire response as bytes. Then you can reprocess that whenever you want.
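A minimal sketch of that idea: store the raw response body keyed by URL, then reprocess it later without recrawling. The HTML bytes below are placeholders standing in for what a real crawler would receive.

```python
import pickle

# Placeholder for a real crawl: URL -> raw response body in bytes.
crawl = {
    "https://example.com/": b"<html><h1>Example</h1></html>",
}

# Store the whole thing as-is.
with open("crawl.pkl", "wb") as f:
    pickle.dump(crawl, f)

# Later: reload and extract whatever you need (titles, h1s, links...).
with open("crawl.pkl", "rb") as f:
    saved = pickle.load(f)

print(saved["https://example.com/"][:6])  # b'<html>'
```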
Beware
Pickle is unsafe, as it can allow code execution.
There is NO safe way to unpickle untrusted data. Do NOT unpickle a file that is not your own.
A good alternative is to store the data in JSON files instead. Only unpickle data that comes from your own code, and never load a pickle file whose origin you don't know.
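For simple data, the JSON alternative looks almost identical, and json.load() never executes code, so it is safe on untrusted files:

```python
import json

# JSON is a safe alternative for simple data: parsing a JSON file
# cannot trigger code execution the way unpickling can.
record = {"url": "https://example.com/", "status": 200}

with open("record.json", "w") as f:
    json.dump(record, f)

with open("record.json") as f:
    loaded = json.load(f)

print(loaded["status"])  # 200
```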
Save to pickle (Pickle Dump)
import pickle

with open('filename.pkl', 'wb') as f:
    pickle.dump(object_name, f)
Read Pickle
with open('filename.pkl', 'rb') as f:
    unpickled_file = pickle.load(f)
import pandas as pd
import pickle
import time

filename = 'largefile.csv'

start = time.time()
data = pd.read_csv(filename)
end = time.time()
print(f'Reading CSV took: {end - start} seconds')

with open('pickled_data.pkl', 'wb') as f:
    pickle.dump(data, f)

start = time.time()
with open('pickled_data.pkl', 'rb') as f:
    unpickled_file = pickle.load(f)
end = time.time()
print(f'Reading Pickle took: {end - start} seconds')
Read Pickle File to Pandas DataFrame
You can open a pickle file with a context manager, but you can also read it directly with the pandas library.
import pandas as pd
df = pd.read_pickle("./pickled_data.pkl")
import pickle
import time

pickled = 'pickled_data.pkl'

start = time.time()
data = pd.read_pickle(pickled)
end = time.time()
print(f'Reading Pickle with Pandas took: {end - start} seconds')

start = time.time()
with open(pickled, 'rb') as f:
    unpickled_file = pickle.load(f)
end = time.time()
print(f'Reading Pickle with open took: {end - start} seconds')
# Reading Pickle with Pandas.read_pickle() took: 8.135507822036743 seconds
# Reading Pickle with Open() took: 5.7462098598480225 seconds
Save Trained Machine Learning Model
# save the trained model
filename = 'trained_model.sav'
pickle.dump(model, open(filename, 'wb'))
# load the trained model
loaded_model = pickle.load(open(filename, 'rb'))
Compress Pickle Files
When working with large datasets, your pickled file will come to take a lot of space.
You will want to compress your pickle file.
You can compress a Pickle file using bzip2
or gzip
.
bzip2
is slowergzip
creates 2X larger files thanbzip2
import bz2
import pickle
with bz2.BZ2File('filename', 'w') as f:
pickle.dump(data, f)
PKL to CSV: Convert Pickle to CSV
You may want to convert a Pickle file to a CSV file. The simplest solution to convert .pkl
to .csv
is to use the to_csv()
method from the pandas library.
import pickle as pkl
import pandas as pd
with open("file.pkl", "rb") as f:
file = pkl.load(f)
df = pd.DataFrame(file)
df.to_csv(r'filename.csv')
CSV to Pickle
The simplest way to convert a CSV file to a Pickle file is to use the to_pickle
method from the pandas library.
import pickle
import pandas as pd
df = pd.read_csv('filename.csv')
df.to_pickle('filename.pkl')
Conclusion
This is the end of this tutorial on using Pickle with Python. Remember, use pickle safely. Even better, use a safer alternative like JSON.
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.