Description
Code Sample (copy-pastable)
from __future__ import division, print_function
import pandas as pd
import numpy as np
import os
import gc
import psutil


def log_memory(label):
    # Run a collection for each GC generation before measuring,
    # so pending garbage is not counted as usage.
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    mem_usage = process.memory_info().rss / float(2 ** 20)  # RSS in MB
    print("[Memory usage] {:<25s} {:12.1f} MB".format(
        label, mem_usage
    ))


def generate_test_data(num_partitions=20):
    for i in range(num_partitions):
        N = 10 * 1000 * 1000
        # randomness required, identical files don't have the issue
        df = pd.DataFrame({
            "A": np.random.uniform(0, 1, size=N),
        })
        df.to_msgpack("/tmp/pd_test_{:02d}.msg".format(i), compress='zlib')


def load_msgpack(f):
    # Read the whole file into a string and pass the string to read_msgpack
    with open(f, "rb") as fh:
        data = fh.read()
    df = pd.read_msgpack(data)
    return df


def load_partitions_sequentially(num_partitions=20):
    for i in range(num_partitions):
        fn = "/tmp/pd_test_{:02d}.msg".format(i)
        df = load_msgpack(fn)
        del df
        log_memory("After partition {}".format(i+1))


log_memory("At initialization")
generate_test_data()
log_memory("After data generation")
load_partitions_sequentially()
Problem description
There is a memory leak in pandas.read_msgpack when reading from a string. Calling pandas.read_msgpack(str_data) increases the reference count of str_data if and only if read_msgpack is seeing the content of str_data for the first time. As a result, memory leaks only when reading different files: when the same file is read over and over again, str_data leaks only once.
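One way to check a reference-count claim like this is to compare sys.getrefcount before and after the call. The following is only a minimal sketch: refcount_delta, leaky_reader and clean_reader are hypothetical names, and the two readers are stand-ins for illustration, not pandas code.

```python
import os
import sys


def refcount_delta(func, payload):
    """Return how many extra references to `payload` remain after func(payload)."""
    before = sys.getrefcount(payload)
    func(payload)
    after = sys.getrefcount(payload)
    return after - before


# Hypothetical stand-ins for a reader function (assumptions, not pandas internals).
_retained = []


def leaky_reader(data):
    _retained.append(data)  # keeps a reference alive after returning
    return len(data)


def clean_reader(data):
    return len(data)  # holds no reference once it returns


payload = os.urandom(1024)  # random content, created at runtime
print("leaky:", refcount_delta(leaky_reader, payload))
print("clean:", refcount_delta(clean_reader, payload))
```

On the pandas version from this report, passing pd.read_msgpack as func with a freshly read string should, if the description above holds, give a positive delta on the first call for each distinct content.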
The problem does not exist when reading from file handles or BytesIO.
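Based on that observation, a possible workaround is to wrap the bytes in a BytesIO before handing them to the reader. This is a sketch under that assumption only; load_via_buffer is a hypothetical helper, and the reader is passed in as a parameter so the snippet does not depend on a particular pandas version.

```python
import io


def load_via_buffer(path, reader):
    """Read `path` as bytes and give `reader` a BytesIO instead of a raw string."""
    with open(path, "rb") as fh:
        buf = io.BytesIO(fh.read())
    return reader(buf)


# Intended usage on the pandas version in this report (assumption):
#     df = load_via_buffer("/tmp/pd_test_00.msg", pd.read_msgpack)
```

Passing the open file handle directly to the reader would avoid the intermediate copy, at the cost of keeping the file open while parsing.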
Output of above example
The output clearly shows the effect of the memory leak when loading data frame partitions sequentially:
[Memory usage] At initialization                 39.4 MB
[Memory usage] After data generation             39.9 MB
[Memory usage] After partition 1                185.9 MB
[Memory usage] After partition 2                329.8 MB
[Memory usage] After partition 3                473.7 MB
[Memory usage] After partition 4                617.6 MB
[Memory usage] After partition 5                761.5 MB
[Memory usage] After partition 6                905.4 MB
[Memory usage] After partition 7               1049.3 MB
[Memory usage] After partition 8               1193.2 MB
[Memory usage] After partition 9               1337.1 MB
[Memory usage] After partition 10              1481.0 MB
[Memory usage] After partition 11              1624.9 MB
[Memory usage] After partition 12              1768.8 MB
[Memory usage] After partition 13              1912.7 MB
[Memory usage] After partition 14              2056.6 MB
[Memory usage] After partition 15              2200.4 MB
[Memory usage] After partition 16              2344.3 MB
[Memory usage] After partition 17              2488.2 MB
[Memory usage] After partition 18              2631.7 MB
[Memory usage] After partition 19              2775.6 MB
[Memory usage] After partition 20              2919.5 MB
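To make the growth explicit, the per-partition increase can be computed from the logged values. The snippet below is a small arithmetic sketch over the numbers reported above; the variable names are illustrative.

```python
# RSS in MB after each of the 20 partitions, copied from the log above.
usage_mb = [
    185.9, 329.8, 473.7, 617.6, 761.5, 905.4, 1049.3, 1193.2, 1337.1, 1481.0,
    1624.9, 1768.8, 1912.7, 2056.6, 2200.4, 2344.3, 2488.2, 2631.7, 2775.6, 2919.5,
]

# Growth between consecutive partitions.
deltas = [b - a for a, b in zip(usage_mb, usage_mb[1:])]
mean_growth = sum(deltas) / len(deltas)

print("min growth:  {:.1f} MB".format(min(deltas)))
print("max growth:  {:.1f} MB".format(max(deltas)))
print("mean growth: {:.1f} MB".format(mean_growth))
```

The increase is nearly constant at roughly 144 MB per partition, which is the pattern expected if each newly read string is retained.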
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-100-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None