I noticed that while converting multiple large (~30 GiB) .sqlite files to .parquet files, the memory in use never dropped from the spike caused by each conversion.
A code example of use:
for file in list_of_sqlite_files:
    convert(
        source_path=source_path,
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )

After each iteration the memory in use was roughly 2x the size of the file being converted and never reset.
So for 30 GiB files, the memory spike was 60 GiB after the first iteration, and on the second iteration I maxed out my 128 GiB of memory while converting the second 30 GiB file.
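For the looped case, here is a minimal sketch of how I could log the per-iteration memory; psutil, gc.collect(), and the placeholder file names are my additions and not part of the original code:

# Sketch: log process memory after each conversion to see whether it is
# released between iterations. Assumes psutil is installed; the file list
# below is a placeholder standing in for the real inputs.
import gc
import pathlib

import psutil
from cytotable import convert

list_of_sqlite_files = ["PBMC_batch_1.sqlite", "PBMC_batch_2.sqlite"]  # placeholders
process = psutil.Process()

for file in list_of_sqlite_files:
    convert(
        source_path=file,
        dest_path=str(pathlib.Path(file).with_suffix(".parquet")),
        dest_datatype="parquet",
        preset="cellprofiler_sqlite_pycytominer",
    )
    gc.collect()  # force a collection so only memory that is truly held remains
    # RSS after each conversion; if this keeps climbing, the memory is being
    # held by the process rather than returned to the OS
    rss_gib = process.memory_info().rss / 1024**3
    print(f"after {file}: RSS = {rss_gib:.1f} GiB")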
I am not sure if this is a Python issue on my end or a memory leak on the CytoTable end. To test this, I have an example with no loop below.
I have a 23.8 MiB subset .sqlite file I can send to make this reproducible, along with the profiled code:
# Notebook to call CytoTable
# import
import pathlib
from cytotable import convert
from memory_profiler import profile
from time import sleep
# set args
sqlite_file = 'PBMC_batch_1_subset'
source_path = str(pathlib.Path('PBMC_batch_1_subset.sqlite'))
dest_path = str(pathlib.Path('PBMC_batch_1_subset.parquet'))
dest_datatype = "parquet"
preset = "cellprofiler_sqlite_pycytominer"
print(f"Performing merge single cells and conversion on {sqlite_file}!")
@profile
def my_func(source_path, dest_path, dest_datatype, preset):
    convert(
        source_path=source_path,
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )

# call profiled function
my_func(source_path, dest_path, dest_datatype, preset)

# to show that the memory spike persists while the script is running post convert call
sleep(45)

Below is the memory-time graph from the profiling.
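In case it helps with reproducing a graph like this, here is a minimal sketch of sampling memory over time around the same convert() call using memory_profiler's memory_usage; the matplotlib plotting and output file name are my additions, not part of the original script:

# Sketch: sample the process memory (in MiB) every 0.1 s while convert() runs,
# then plot memory against time. Assumes matplotlib is installed; paths match
# the script above.
import matplotlib.pyplot as plt
from memory_profiler import memory_usage

from cytotable import convert

interval = 0.1
samples = memory_usage(
    (
        convert,
        (),
        dict(
            source_path="PBMC_batch_1_subset.sqlite",
            dest_path="PBMC_batch_1_subset.parquet",
            dest_datatype="parquet",
            preset="cellprofiler_sqlite_pycytominer",
        ),
    ),
    interval=interval,
)

# plot the sampled memory over elapsed time and save the figure
plt.plot([i * interval for i in range(len(samples))], samples)
plt.xlabel("time (s)")
plt.ylabel("memory (MiB)")
plt.savefig("memory_time_graph.png")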