
Possible memory leak in the convert module #75

@MikeLippincott

I noticed that while converting multiple large (~30 GiB) .sqlite files to .parquet files, the memory in use never dropped back down from the spike caused by the conversion.

An example of the usage:

```python
for file in list_of_sqlite_files:
    convert(
        source_path=file,  # each .sqlite file in the list
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )
```

After each iteration, the memory in use was roughly 2x the size of the file being converted, and it never reset.
So for 30 GiB files, the memory spike after the first iteration was ~60 GiB, and on iteration 2 I maxed out my 128 GiB of memory while converting the second 30 GiB file.
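
For what it's worth, here is a minimal sketch (my own additions, not CytoTable code) of how the growth can be logged per iteration, assuming psutil is installed and the variables from the loop above are defined:

```python
# Sketch only: log resident memory after each convert() call.
# Assumes psutil is installed and list_of_sqlite_files, dest_path,
# dest_datatype, and preset are defined as in the loop above.
import gc
import os

import psutil
from cytotable import convert

process = psutil.Process(os.getpid())

for file in list_of_sqlite_files:
    convert(
        source_path=file,
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )
    gc.collect()  # rule out garbage that simply has not been collected yet
    rss_gib = process.memory_info().rss / 1024**3
    print(f"RSS after converting {file}: {rss_gib:.1f} GiB")
```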

I am not sure if this is a Python thing on my end or a memory leak on CytoTable's end. To test this, I have a loop-free example below.
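
A possible way to sidestep this while it is investigated would be to isolate each conversion in a child process, since the OS reclaims a process's memory when it exits. A rough sketch, assuming the same variables as the loop above (the per-file `.replace()` destination naming is just for illustration):

```python
# Sketch only: run each convert() in its own process so all of its
# memory is returned to the OS when the child exits.
# Assumes list_of_sqlite_files, dest_datatype, and preset are defined.
import multiprocessing

from cytotable import convert

def _convert_one(source_path, dest_path, dest_datatype, preset):
    convert(
        source_path=source_path,
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )

if __name__ == "__main__":
    for file in list_of_sqlite_files:
        proc = multiprocessing.Process(
            target=_convert_one,
            args=(
                file,
                file.replace(".sqlite", ".parquet"),  # illustrative naming
                dest_datatype,
                preset,
            ),
        )
        proc.start()
        proc.join()  # memory is released when the child process exits
```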

I have a 23.8 MiB subset .sqlite file that I can send along for reproducing the issue with the profiled code below:

```python
# Notebook to call CytoTable
# imports
import pathlib
from time import sleep

from cytotable import convert
from memory_profiler import profile

# set args
sqlite_file = "PBMC_batch_1_subset"
source_path = str(pathlib.Path("PBMC_batch_1_subset.sqlite"))
dest_path = str(pathlib.Path("PBMC_batch_1_subset.parquet"))
dest_datatype = "parquet"
preset = "cellprofiler_sqlite_pycytominer"

print(f"Performing merge single cells and conversion on {sqlite_file}!")

@profile
def my_func(source_path, dest_path, dest_datatype, preset):
    convert(
        source_path=source_path,
        dest_path=dest_path,
        dest_datatype=dest_datatype,
        preset=preset,
    )

# call profiled function
my_func(source_path, dest_path, dest_datatype, preset)

# to show that the memory spike persists after the convert call returns
sleep(45)
```
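
(For reproducing the plot: a memory-time graph like the one below can be generated with memory_profiler's mprof tooling, e.g. `mprof run script.py` followed by `mprof plot`; `script.py` is just a placeholder name for the script above.)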

Below is the memory-time graph from the profiling:

[image: memory-time graph from the profiling run]
