ValueError: mismatch of shapes when sampling data for compas dataset #329

@bronval

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 0.8.0
  • Python version: 3.9.6
  • Pandas version: 2.0.3
  • Operating System: Ubuntu 22

Error Description

Hello, when sampling 18,000 synthetic rows with CTGAN for the COMPAS dataset (https://www.kaggle.com/datasets/danofer/compass), I came across the following error:
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)

Although the error is raised by Pandas, it originates in the function "_inverse_transform_continuous" in the file "data_transformer.py", specifically at the line
data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))

Note that I am not sure whether the Pandas version is the cause.
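
To illustrate the Pandas side of the failure in isolation, here is a minimal sketch that does not use CTGAN at all. It assumes the array for one continuous column is sliced to two columns while three column names are supplied, which is what the error message suggests. The array contents and the column names are hypothetical placeholders, not values taken from CTGAN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the generated data of one continuous column.
column_data = np.zeros((18000, 3))

# Hypothetical names: the failure implies three output names were returned
# where the slice below provides only two columns of values.
output_columns = ["col.normalized", "col.component", "col.extra"]

try:
    # Same pattern as the failing line: two value columns, three names.
    pd.DataFrame(column_data[:, :2], columns=output_columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this raises a shape-mismatch ValueError of the same kind on its own, the constructor call itself is behaving as designed, and the open question is why the transformer reports three output names for this column.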

Steps to reproduce

First, I trained a model on the COMPAS data (file: cox-violent-parsed_filt.csv) after removing the following columns:
to_remove_compas = ["id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out", "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in", "violent_recid", "vr_offense_date", "screening_date"]
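
The preprocessing and training steps could be sketched roughly as follows. This is a reconstruction under assumptions, not the original script: the helper names prepare_compas and train_and_sample are hypothetical, default CTGAN hyperparameters are assumed, and treating every object-dtype column as discrete is a guess, since the report does not say which discrete columns were passed to fit.

```python
import pandas as pd

# Columns removed before training, copied from the list above.
TO_REMOVE_COMPAS = [
    "id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out",
    "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in",
    "violent_recid", "vr_offense_date", "screening_date",
]


def prepare_compas(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the listed columns, skipping any that are not present."""
    return df.drop(columns=TO_REMOVE_COMPAS, errors="ignore")


def train_and_sample(csv_path: str, n_rows: int = 18000) -> pd.DataFrame:
    """Fit CTGAN on the filtered data and sample n_rows synthetic rows."""
    # Imported lazily so prepare_compas works without ctgan installed.
    from ctgan import CTGAN

    data = prepare_compas(pd.read_csv(csv_path))

    # Assumption: object-dtype columns are the discrete ones.
    discrete_columns = data.select_dtypes(include="object").columns.tolist()

    model = CTGAN()  # assumption: default hyperparameters
    model.fit(data, discrete_columns)
    return model.sample(n_rows)


# Usage (requires ctgan and the Kaggle file):
# fake = train_and_sample("cox-violent-parsed_filt.csv", 18000)
```

Whether this sketch triggers the same error depends on details the report leaves out, such as the discrete column list and how missing values were handled.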

I then tried to sample synthetic data using ctgan.sample(18000) and immediately got this error:

Traceback (most recent call last):
  File "data_generation.py", line 177, in <module>
    fake = ctgan.sample(18000)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/ctgan.py", line 498, in sample
    return self._transformer.inverse_transform(data)
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 218, in inverse_transform
    recovered_column_data = self._inverse_transform_continuous(
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 192, in _inverse_transform_continuous
    data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 758, in __init__
    mgr = ndarray_to_mgr(
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 337, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 408, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)

    Labels

    bug (Something isn't working), resolution:WAI (The software is working as intended)
