Assigning values breaks in different ways when duplicate column names #24798

@user347

Code Sample, a copy-pastable example if possible

Our primary data, with two columns that share the same name:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')
print(df)
"""
   A  A  B   C
0  0  3  6   9
1  1  4  7  10
2  2  5  8  11
"""
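To make the setup easier to inspect, here is a minimal sketch of how the duplicate labels can be detected and what label-based selection returns for them. It only uses `Index.is_unique` and `Index.duplicated()`; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the report: two columns share the label 'A'.
df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')

# Index.is_unique is False as soon as any label repeats.
has_duplicates = not df.columns.is_unique

# Index.duplicated() marks every repeat after the first occurrence.
duplicated_labels = df.columns[df.columns.duplicated()].unique().tolist()

# Selecting a duplicated label returns a DataFrame (both columns),
# while selecting a unique label returns a Series.
selected_a = df['A']
selected_b = df['B']
```

Note that `df['B']` is an ordinary Series here, so the failures below happen even though the column being assigned is not itself duplicated.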

Issue 1a: assigning the result of Series.replace back to the column raises ValueError:

print(df['B'].replace(6, np.nan))  # will work as expected with int as well
"""
0    NaN
1    7.0
2    8.0
Name: B, dtype: float64
"""
# Raises ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df['B'] = df['B'].replace(6, np.nan)  # with inplace=True no error is raised, but the column is left unchanged
df['B'] = df['B'].replace(6, 5)

Issue 1b: Same ValueError as above thrown when assigning np.nan with loc:

# ValueError: Buffer has wrong number of dimensions (expected 1, got 0):
df.loc[df['B'] == 6, 'B'] = np.nan

# Assigning an int with loc works, however:
df.loc[df['B'] == 6, 'B'] = 5
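As a possible workaround until this is resolved, the column labels can be made unique before assigning. The sketch below is only an illustration: `dedupe_columns` is a hypothetical helper (not a pandas function), and the `.N` suffix scheme is my own assumption. It does not check for collisions with labels that already look like `A.1`.

```python
import numpy as np
import pandas as pd


def dedupe_columns(frame):
    """Return a copy of `frame` whose repeated labels get a '.N' suffix.

    Hypothetical helper for illustration; not part of pandas.
    """
    seen = {}
    new_names = []
    for name in frame.columns:
        count = seen.get(name, 0)
        # First occurrence keeps its label; repeats become 'A.1', 'A.2', ...
        new_names.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    out = frame.copy()
    out.columns = new_names
    return out


df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')

unique_df = dedupe_columns(df)

# Cast explicitly so the NaN assignment does not rely on implicit upcasting.
unique_df['B'] = unique_df['B'].astype('float64')
unique_df.loc[unique_df['B'] == 6, 'B'] = np.nan
```

The original `df` is left untouched, so the duplicate labels can be restored afterwards if they are needed downstream.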

Issue 2a: assigning np.nan with iloc to a column that has a duplicate name changes both columns:

df.iloc[0, 0] = np.nan
print(df)
"""
     A    A  B   C
0  NaN  NaN  5   9
1  1.0  4.0  7  10
2  2.0  5.0  8  11
"""

Issue 2b: assigning an int with iloc works in v0.22.0 but not in v0.23.4:

df.iloc[0, 0] = 10
print(df)
"""
0.22.0:
    A  A  B   C
0  10  3  5   9
1   1  4  7  10
2   2  5  8  11

0.23.4:
      A     A  B   C
0  10.0  10.0  5   9
1   1.0   4.0  7  10
2   2.0   5.0  8  11
"""
# Assigning with iloc does not break if BOTH columns already contain a NaN:
x = pd.DataFrame({'a': np.array([np.nan, 1, 2])})
y = pd.DataFrame({'a': np.array([0, np.nan, 2])})

df = pd.concat([x, y], axis=1)

df.iloc[0, 0] = 10
print(df)
"""
      a    a
0  10.0  0.0
1   1.0  NaN
2   2.0  2.0
"""
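To check whether a given pandas version is affected by Issue 2, something like the following sketch could be used. `columns_changed_by_iloc` is a hypothetical helper written for this report, not a pandas API: it performs the assignment on a copy and returns the positions of every column whose values differ afterwards.

```python
import numpy as np
import pandas as pd


def columns_changed_by_iloc(frame, row, col, value):
    """Assign `value` at (row, col) on a copy and list the positions of
    all columns whose values differ afterwards.

    Hypothetical helper for illustration; not part of pandas.
    """
    after = frame.copy()
    after.iloc[row, col] = value
    changed = []
    for pos in range(frame.shape[1]):
        old = frame.iloc[:, pos]
        new = after.iloc[:, pos]
        # Treat NaN == NaN as "unchanged" so only real differences count.
        same = (old == new) | (old.isna() & new.isna())
        if not same.all():
            changed.append(pos)
    return changed


df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')

# Issue 2b scenario: write an int into the first 'A' column only.
changed = columns_changed_by_iloc(df, 0, 0, 10)
```

If only the targeted column is written, `changed` contains just position 0. Based on the output shown above for v0.23.4, I would expect it to also contain position 1 on an affected version.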

Problem description

The main topic of this issue is assigning values to a DataFrame that contains duplicate column names. List of issues reported:

The issue with iloc and np.nan (called Issue 2a above) was reported in #13423 and closed as fixed as of 0.18.0, but I am able to reproduce it with v0.23.4.

Expected Output

Either the same output we would get if all column names were unique, or a DuplicateColumnWarning/DuplicateColumnException when the DataFrame contains duplicate column names.
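A rough sketch of what the warning variant could look like from user code. `DuplicateColumnWarning` and `warn_on_duplicate_columns` are hypothetical names taken from or invented for this proposal; as far as I know pandas does not define them.

```python
import warnings

import numpy as np
import pandas as pd


class DuplicateColumnWarning(UserWarning):
    """Hypothetical warning class proposed in this issue; not a pandas name."""


def warn_on_duplicate_columns(frame):
    """Emit DuplicateColumnWarning if `frame` has repeated column labels."""
    if not frame.columns.is_unique:
        repeated = frame.columns[frame.columns.duplicated()].unique().tolist()
        warnings.warn(
            f"DataFrame has duplicate column labels: {repeated}",
            DuplicateColumnWarning,
            stacklevel=2,
        )
    return frame


df = pd.DataFrame(np.arange(12).reshape(4, 3).T)
df.columns = list('AABC')

# Capture the warning so the example can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_duplicate_columns(df)

messages = [
    str(w.message)
    for w in caught
    if issubclass(w.category, DuplicateColumnWarning)
]
```

Raising an exception instead would only require replacing `warnings.warn` with a `raise`. Which of the two is preferable is a design decision for the maintainers.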

Output of pd.show_versions()

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.16.0
scipy: 1.1.0T
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Labels: Needs Tests (unit test(s) needed to prevent regressions), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), good first issue
