load_table_from_dataframe produces incorrect results when used in list of dict  #781

@Lmmejia11

Environment details

  • OS type and version: Debian 10 (dataproc image 2.0-debian10)
  • Python version: python --version: Python 3.8.10
  • pip version: pip --version: pip 21.1.2
  • google-cloud-bigquery version: pip show google-cloud-bigquery: google-cloud-bigquery==2.6.2, pyarrow==2.0.0

Steps to reproduce

  1. Create a large DataFrame (1000 rows) with a column containing a list (length at least 6) of identically structured dictionaries.
  2. Create a BigQuery client and use load_table_from_dataframe to create a table in BigQuery.
  3. Check the resulting table in BigQuery. Structs seem to swap values with other items in the same list. For example, the table should contain [STRUCT('w0' AS name, 0.1 AS value), STRUCT('h1' AS name, 1.2 AS value)] but instead contains [STRUCT('h1' AS name, 0.1 AS value), STRUCT('w0' AS name, 1.2 AS value)]. The main problem is not the order but that the integrity of each struct is lost (e.g. 'w0' should be paired with 0.1, not 1.2).
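
The integrity requirement in step 3 can be written as a check that ignores ordering and compares only the (name, value) pairs. This is a minimal sketch in plain Python, assuming the source rows and the loaded rows are already available as lists of dicts with 'name' and 'value' keys. The helper name find_corrupted_rows is hypothetical and not part of google-cloud-bigquery, and exact float equality is assumed to survive the round trip.

```python
from collections import Counter


def find_corrupted_rows(expected_rows, actual_rows):
    """Return indices of rows whose (name, value) pairs differ from the source.

    Each row is a list of dicts shaped like {"name": str, "value": float}.
    Order inside a row is ignored; only the pairing of name and value matters.
    Rows are compared positionally; extra rows on either side are not checked.
    """
    corrupted = []
    for index, (expected, actual) in enumerate(zip(expected_rows, actual_rows)):
        expected_pairs = Counter((item["name"], item["value"]) for item in expected)
        actual_pairs = Counter((item["name"], item["value"]) for item in actual)
        if expected_pairs != actual_pairs:
            corrupted.append(index)
    return corrupted


# Illustrative data: row 0 is only reordered, row 1 has swapped names.
source = [
    [{"name": "w0", "value": 0.1}, {"name": "h1", "value": 1.2}],
    [{"name": "w0", "value": 0.1}, {"name": "h1", "value": 1.2}],
]
loaded = [
    [{"name": "h1", "value": 1.2}, {"name": "w0", "value": 0.1}],
    [{"name": "h1", "value": 0.1}, {"name": "w0", "value": 1.2}],
]
print(find_corrupted_rows(source, loaded))  # [1]
```

A reordered row passes the check, while a row where 'w0' ends up paired with 1.2 is flagged, which matches the distinction made in the report.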

Code example

import numpy as np
import pandas as pd
from google.cloud import bigquery

# Create a DataFrame with one column holding a list of dictionaries.
# In this example, the dict structure is {"name": str, "value": float}: name is a
# letter followed by an int, and value is a random float whose scale grows with i.
data = [
    [[{'name': 'whyist'[i] + str(i), 'value': np.random.random() * 10**i} for i in range(6)]]
    for _ in range(1000)
]
df = pd.DataFrame(data, columns=['vals'])

# Load the DataFrame into BigQuery, replacing any existing table.
project = 'myproject'
bq_client = bigquery.Client(project=project)
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = 'WRITE_TRUNCATE'
load_job = bq_client.load_table_from_dataframe(
    dataframe=df,
    destination='tmp.test_bug',
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish before checking the table

Checking the resulting table in BigQuery: at least for this example, the 'value' attributes are written in the correct order (the first item has the smallest value, and the values increase from there). The 'name' values, however, appear to have been sampled from the original names with possible repetition. Every table row has the same 'name' values in the same order, and that order can change when the code is re-executed.
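
The observations above can be checked mechanically once the loaded rows are available locally. This is a rough sketch, assuming each row is a plain list of dicts with 'name' and 'value' keys. The helper describe_symptoms is hypothetical, and the sample rows are made up for illustration rather than taken from an actual table. Because the values in the reproduction are random, "increasing" is simplified here to "non-decreasing within a row".

```python
def describe_symptoms(rows):
    """Summarize the reported observations for rows of {"name", "value"} dicts."""
    # True when, in every row, values never decrease from one item to the next.
    values_in_order = all(
        all(a["value"] <= b["value"] for a, b in zip(row, row[1:]))
        for row in rows
    )
    # Distinct sequences of names across rows; a single entry means all rows agree.
    name_sequences = {tuple(item["name"] for item in row) for row in rows}
    # True when at least one row contains the same name more than once.
    has_repeated_names = any(len(set(seq)) < len(seq) for seq in name_sequences)
    return {
        "values_in_order": values_in_order,
        "same_names_in_every_row": len(name_sequences) == 1,
        "has_repeated_names": has_repeated_names,
    }


# Made-up rows that mimic the description: ordered values, repeated names.
example_rows = [
    [{"name": "h1", "value": 0.3}, {"name": "h1", "value": 4.0}, {"name": "w0", "value": 55.0}],
    [{"name": "h1", "value": 0.7}, {"name": "h1", "value": 2.5}, {"name": "w0", "value": 91.0}],
]
print(describe_symptoms(example_rows))
```

If all three flags come back True on a real table, the result matches the behaviour described in this report.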

Metadata

Labels

  • api: bigquery (Issues related to the googleapis/python-bigquery API.)
  • priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
