It looks like the most recent update of pandas-gbq might have broken our tests. Writing to BigQuery with

```python
pd.DataFrame.to_gbq(
    df,
    destination_table=f"{dataset_id}.{table_id}",
    project_id=project_id,
    chunksize=5,
    if_exists="append",
)
```

with `pandas-gbq==0.15` and reading it back with `dask_bigquery.read_gbq` returns 2 dask partitions, while if the writing is done with `pandas-gbq==0.16`, reading it back with `dask_bigquery.read_gbq` returns only 1 dask partition.
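The read-back side of the check looks roughly like this; a minimal sketch, assuming `read_gbq` accepts `project_id`, `dataset_id`, and `table_id` keyword arguments:

```python
import dask_bigquery

# Read the table written above back as a dask DataFrame.
ddf = dask_bigquery.read_gbq(
    project_id=project_id,
    dataset_id=dataset_id,
    table_id=table_id,
)

# With pandas-gbq==0.15 this reported 2 partitions; with 0.16 only 1.
print(ddf.npartitions)
```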
From the discussion in #11 we know that:

> pandas-gbq 0.16 changed the default intermediate data serialization format to parquet instead of CSV.

Likely this means the backend loader required fewer workers and wrote the data to fewer files behind the scenes.
- Short term solution: pin `pandas-gbq<=0.15` or avoid asserting on `ddf.npartitions`.
- Long term solution: avoid using `pandas-gbq` and use `bigquery.Client.load_table_from_dataframe` (see the sketch after this list) or something like this: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table_that_uses_column-based_time_partitioning
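A minimal sketch of the long-term option using `Client.load_table_from_dataframe` from google-cloud-bigquery; the `project_id`, `dataset_id`, and `table_id` values are hypothetical placeholders:

```python
import pandas as pd
from google.cloud import bigquery

project_id = "my-project"  # hypothetical placeholders
dataset_id = "my_dataset"
table_id = "my_table"

client = bigquery.Client(project=project_id)

df = pd.DataFrame({"x": range(10)})

# Load the DataFrame directly via a BigQuery load job, instead of
# going through pandas-gbq's to_gbq wrapper.
job = client.load_table_from_dataframe(
    df,
    destination=f"{project_id}.{dataset_id}.{table_id}",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
)
job.result()  # block until the load job completes
```

This keeps the write path under our control, so a change in pandas-gbq's intermediate serialization format can no longer silently change how the data lands in BigQuery.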