Closed
Labels: accepting pull requests, api: bigquery, type: feature request
Description
We use pandas-gbq a lot in our daily analyses. It is well known that pandas memory consumption can be a pain; see e.g. https://www.dataquest.io/blog/pandas-big-data/
I have started to write a patch that could be integrated as an enhancement to `read_gbq` (rough idea, details TBD):
- Provide a boolean `optimize_memory` option.
- If `True`, the source table is first inspected with a query that collects the min, max, and presence of NULLs for INTEGER columns, and the percentage of unique values for STRING columns.
- When calling `to_dataframe`, this information is passed to the `dtypes` option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas `category` type below some uniqueness threshold (e.g. less than 50% unique values). A sketch of the idea follows this list.
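To make the idea concrete, here is a minimal sketch of the logic using the google-cloud-bigquery client directly. The table and column names (`my_project.my_dataset.my_table`, `user_id`, `country`), the `pick_int_dtype` helper, and the 50% threshold are all illustrative, not part of any existing API; `to_dataframe(dtypes=...)` is the existing client-library option referred to above.

```python
from google.cloud import bigquery
import numpy as np

client = bigquery.Client()
table = "my_project.my_dataset.my_table"  # hypothetical table

def pick_int_dtype(min_val, max_val, has_nulls):
    """Pick the smallest numpy (u)int type that holds the observed range."""
    if has_nulls:
        # Plain numpy integer dtypes cannot represent NULL/NaN; fall back
        # to float64 here (pandas' nullable Int64 is another option).
        return np.float64
    candidates = (
        (np.uint8, np.uint16, np.uint32, np.uint64)
        if min_val >= 0
        else (np.int8, np.int16, np.int32, np.int64)
    )
    for dt in candidates:
        info = np.iinfo(dt)
        if info.min <= min_val and max_val <= info.max:
            return dt
    return np.int64

# One profiling query over the source table (columns are hypothetical):
# min/max/NULL presence for the INTEGER column, unique ratio for the STRING one.
stats = client.query(f"""
    SELECT
      MIN(user_id) AS min_val,
      MAX(user_id) AS max_val,
      COUNTIF(user_id IS NULL) > 0 AS has_nulls,
      COUNT(DISTINCT country) / COUNT(country) AS str_unique_ratio
    FROM `{table}`
""").result().to_dataframe().iloc[0]

dtypes = {
    "user_id": pick_int_dtype(
        stats["min_val"], stats["max_val"], stats["has_nulls"]
    )
}
if stats["str_unique_ratio"] < 0.5:  # under the 50%-unique threshold
    dtypes["country"] = "category"

# to_dataframe() accepts a `dtypes` mapping that is applied per column
# while the DataFrame is built, avoiding a wide-dtype intermediate copy.
df = (
    client.query(f"SELECT user_id, country FROM `{table}`")
    .result()
    .to_dataframe(dtypes=dtypes)
)
```

The extra profiling query costs one scan of the relevant columns, so the option would trade a small amount of query time for a potentially large reduction in DataFrame memory.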
I already have a working monkey-patch, which is still a bit rough. If there is enough interest, I'd happily make it more robust and submit a PR. This would be my first significant contribution to an open-source project, so some help and feedback would be appreciated.
Curious to hear your views on this.