
Conversation

@renzepost
Contributor

closes: #36793

As mentioned in #36793, the default of 1 for parquet_row_group_size leads to quite a few problems: the output Parquet files become huge, the tasks run into OOM issues, and the task duration is extended.

I must note that the documentation says a large number means the worker needs more memory to execute, but I observed the opposite: with 1 row per row group the tasks get killed with OOM errors very quickly, while with a large number everything runs fine.

I changed it to 100000 based on what other Parquet writers are doing (DuckDB and Polars), but of course this number is open for debate. @Taragolis suggested a much lower value between 100 and 1000.
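For context, here is a minimal sketch of how the new default can be tuned per task, assuming a google provider release that exposes parquet_row_group_size on the SQL-to-GCS operators; PostgresToGCSOperator is used only as an example, and the DAG id, connection, query, and bucket names are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import (
    PostgresToGCSOperator,
)

with DAG(
    dag_id="example_sql_to_gcs_parquet",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
):
    export_orders = PostgresToGCSOperator(
        task_id="export_orders_to_gcs",      # hypothetical task id
        postgres_conn_id="postgres_default",
        sql="SELECT * FROM orders",          # hypothetical query
        bucket="my-export-bucket",           # hypothetical bucket
        filename="orders/{{ ds }}/part-{}.parquet",
        export_format="parquet",
        # With this change the default becomes 100000 rows per row group.
        # Pass a smaller value to bound per-task memory, or 1 to restore
        # the old (pre-change) behaviour.
        parquet_row_group_size=100000,
    )
```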


boring-cyborg bot added the area:providers and provider:google labels on Jan 16, 2024
@Taragolis
Contributor

@Taragolis suggested a much lower value between 100 and 1000.

This was a suggestion from the pessimist inside of me. 🤣

@potiuk
Member

potiuk commented Jan 16, 2024

It's a borderline breaking change, but I'd hate to bump the MAJOR version of the google provider because of it. I think it would be enough if there is a STRONG mention in the Changelog for the provider - can you add one please?

@potiuk
Member

potiuk left a comment

Hmm. I was thinking about a description that will explain the consequences to users. The STRONG thing I was thinking about was not merely linking to the PR (this happens automatically) but adding some explanation for users: what it means for them, what kind of behaviour change they can expect (memory, performance, etc.), and how they can remediate it (including an explanation of what they should do to get back to the previous behaviour). Yes, it can all be seen from the discussion in the PR, but we want to explain it in the changelog so that users do not have to go to the detailed PR.

This is what I mean by "STRONG"
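To make the trade-off concrete, here is a standalone pyarrow sketch (not code from the provider or this PR; the table contents and file paths are made up) of the behaviour such a changelog note should spell out: one row per row group bloats the file with per-group metadata and slows writes and reads, while large row groups buffer that many rows in memory before each flush.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative in-memory table; the operator builds its Parquet output
# from the SQL query results instead.
table = pa.table({"id": list(range(10_000)), "value": [0.0] * 10_000})

# One row per row group (the old default): thousands of row groups,
# large metadata overhead, noticeably bigger and slower files.
pq.write_table(table, "/tmp/one_row_per_group.parquet", row_group_size=1)

# Large row groups (the operator's new default is 100000): far fewer groups
# and smaller files, at the cost of buffering each group in memory before
# it is flushed.
pq.write_table(table, "/tmp/large_row_groups.parquet", row_group_size=10_000)
```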

@renzepost
Contributor Author

Ah, got it! I've added a more verbose description in the changelog. Let me know if I missed anything or need to change the wording.

@potiuk
Member

potiuk commented Jan 17, 2024

Nice

Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is>
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Successfully merging this pull request closes: Change SQL to GCS operators default row group size when output is Parquet (#36793)
