Change default parquet_row_group_size in BaseSQLToGCSOperator
#36817
This was a suggestion from the pessimist inside of me. 🤣
It's a borderline breaking change, but I'd hate to bump the MAJOR version of the google provider because of it. I think, however, it would be enough if there is a STRONG mention in the changelog for the provider. Can you add one, please?
potiuk
left a comment
Hmm. I was thinking about a description that will explain the consequences to the users. The *STRONG* thing I was thinking about was not merely linking to the PR (this happens automatically) but adding some explanation for the users: what it means to them, what kind of behaviour change they can expect (memory, performance, etc.), and how they can remediate it (including an explanation of what they should do to come back to the previous behaviour). Yes, it can all be seen from the discussion in the PR, but we want to explain it in the changelog so that users do not have to go to the detailed PR.
This is what I mean by "STRONG"
Ah, got it! I've added a more verbose description in the changelog. Let me know if I missed anything or need to change the wording.
Nice
Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is>
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
closes: #36793
As mentioned in #36793, a default setting of 1 for `parquet_row_group_size` leads to quite a few problems. For example, the output Parquet files become huge, the tasks run into OOM issues, and the task duration is extended.
I must note that the documentation says that a large number means the worker needs more memory to execute, while on the contrary I noticed that with 1 row per row group the tasks get killed with OOM issues very quickly and with a large number everything seems fine...
I changed it to 100000 based on what other Parquet writers are doing (DuckDB and Polars), but of course this number is open for debate. @Taragolis suggested a much lower value between 100 and 1000.
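To make the scale of the difference concrete, here is a minimal sketch of the arithmetic only. It does not use Airflow or a Parquet library; `row_group_count` is a hypothetical helper and the one-million-row export is an assumed figure, with all rows assumed to land in a single output file. Each row group carries its own metadata, so the number of row groups is a rough proxy for the per-file overhead described above.

```python
def row_group_count(total_rows: int, row_group_size: int) -> int:
    """Return how many row groups are needed to hold total_rows rows.

    Hypothetical helper for illustration; not part of Airflow or any
    Parquet library.
    """
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if row_group_size < 1:
        raise ValueError("row_group_size must be at least 1")
    # Integer ceiling division: a partially filled last group still counts.
    return -(-total_rows // row_group_size)


total_rows = 1_000_000  # assumed export size, for illustration only

# Old default (1), the range suggested in review (1000), new default (100000).
for size in (1, 1_000, 100_000):
    groups = row_group_count(total_rows, size)
    print(f"parquet_row_group_size={size}: {groups} row groups")
```

Under these assumptions the old default yields one row group per row, while the proposed default yields a handful of row groups for the same data.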