
Conversation

@renzepost
Contributor

closes: #36793

As mentioned in #36793, the default of 1 for parquet_row_group_size leads to quite a few problems: the output Parquet files become huge, the tasks run into OOM issues, and the task duration is extended.

I must note that the documentation says a large number means the worker needs more memory to execute, but I observed the opposite: with 1 row per row group the tasks get killed with OOM errors very quickly, while with a large number everything runs fine.

I changed it to 100000 based on what other Parquet writers are doing (DuckDB and Polars), but of course this number is open for debate. @Taragolis suggested a much lower value between 100 and 1000.
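For context, here is a minimal sketch of how the new default can be tuned per task, assuming a google provider release that exposes parquet_row_group_size on the SQL-to-GCS operators; PostgresToGCSOperator is used only as an example, and the DAG id, connection, query, and bucket names are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import (
    PostgresToGCSOperator,
)

with DAG(
    dag_id="example_sql_to_gcs_parquet",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
):
    export_orders = PostgresToGCSOperator(
        task_id="export_orders_to_gcs",      # hypothetical task id
        postgres_conn_id="postgres_default",
        sql="SELECT * FROM orders",          # hypothetical query
        bucket="my-export-bucket",           # hypothetical bucket
        filename="orders/{{ ds }}/part-{}.parquet",
        export_format="parquet",
        # With this change the default becomes 100000 rows per row group.
        # Pass a smaller value to bound per-task memory, or 1 to restore
        # the old (pre-change) behaviour.
        parquet_row_group_size=100000,
    )
```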


boring-cyborg bot added the area:providers and provider:google labels on Jan 16, 2024
@Taragolis
Contributor

@Taragolis suggested a much lower value between 100 and 1000.

This was a suggestion from the pessimist inside of me. 🤣

@potiuk
Member

potiuk commented Jan 16, 2024

It's a borderline breaking change, but I'd hate to bump the MAJOR version of the google provider because of it. I think it would be enough if there is a STRONG mention in the Changelog for the provider - can you add one please?

@potiuk
Member

potiuk left a comment

Hmm. I was thinking about a description that will explain the consequences to users. The STRONG thing I was thinking about was not merely linking to the PR (this happens automatically) but adding some explanation for users: what it means for them, what kind of behaviour change they can expect (memory, performance, etc.), and how they can remediate it (including an explanation of what they should do to get back to the previous behaviour). Yes, it can all be seen from the discussion in the PR, but we want to explain it in the changelog so that users do not have to go to the detailed PR.

This is what I mean by "STRONG"
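To make the trade-off concrete, here is a standalone pyarrow sketch (not code from the provider or this PR; the table contents and file paths are made up) of the behaviour such a changelog note should spell out: one row per row group bloats the file with per-group metadata and slows writes and reads, while large row groups buffer that many rows in memory before each flush.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative in-memory table; the operator builds its Parquet output
# from the SQL query results instead.
table = pa.table({"id": list(range(10_000)), "value": [0.0] * 10_000})

# One row per row group (the old default): thousands of row groups,
# large metadata overhead, noticeably bigger and slower files.
pq.write_table(table, "/tmp/one_row_per_group.parquet", row_group_size=1)

# Large row groups (the operator's new default is 100000): far fewer groups
# and smaller files, at the cost of buffering each group in memory before
# it is flushed.
pq.write_table(table, "/tmp/large_row_groups.parquet", row_group_size=10_000)
```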

@renzepost
Contributor Author

Ah, got it! I've added a more verbose description in the changelog. Let me know if I missed anything or need to change the wording.

@potiuk
Member

potiuk commented Jan 17, 2024

Nice

Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is>
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Successfully merging this pull request closes: Change SQL to GCS operators default row group size when output is Parquet (#36793)
