@liferoad liferoad commented Aug 7, 2025

Add _get_temp_dataset_project helper method to determine the correct project ID for temporary dataset operations. Update _setup_temporary_dataset and finish_bundle to use this method instead of _get_project when cleaning up or creating temporary datasets. This ensures the correct project is used when temp_dataset is a DatasetReference.

Add comprehensive unit tests to verify the behavior with different temp_dataset configurations.

Fixes #35813




@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @liferoad, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've addressed a bug in BigQuery temporary dataset operations where the incorrect Google Cloud project might be used, particularly when a DatasetReference is provided. This change ensures that temporary datasets are created and cleaned up within the correct project, preventing potential issues and aligning with user expectations.

Highlights

  • New Helper Method for Project ID: I've introduced _get_temp_dataset_project in bigquery_read_internal.py. This new method intelligently determines the correct Google Cloud project ID for temporary BigQuery dataset operations, prioritizing the project specified in a DatasetReference if provided, otherwise falling back to the pipeline's default project.
  • Correct Project Usage in Dataset Operations: The _setup_temporary_dataset and finish_bundle methods have been updated to utilize the new _get_temp_dataset_project helper. This ensures that both the creation and cleanup of temporary BigQuery datasets consistently use the appropriate project, resolving the bug where an incorrect project might have been used.
  • Comprehensive Unit Test Coverage: I've added a new dedicated unit test file, bigquery_read_internal_test.py, to thoroughly validate the behavior of the _get_temp_dataset_project method and its integration. These tests cover various scenarios, including different temp_dataset configurations (string, DatasetReference, None, and ValueProvider), ensuring robustness and correctness.
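
The resolution rule described in the highlights above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `DatasetReference` dataclass is a simplified stand-in for the BigQuery message type, and the function name and the `pipeline_project` parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class DatasetReference:
    """Simplified stand-in for the BigQuery DatasetReference message (assumption)."""
    projectId: str
    datasetId: str


def get_temp_dataset_project(
    temp_dataset: Union[str, DatasetReference, None],
    pipeline_project: str,
) -> str:
    """Return the project that should own the temporary dataset.

    Hypothetical helper: a DatasetReference carries its own project, so that
    project wins. A plain dataset name or None carries no project, so the
    pipeline's default project is used instead.
    """
    if isinstance(temp_dataset, DatasetReference):
        return temp_dataset.projectId
    return pipeline_project
```

With this rule, creation and cleanup of the temporary dataset both resolve to the same project as long as both call the same helper, which appears to be the point of routing `_setup_temporary_dataset` and `finish_bundle` through one method.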

Skip tests when BigQuery dependencies are not installed to prevent test failures
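
A common way to get this behavior in Python is to guard the optional import and skip the test class when the import failed. The sketch below assumes the guarded import is `HttpError` from `apitools`; the import actually guarded in the PR's test file is not shown in this conversation, and the test class and `run_suite` helper are hypothetical.

```python
import unittest

# Optional dependency: fall back to None when it is not installed,
# so that importing this test module never fails.
try:
    from apitools.base.py.exceptions import HttpError
except ImportError:
    HttpError = None


@unittest.skipIf(HttpError is None, "GCP dependencies are not installed")
class TempDatasetProjectTest(unittest.TestCase):
    """Hypothetical test class; it only runs when the dependency is present."""

    def test_http_error_is_exception(self):
        self.assertTrue(issubclass(HttpError, Exception))


def run_suite() -> unittest.TestResult:
    """Run the test class and return the result for inspection."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TempDatasetProjectTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When the dependency is missing, the test is reported as skipped rather than failed, so the suite still counts as successful.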

liferoad commented Aug 8, 2025

R: @stankiewicz


github-actions bot commented Aug 8, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".


@stankiewicz stankiewicz left a comment


changes to bigquery_tools.py are also needed as I mentioned in the bug and comment.

Add _get_temp_table_project method to handle project ID resolution for temporary tables
Add corresponding tests to verify fallback behavior
@liferoad liferoad requested a review from stankiewicz August 8, 2025 12:39

liferoad commented Aug 8, 2025

> changes to bigquery_tools.py are also needed as I mentioned in the bug and comment.

Thanks.


@stankiewicz stankiewicz left a comment


looks good! I think it's worth adding a note to CHANGES.md, since for some customers it will start creating datasets in the proper project.


liferoad commented Aug 8, 2025

> looks good! I think it's worth adding a note to CHANGES.md, since for some customers it will start creating datasets in the proper project.

Let me do this with another PR later.

@liferoad liferoad merged commit 111c83f into apache:master Aug 8, 2025
85 of 87 checks passed
liferoad added a commit to liferoad/beam that referenced this pull request Aug 8, 2025
…e#35817)

* fix(bigquery): use correct project for temp dataset operations

Add _get_temp_dataset_project helper method to determine the correct project ID for temporary dataset operations. Update _setup_temporary_dataset and finish_bundle to use this method instead of _get_project when cleaning up or creating temporary datasets. This ensures the correct project is used when temp_dataset is a DatasetReference.

Add comprehensive unit tests to verify the behavior with different temp_dataset configurations.

* test(bigquery): handle missing bigquery dependencies in tests

Skip tests when BigQuery dependencies are not installed to prevent test failures

* fix lint

* feat(bigquery): add temp table project resolution helper

Add _get_temp_table_project method to handle project ID resolution for temporary tables
Add corresponding tests to verify fallback behavior
parveensania pushed a commit to parveensania/beam-dp that referenced this pull request Aug 17, 2025
…e#35817)

Development

Successfully merging this pull request may close these issues.

[Bug]: BigQueryIO creates temp_dataset in wrong project
