Contributor

@mohamedawnallah mohamedawnallah commented Jul 30, 2025

Description

This change set is intended to fix the failures in beam_PreCommit_Python_ML and beam_PostCommit_Python, which are regressions introduced by PR #35698.

Motivation and Context

#35698 (comment)


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @mohamedawnallah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the run_pytest.sh script by refining how command-line arguments are processed. The change ensures that user-provided pytest markers are accurately identified and handled from the intended positional arguments, thereby improving the script's reliability when executing tests.

Highlights

  • Script Argument Handling Fix: I've corrected the variable used for parsing and manipulating pytest command-line arguments, specifically the -m (marker) flag, within the run_pytest.sh script. The script now correctly uses posargs instead of pytest_args for this logic.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly refactors run_pytest.sh to parse the -m (marker) argument from posargs instead of pytest_args. The change is logical and directly addresses the goal stated in the pull request title.

My review includes two main points:

  1. A suggestion to refactor the marker parsing logic to be more efficient and correct by using native bash features instead of sed.
  2. A high-severity issue I've identified with how $posargs is passed to pytest's --pyargs option, which could lead to test failures.
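
As an illustration of the first point, marker extraction with native bash features could look roughly like the sketch below. It assumes the arguments have already been split into separate words. The function name extract_marker and the variables marker and test_paths are hypothetical and are not taken from run_pytest.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: separates a pytest "-m <expr>" pair from other arguments.
# Results land in the globals `marker` and `test_paths` (assumed names).
extract_marker() {
  marker=""
  test_paths=()
  while (( $# > 0 )); do
    case "$1" in
      -m)
        # "-m EXPR": the expression is the next argument, if there is one.
        if (( $# > 1 )); then
          marker="$2"
          shift
        fi
        ;;
      -m?*)
        # "-mEXPR": the expression is glued to the flag.
        marker="${1#-m}"
        ;;
      *)
        # Anything else is kept as a test path or pass-through option.
        test_paths+=("$1")
        ;;
    esac
    shift
  done
}

extract_marker apache_beam/ml/ -m "not require_docker_in_docker"
printf 'marker=%s\n' "$marker"
printf 'path=%s\n' "${test_paths[@]}"
```

Working on "$@" rather than on a flat string avoids re-parsing quotes at this stage; how the words get split in the first place is a separate question.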

@mohamedawnallah
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix how positional arguments are handled in the run_pytest.sh script, addressing a regression from a previous change. The modifications correctly target posargs for marker extraction and introduce more robust parsing logic using native bash features instead of sed.

While the intent is good and the move away from sed is a positive step for readability and performance, I've identified a few issues that need to be addressed:

  • Critical Security Vulnerability: The use of eval on user-controlled input introduces a command injection vulnerability.
  • Correctness Issues: The argument parsing logic is brittle and may fail for common pytest options. Additionally, the regular expression for unquoting strings is flawed.
  • Clarity: A debug message has become misleading due to the changes.

My review comments below provide detailed explanations and suggestions for each of these points. Addressing them will significantly improve the script's security and robustness.

test_paths=""

# Safely split the posargs string into individual arguments.
eval "set -- $posargs"
Contributor

critical

The use of eval on $posargs introduces a critical command injection vulnerability. Since $posargs can be controlled by the user invoking tox, a malicious string could lead to arbitrary command execution on the system running the script.

For example, a user could pass '; rm -rf ~' as part of the arguments, and eval would execute the rm command.

Please replace eval with a safer method for parsing the arguments. While parsing shell arguments with quotes is a complex problem, using eval on untrusted input is not secure. Consider if the arguments can be passed from tox.ini in a way that they are already available as a shell array, which would avoid this parsing issue entirely.
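
The mechanism can be reproduced in isolation with a harmless payload. The sketch below is a standalone demonstration, not code from run_pytest.sh; the function name split_with_eval is made up, and echo stands in for a destructive command.

```shell
#!/usr/bin/env bash
# Hypothetical demonstration; split_with_eval is not part of run_pytest.sh.
split_with_eval() {
  local posargs="$1"
  # Same construct as in the PR: the string is re-parsed as shell code.
  eval "set -- $posargs"
  printf 'argc=%d\n' "$#"
}

# Benign input: the quotes are honored and three arguments come out.
split_with_eval "apache_beam/ml/ -m '(not require_docker_in_docker)'"

# Hostile input: everything after ';' runs as a separate command.
split_with_eval "apache_beam/ml/; echo INJECTED"
```

The second call prints INJECTED before reporting a single argument, which shows that the text after the semicolon was executed rather than treated as data.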

Contributor Author

@mohamedawnallah mohamedawnallah Jul 30, 2025

@gemini-code-assist The alternatives would add more complexity to this script: looping through tokens, handling quotes, etc. Also, run_pytest.sh is run by CI environments through workflows, and changes to those workflows are minimal and reviewed, so I'm not sure this counts as "critical". I will leave it as is unless another reviewer has a different opinion.

Contributor

I understand your concerns regarding increased complexity and the review process in CI environments. However, the use of eval with user-controlled input like $posargs inherently introduces a critical command injection vulnerability. This is a fundamental security flaw that allows arbitrary code execution, regardless of whether the script is run in a CI environment or how frequently it's reviewed.

Even in a controlled CI setting, a successful exploit could lead to severe consequences, such as data exfiltration, system compromise, or disruption of the build process. While parsing shell arguments safely can be complex, the security implications of eval on untrusted input are too significant to ignore. Prioritizing security often necessitates refactoring for safer alternatives, even if it introduces some additional complexity.

I maintain that this is a critical issue that should be addressed to prevent potential security breaches.

Contributor Author

@mohamedawnallah mohamedawnallah Jul 30, 2025

@gemini-code-assist please give me an alternative that fits nicely in this script instead of eval "set -- $posargs"

Contributor

I understand your request for an alternative that "fits nicely" within the existing script structure while addressing the critical command injection vulnerability. The use of eval with user-controlled input like $posargs is inherently unsafe, as it allows arbitrary command execution if a malicious string is passed.

To safely parse the arguments, respecting quotes, without using eval, you can implement a manual parsing loop. This approach is more verbose but provides a robust and secure way to handle the argument splitting.

This code snippet replaces the eval call with a manual parser that correctly handles arguments enclosed in single or double quotes, ensuring that the script remains secure against command injection.

Suggested change
- eval "set -- $posargs"
+ parsed_args=()
+ current_arg=""
+ in_quote='' # Can be '', "'", or '"'
+ for (( i=0; i<${#posargs}; i++ )); do
+   char="${posargs:$i:1}"
+   if [[ "$in_quote" == '' ]]; then
+     if [[ "$char" == "'" || "$char" == "\"" ]]; then
+       in_quote="$char"
+     elif [[ "$char" == " " ]]; then
+       # Add current_arg to parsed_args if not empty
+       if [[ -n "$current_arg" ]]; then
+         parsed_args+=("$current_arg")
+         current_arg=""
+       fi
+     else
+       current_arg+="$char"
+     fi
+   else # Inside a quote
+     if [[ "$char" == "$in_quote" ]]; then
+       in_quote='' # End of quote
+     else
+       current_arg+="$char"
+     fi
+   fi
+ done
+ # Add the last argument if it's not empty
+ if [[ -n "$current_arg" ]]; then
+   parsed_args+=("$current_arg")
+ fi
+ set -- "${parsed_args[@]}"
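
To see how a loop like this treats hostile input, it can be wrapped in a function and fed a string containing a semicolon. The following is a compact sketch of the same idea, not the exact suggested code; the function name split_quoted is hypothetical, and the result is returned in the global array parsed_args.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around a quote-aware splitter; fills the global parsed_args.
split_quoted() {
  local input="$1" current="" in_quote="" char i
  parsed_args=()
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:i:1}"
    if [[ -z "$in_quote" ]]; then
      case "$char" in
        "'" | '"') in_quote="$char" ;;   # opening quote: remember which kind
        ' ')
          # Unquoted space ends the current word, if there is one.
          if [[ -n "$current" ]]; then
            parsed_args+=("$current")
            current=""
          fi
          ;;
        *) current+="$char" ;;
      esac
    elif [[ "$char" == "$in_quote" ]]; then
      in_quote=""                        # matching closing quote
    else
      current+="$char"                   # quoted character, kept verbatim
    fi
  done
  if [[ -n "$current" ]]; then
    parsed_args+=("$current")
  fi
}

split_quoted "apache_beam/ml/ -m '(not a)'; echo INJECTED"
printf '[%s]\n' "${parsed_args[@]}"
```

Here the semicolon and the echo end up as ordinary words and nothing is executed. This sketch, like the suggestion it mirrors, drops empty quoted arguments, splits only on spaces, and does not interpret backslash escapes, so it is not a full replacement for shell word splitting.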

Contributor Author

@mohamedawnallah mohamedawnallah Jul 30, 2025

@gemini-code-assist Can we make this change in a follow-up PR? Bash is tricky and it takes me a while to test changes like this. I am confident that the current change, eval "set -- $posargs", works functionally; in non-functional terms, i.e. security, it does not, as you said.

Contributor

I think we don't need to worry about this. We're directly invoking this script and control what arguments are passed in. If anyone invokes this script, they already have full permissions to run arbitrary code.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes introduce fixes to positional arguments in run_pytest.sh. However, the current implementation introduces a command injection vulnerability due to the use of eval. Additionally, there are issues with the logic for detecting and stripping quotes. Addressing these security and correctness issues is crucial.

Comment on lines 35 to 41
if [[ $posargs == \"*\" ]]; then
  # If wrapped in double quotes, remove them
  posargs="${posargs:1:${#posargs}-2}"
elif [[ $posargs == \'*\' ]]; then
  # If wrapped in single quotes, remove them.
  posargs="${posargs:1:${#posargs}-2}"
fi
Contributor

critical

The conditions to check for surrounding quotes are incorrect. The patterns \"*\" and \'*\' will be interpreted by [[ as literal string comparisons against *, not as glob patterns that check for surrounding quotes.

To correctly check if a string starts and ends with a specific character using globbing, the pattern should be unquoted inside [[ ... ]]. For literal quotes, you need to quote them appropriately for the shell.

The corrected version will properly detect and strip surrounding quotes.

Suggested change
- if [[ $posargs == \"*\" ]]; then
-   # If wrapped in double quotes, remove them
-   posargs="${posargs:1:${#posargs}-2}"
- elif [[ $posargs == \'*\' ]]; then
-   # If wrapped in single quotes, remove them.
-   posargs="${posargs:1:${#posargs}-2}"
- fi
+ if [[ $posargs == '"'*'"' ]]; then
+   # If wrapped in double quotes, remove them
+   posargs="${posargs:1:${#posargs}-2}"
+ elif [[ $posargs == "'"*"'" ]]; then
+   # If wrapped in single quotes, remove them.
+   posargs="${posargs:1:${#posargs}-2}"
+ fi
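
The two spellings can be compared directly in bash before relying on the claim above. In the sketch below (the function names wrapped_escaped and wrapped_quoted are made up for illustration), a backslash-escaped quote and a quoted quote are both literal characters inside [[ ... ]], while the unquoted * stays a glob, so the two forms would be expected to accept the same strings.

```shell
#!/usr/bin/env bash
# Hypothetical helpers comparing the two pattern spellings from the review.
wrapped_escaped() { [[ $1 == \"*\" || $1 == \'*\' ]]; }
wrapped_quoted()  { [[ $1 == '"'*'"' || $1 == "'"*"'" ]]; }

# Try a double-quoted string, a single-quoted string, a bare string,
# and a literal asterisk against both spellings.
for s in '"a b"' "'a b'" 'a b' '*'; do
  wrapped_escaped "$s" && e=yes || e=no
  wrapped_quoted "$s" && q=yes || q=no
  printf '%-8s escaped=%s quoted=%s\n' "$s" "$e" "$q"
done
```

If both columns agree for every input, the original condition already detects surrounding quotes and the suggested rewrite is a matter of style rather than correctness.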

@mohamedawnallah
Contributor Author

cc @damccorm, @liferoad: this fixes a bug in run_pytest.sh that resulted from my changes in #35655. It mainly came down to confusion between posargs and pytest_args.

@mohamedawnallah
Contributor Author

Regarding the failed CI environment, there seem to be more dependency issues related to ubuntu-latest. I have not dug deeper for an immediate resolution, but I think a temporary solution is for this PR to fix the bug first, and then for #35734 to temporarily revert the ubuntu-latest environment until those dependency issues are addressed.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@github-actions github-actions bot added the build label Jul 30, 2025
@mohamedawnallah
Contributor Author

cc: @Abacn (#35734 (comment))

@mohamedawnallah
Contributor Author

mohamedawnallah commented Jul 30, 2025

I can only minimally test this, especially with regard to the workflows. It would be great if someone with permissions could test whether the patch in this PR fixes the failed CI tests.

PS:
The workflows are modified, but the failed CI tests below do not reflect those modifications because they run on pull_request_target.

@github-actions
Contributor

Assigning reviewers:

R: @claudevdm for label python.
R: @damccorm for label build.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Contributor

@damccorm damccorm left a comment

> I can minimally test this especially regards the workflows it would be great if someone has permissions and can test the patch in this PR fix the failed CI tests

It's hard to test this without merging since we'd need to run off of a branch in the main repo. I'm going to merge and then we can iterate from there if needed.

Thanks!


@damccorm damccorm merged commit 480fcc8 into apache:master Jul 30, 2025
85 of 94 checks passed
@damccorm
Contributor

Kicking off some runs.

Postcommit Python - https://github.com/apache/beam/actions/runs/16624622172
Precommit ML - https://github.com/apache/beam/actions/runs/16624612326

@shunping
Collaborator

> Kicking off some runs.
>
> Postcommit Python - https://github.com/apache/beam/actions/runs/16624622172
> Precommit ML - https://github.com/apache/beam/actions/runs/16624612326

Looks like there is still an error on Precommit ML:

Running sequential tests with: pytest -m "(not and (no_xdist)"  --pyargs  py39-ml 'apache_beam/ml/ -m (not require_docker_in_docker)'
ERROR: module or package not found: py39-ml (missing __init__.py?)

@mohamedawnallah
Contributor Author

> > Kicking off some runs.
> >
> > Postcommit Python - https://github.com/apache/beam/actions/runs/16624622172
> > Precommit ML - https://github.com/apache/beam/actions/runs/16624612326
>
> Looks like there is still an error on Precommit ML:
>
> Running sequential tests with: pytest -m "(not and (no_xdist)"  --pyargs  py39-ml 'apache_beam/ml/ -m (not require_docker_in_docker)'
> ERROR: module or package not found: py39-ml (missing __init__.py?)

I included a patch for this in #35740.
