
Conversation


@mzhang-code mzhang-code commented Apr 18, 2020

Use PYSPARK_PYTHON or 'python' to run find_spark_home.py

What changes were proposed in this pull request?

Use PYSPARK_PYTHON or python to run find_spark_home.py instead of PYSPARK_DRIVER_PYTHON, because PYSPARK_DRIVER_PYTHON can be ipython, and ipython prepends invisible terminal escape sequences to the printed Spark home path.

Why are the changes needed?

I'm trying to launch the pyspark shell with the IPython interface via

PYSPARK_DRIVER_PYTHON=ipython pyspark

However, it fails with the error .../pyspark/bin/load-spark-env.sh: No such file or directory:

$ PYSPARK_DRIVER_PYTHON=ipython pyspark
/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 24: /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: No such file or directory
/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit: No such file or directory
/Users/mengyu/opt/anaconda2/envs/py3-spark/bin/pyspark: line 77: exec: /Users/mengyu/workspace/tmp//Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/spark-submit: cannot execute: No such file or directory

It is strange because the path /Users/mengyu/opt/anaconda2/envs/py3-spark/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh exists.

Then I found it is because the ipython interpreter adds invisible terminal escape sequences to the start of its stdout output:

$ ipy_output=$(ipython -c "print('/Users')")
$ echo $ipy_output 
/Users

$ ls $ipy_output
ls: \033[22;0t\033]0;IPython:: No such file or directory
ls: libs/spark\a/Users: No such file or directory

$ echo $ipy_output | cat -v
^[[22;0t^[]0;IPython: libs/spark^G/Users

Compare with the output from python:

$ py_output=$(python -c "print('/Users')")
$ echo $py_output | cat -v
/Users

A workaround is to use ipython's --no-term-title option. But I think it is better not to use ipython to run find_spark_home at all, because ipython is more of a frontend. Besides, with this fix we can open a SparkSession-enabled Jupyter notebook session via

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook  pyspark

Does this PR introduce any user-facing change?

How was this patch tested?

Tested by manually running pyspark and PYSPARK_DRIVER_PYTHON=ipython pyspark.

@AmplabJenkins

Can one of the admins verify this patch?

@mzhang-code mzhang-code changed the title [SPARK-31483][PySpark] Update find-spark-home [SPARK-31483][PySpark] Use SPARK_PYTHON or 'python' to run find_spark_home.py Apr 18, 2020
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
- export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
+ export SPARK_HOME=$(${PYSPARK_PYTHON:-"python"} "$FIND_SPARK_HOME_PYTHON_SCRIPT")
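For reference, ${PYSPARK_PYTHON:-"python"} in the added line is standard POSIX parameter expansion: it uses PYSPARK_PYTHON when that variable is set and non-empty, and falls back to python otherwise. A minimal sketch of the behavior:

```shell
#!/bin/sh
# POSIX ${VAR:-default} expansion: use $VAR if set and non-empty, else the default.
unset PYSPARK_PYTHON
echo "${PYSPARK_PYTHON:-python}"    # prints: python

PYSPARK_PYTHON=python3
echo "${PYSPARK_PYTHON:-python}"    # prints: python3
```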
Member

Hmmm .. can we strip the non-printable characters instead?
Respecting PYSPARK_DRIVER_PYTHON falling back to PYSPARK_PYTHON is expected.
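One possible way to strip them (a sketch I'm suggesting here, not code from this PR) is to remove CSI and OSC escape sequences from the captured output before using it as a path:

```shell
#!/bin/sh
# Sketch: strip terminal escape sequences from captured interpreter output.
# `raw` simulates what ipython prepends: a CSI sequence (ESC [ ... letter)
# and an OSC title sequence (ESC ] ... BEL) ahead of the real path.
raw="$(printf '\033[22;0t\033]0;IPython: libs/spark\007/Users')"
esc="$(printf '\033')"
bel="$(printf '\007')"
clean="$(printf '%s' "$raw" \
  | sed -e "s/$esc\[[0-9;]*[a-zA-Z]//g" -e "s/$esc][^$bel]*$bel//g")"
printf '%s\n' "$clean"    # prints: /Users
```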

Member

At least I can come up with one way, although it's hacky, e.g.:

a=$(ipython -c "import sys; print('/User', file=sys.stderr)" 2>&1 >/dev/null)
ls $a
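For readers puzzled by the redirection order in that hack: 2>&1 >/dev/null first duplicates fd 2 onto the current stdout (the command-substitution pipe) and only then discards fd 1, so the substitution captures stderr alone. A small self-contained demo of this generic shell behavior:

```shell
#!/bin/sh
# `2>&1 >/dev/null`: redirections apply left to right, so stderr is pointed
# at the command substitution's pipe before stdout is sent to /dev/null.
out="$(sh -c 'echo noise-on-stdout; echo /User >&2' 2>&1 >/dev/null)"
printf '%s\n' "$out"    # prints: /User
```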

Author

That's a workaround for ipython, but not for jupyter, because jupyter doesn't support running the script as jupyter find_spark_home.py. I think PYSPARK_DRIVER_PYTHON is meant more as a "frontend" setting. This fix also enables PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark.

Author

Btw, PYSPARK_DRIVER_PYTHON's fallback to PYSPARK_PYTHON also happens after find-spark-home runs:

PYSPARK_DRIVER_PYTHON=$PYSPARK_PYTHON

Member

@mzhang-code, can we just add a bandaid fix like: if PYSPARK_DRIVER_PYTHON ends with jupyter or ipython, use PYSPARK_PYTHON or python for now, with a comment explaining why we're using PYSPARK_PYTHON instead of PYSPARK_DRIVER_PYTHON?
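That bandaid could look roughly like this (a hypothetical sketch; the helper name and structure are mine, not from the PR):

```shell
#!/bin/sh
# Hypothetical sketch of the suggested bandaid: if the driver binary looks
# like a frontend (ipython/jupyter), fall back to PYSPARK_PYTHON (or plain
# python) when running find_spark_home.py, because frontends may prepend
# terminal escape sequences to stdout.
pick_find_spark_home_python() {
  # $1 = PYSPARK_DRIVER_PYTHON, $2 = PYSPARK_PYTHON
  driver="${1:-python}"
  case "$driver" in
    *ipython|*jupyter) printf '%s\n' "${2:-python}" ;;
    *)                 printf '%s\n' "$driver" ;;
  esac
}

pick_find_spark_home_python ipython ""        # prints: python
pick_find_spark_home_python jupyter python3   # prints: python3
pick_find_spark_home_python python3.7 ""      # prints: python3.7
```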

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
