Contributor

@Joffreybvn Joffreybvn commented Jul 1, 2023

Hello,

Using the ImpalaHook with kerberos ("auth_mechanism": "GSSAPI") fails with the following error:

[2023-06-26, 11:00:42 UTC] {base.py:73} INFO - Using connection ID 'impala_conn' for task execution.
[2023-06-26, 11:00:42 UTC] {warnings.py:109} WARNING - /opt/app-root/lib64/python3.9/site-packages/puresasl/client.py:215: SASLWarning: kerberos module not installed, GSSAPI will be ignored
  warn('kerberos module not installed, {0} will be ignored'.format(
[2023-06-26, 11:00:42 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.9/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/opt/app-root/lib64/python3.9/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/impala_hook_connection_test.py", line 19, in sample_select
    connection = impala_hook.get_conn()
  File "/opt/app-root/lib64/python3.9/site-packages/airflow/providers/apache/impala/hooks/impala.py", line 35, in get_conn
    return connect(
  File "/opt/app-root/lib64/python3.9/site-packages/impala/dbapi.py", line 194, in connect
    service = hs2.connect(host=host, port=port,
  File "/opt/app-root/lib64/python3.9/site-packages/impala/hiveserver2.py", line 865, in connect
    transport.open()
  File "/opt/app-root/lib64/python3.9/site-packages/thrift_sasl/__init__.py", line 84, in open
    raise TTransportException(type=TTransportException.NOT_OPEN,
thrift.transport.TTransport.TTransportException: Could not start SASL: None of the mechanisms listed meet all required properties

Solution

The kerberos module, an optional dependency of impyla, is not bundled with Airflow. (If I'm not mistaken, only requests-kerberos, for hdfs, and pykerberos come with Airflow by default.)

This PR adds kerberos as a dependency, which fixes the above error.
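
To illustrate the failure mode, here is a minimal sketch of a pre-flight check that could run before opening the connection. It only tests whether a module named kerberos is importable, which is what the SASLWarning in the log complains about. The function names `gssapi_module_available` and `check_auth_mechanism` are hypothetical helpers, not part of Airflow, impyla, or puresasl.

```python
import importlib.util


def gssapi_module_available(module_name: str = "kerberos") -> bool:
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


def check_auth_mechanism(auth_mechanism: str, module_name: str = "kerberos") -> None:
    """Fail early with a readable message instead of the opaque SASL error.

    Hypothetical helper: only the GSSAPI mechanism is checked here.
    """
    if auth_mechanism.upper() == "GSSAPI" and not gssapi_module_available(module_name):
        raise RuntimeError(
            f"auth_mechanism is GSSAPI but the '{module_name}' module is not installed"
        )
```

Calling `check_auth_mechanism("GSSAPI")` in the task before `impala_hook.get_conn()` would surface the missing package directly, rather than through the TTransportException.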

About license

The package is under the Apache 2.0 license (see the GitHub repo and PyPI).

Member

Do we need this here? Can users not install this additionally in their setup if they need it? For users who do not want Kerberos authentication and use other mechanisms like LDAP, we would be installing an additional dependency which may not be needed, no?

Contributor Author

Yes, indeed, this will bother users who do not need it. My reasoning was the following: for Hadoop, the provider is shipped with kerberos. Thus, to stay consistent, for Impala (which is set up on top of a Hadoop system), it makes sense to have it bundled too.

I can propose a PR to add kerberos as an optional dependency to hdfs and impala?

Member

okay. @eladkal could you please help us here on what could be the appropriate way?

Member

@potiuk potiuk Jul 3, 2023

See example in "amazon/provider.yaml" (additional-extras)

additional-extras:
  - name: pandas
    dependencies:
      - pandas>=0.17.1
  # There is conflict between boto3 and aiobotocore dependency botocore.
  # TODO: We can remove it once boto3 and aiobotocore both have compatible botocore version or
  # boto3 have native async support and we move away from aio aiobotocore
  - name: aiobotocore
    dependencies:
      - aiobotocore[boto3]>=2.2.0
  - name: cncf.kubernetes
    dependencies:
      - apache-airflow-providers-cncf-kubernetes>=7.2.0

Contributor Author

@Joffreybvn Joffreybvn Jul 3, 2023

Edited the PR to make it optional. I'll open a PR to update the docs on airflow-website asap.
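
As a rough sketch of what "optional" means here, by analogy with the additional-extras example from amazon/provider.yaml: each entry pairs an extra name with its requirements, and users opt in by installing the package with that extra. The structure below is the Python form a YAML loader would produce. The extra name `kerberos`, the unpinned requirement, and the helper names are assumptions for illustration, not copied from the merged change.

```python
# Hypothetical "additional-extras" entry for the Impala provider.
# Assumption: the extra is called "kerberos" and has no version pin.
impala_additional_extras = [
    {"name": "kerberos", "dependencies": ["kerberos"]},
]


def to_extras_require(additional_extras):
    """Map each extra's name to its list of requirement strings."""
    return {extra["name"]: list(extra["dependencies"]) for extra in additional_extras}


def install_target(package, extra):
    """Build the 'package[extra]' string that pip accepts for optional extras."""
    return f"{package}[{extra}]"


extras_require = to_extras_require(impala_additional_extras)
target = install_target("apache-airflow-providers-apache-impala", "kerberos")
```

Under these assumptions, users who need GSSAPI would pass the `target` string to pip install, while everyone else gets the provider without the kerberos package.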

@Joffreybvn Joffreybvn force-pushed the fix/add-kerberos-impala branch 2 times, most recently from 104b8f4 to 6413aff Compare July 3, 2023 19:40
@Joffreybvn Joffreybvn force-pushed the fix/add-kerberos-impala branch from 6413aff to a5c6678 Compare July 4, 2023 17:48
@Joffreybvn Joffreybvn requested a review from pankajkoti July 4, 2023 17:49
@potiuk potiuk merged commit bc3b2d1 into apache:main Jul 4, 2023