Cherrypick python dependencies from GP to Cloudberry #1067

Closed
tuhaihe wants to merge 7 commits into apache:main from tuhaihe:cherrypick-python-ext

Conversation

@tuhaihe
Member

@tuhaihe tuhaihe commented Apr 27, 2025

greenplum-db/gpdb-archive@6f9d85b
greenplum-db/gpdb-archive@bd54207
greenplum-db/gpdb-archive@52c7e0a95c
greenplum-db/gpdb-archive@6592485
greenplum-db/gpdb-archive@b5920e061b
greenplum-db/gpdb-archive@0d1e4d644e
greenplum-db/gpdb-archive@411fd01083

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


zhrt123 and others added 2 commits April 27, 2025 11:27
Remove the following packages from gpdb7:
- psutil and pyyaml: Use the corresponding packages from distro's repo.
- pygresql: pygresql is replaced by psycopg2 which will be installed from
  distro's repo.

Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Co-authored-by: Xing Guo <admin@higuoxing.com>
Because Python versions vary so widely across distros, we plan to release
gpdb7 with:

- The distro's default python version (3.6 in el8) for the management
  utilities.
- A newer python version (3.9) for plpython, so that users can benefit
  from the active python community.

To do that:

- Use `python3` instead of `python` in the build/test scripts, since the
  `python` executable is not guaranteed to exist after installing a
  python3.x package.
- Remove mock 1.0.1. mock is already specified in
  `gpMgmt/requirements-dev.txt`, so there is no need to keep an egg file
  in the repo. In addition, mock 1.0.1 fails unit tests with python3.6:

```
======================================================================
ERROR: Test Suite Name|commands.test.unit.test_unit_gp|Test Case Name|test_is_gprecoverseg_running_succeeds|Test Details|
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/build/94afc7d4/gpdb_src/gpMgmt/bin/pythonSrc/ext/install/lib/python3.6/site-packages/mock-1.0.1-py3.6.egg/mock.py", line 1201, in patched
    return func(*args, **keywargs)
  File "/tmp/build/94afc7d4/gpdb_src/gpMgmt/bin/gppylib/commands/test/unit/test_unit_gp.py", line 191, in test_is_gprecoverseg_running_succeeds
    result = is_gprecoverseg_running()
  File "/tmp/build/94afc7d4/gpdb_src/gpMgmt/bin/gppylib/commands/gp.py", line 1635, in is_gprecoverseg_running
    gprecoverseg_pidfile = os.path.join(get_coordinatordatadir(), 'gprecoverseg.lock', 'PID')
  File "/usr/lib64/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not MagicMock
```
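
The traceback above shows a mock object reaching `os.path.join`, which only accepts str, bytes or path-like arguments. As a minimal sketch (not the upstream test), one way to keep a unit test independent of how a particular mock library handles `os.fspath()` is to give the stand-in an explicit string return value. The helper `gprecoverseg_pidfile`, its `get_datadir` parameter and the path `/data/coordinator` are hypothetical and exist only for illustration:

```python
import os
from unittest import mock


def gprecoverseg_pidfile(get_datadir):
    """Build the PID-file path with the same os.path.join shape as in the traceback.

    `get_datadir` is a hypothetical injection point standing in for
    get_coordinatordatadir(); the real code calls that function directly.
    """
    return os.path.join(get_datadir(), "gprecoverseg.lock", "PID")


# An explicit string return value guarantees os.path.join receives a str,
# whatever the mock library would otherwise hand back.
fake_datadir = mock.Mock(return_value="/data/coordinator")
path = gprecoverseg_pidfile(fake_datadir)
print(path)
```

With the return value pinned, the path computation no longer depends on which mock implementation or Python version runs the test.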
tuhaihe added a commit to tuhaihe/cloudberry-devops-release that referenced this pull request Apr 27, 2025
apache/cloudberry#1067 cherry-picks the Python dependency changes from
Greenplum to Cloudberry.

So we need to update the configure script and install the Python
dependencies from the distro's repos.
@tuhaihe
Member Author

tuhaihe commented Apr 27, 2025

This PR relies on apache/cloudberry-devops-release#17. I will test locally to verify that it works.

@tuhaihe tuhaihe marked this pull request as draft April 27, 2025 07:26
zhrt123 and others added 5 commits April 27, 2025 16:37
We're going to replace pygresql with psycopg2 in Greenplum, since it's
actively maintained and offered by many system package managers, which
eases our pain in packaging python packages. Besides, psycopg2 provides
the real status message returned by the server, so we don't need to fake
it ourselves.

Co-Authored-By: Hao Zhang <hzhang2@vmware.com>
Co-Authored-By: Hao Zhang <zhrt1446384557@gmail.com>
Co-authored-by: Yongtao Huang <yongtaoh@vmware.com>
Co-authored-by: Xing Guo <higuoxing@gmail.com>
This is the last patch for replacing pygresql with psycopg2 in Greenplum. This
patch mainly targets gpload.

Benefits of replacing pygresql with psycopg2:
- Psycopg2 is actively maintained, whereas with pygresql we have encountered
  bugs that haven't been fixed upstream yet, e.g., https://github.com/greenplum-db/gpdb/pull/13953.
- Psycopg2 is provided by Rocky Linux and Ubuntu. That is to say, we don't
  need to vendor it ourselves.
- We can possibly remove `PYTHONPATH` from `greenplum_path.sh`, which is good
  for users because they no longer need to worry about their Python
  environment being overwritten by Greenplum.

Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Co-authored-by: Xiaoxiao He <hxiaoxiao@vmware.com>
This is the last patch for replacing pygresql with psycopg2 in Greenplum. This patch mainly targets the gpMgmt tools.

Benefits of replacing pygresql with psycopg2:
- Psycopg2 is actively maintained, whereas with pygresql we have encountered bugs that haven't been fixed upstream yet, e.g., https://github.com/greenplum-db/gpdb/pull/13953.
- Psycopg2 is provided by Rocky Linux and Ubuntu. That is to say, we don't need to vendor it ourselves.
- Last but not least, we got a chance to clean up legacy code during the removal process, e.g., https://github.com/greenplum-db/gpdb/pull/15983.

After this patch, we need to do the following things.
- Add psycopg2 as a dependency of the rpm/deb package.
- Remove the pygresql source code tarball from the gpdb repo.
- Tidy up READMEs and requirements.txt files.

---------

Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Co-authored-by: Xiaoxiao He <hxiaoxiao@vmware.com>
Co-authored-by: zhrt123 <hzhang2@vmware.com>
Co-authored-by: Piyush Chandwadkar <pchandwadkar@vmware.com>
Co-authored-by: Praveen Kumar <36772398+kpraveen457@users.noreply.github.com>
psycopg2's `getquoted()` API returns a `latin-1` encoded byte string by default, which causes unexpected failures in some gpMgmt tools, including gpload, analyzedb and minirepro. This patch fixes it by teaching psycopg2's `QuotedString` adapter to use `utf-8` encoding.

Reproducing steps for `analyzedb`:

Create a table with a non-ASCII name.

```sql
postgres=# create table spiegelungssätze(i int);
```

Run `analyzedb` against the postgres db.

```bash
$ analyzedb -d postgres
```

Backtrace:

```
➜ analyzedb -d postgres
20230910:21:51:11:689552 analyzedb:laptop:v-[INFO]:-Starting analyzedb with args: -d postgres
20230910:21:51:11:689552 analyzedb:laptop:v-[INFO]:-Getting and verifying input tables...
20230910:21:51:11:689552 analyzedb:laptop:v-[INFO]:-Checking for tables with stale stats...
20230910:21:51:11:689552 analyzedb:laptop:v-[ERROR]:-'utf-8' codec can't decode byte 0xe4 in position 21: invalid continuation byte
Traceback (most recent call last):
  File "/home/v/.local/gpdb7/bin/analyzedb", line 376, in execute
    heap_partitions = get_heap_tables_set(self.conn, input_tables_set)  # set((schema1,table1), ...])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/.local/gpdb7/bin/analyzedb", line 1004, in get_heap_tables_set
    oid_str = get_oid_str(input_tables_set)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/.local/gpdb7/bin/analyzedb", line 976, in get_oid_str
    return ','.join(map((lambda x: regclass_schema_tbl(x[0], x[1])), table_list))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/.local/gpdb7/bin/analyzedb", line 976, in <lambda>
    return ','.join(map((lambda x: regclass_schema_tbl(x[0], x[1])), table_list))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/.local/gpdb7/bin/analyzedb", line 984, in regclass_schema_tbl
    return "to_regclass('%s')" % (escape_string(schema_tbl))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/.local/gpdb7/lib/python/gppylib/utils.py", line 515, in escape_string
    return adapted.getquoted().decode()[1:-1]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 21: invalid continuation byte
20230910:21:51:11:689552 analyzedb:laptop:v-[CRITICAL]:-analyzedb failed. (Reason=''utf-8' codec can't decode byte 0xe4 in position 21: invalid continuation byte') exiting...
```
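
The failure above comes down to an encoding mismatch: the quoted literal is encoded as `latin-1` but decoded as `utf-8`. The following is a stdlib-only sketch of that mismatch and needs neither psycopg2 nor a database. `quote_literal`, `escape_string` and `try_escape` are hypothetical stand-ins that mimic the shape of the `getquoted().decode()[1:-1]` call in the traceback; they are not the gppylib or psycopg2 implementation:

```python
TABLE_NAME = "spiegelungssätze"


def quote_literal(value: str, encoding: str) -> bytes:
    """Hypothetical stand-in for an adapter's getquoted(): quote, then encode."""
    escaped = value.replace("'", "''")
    return ("'" + escaped + "'").encode(encoding)


def escape_string(value: str, encoding: str) -> str:
    """Decode with the default codec (UTF-8) and strip the surrounding quotes."""
    return quote_literal(value, encoding).decode()[1:-1]


def try_escape(value: str, encoding: str) -> str:
    """Return the escaped string, or a short description of the decode error."""
    try:
        return escape_string(value, encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


# 'ä' is the single byte 0xe4 in latin-1, which is not valid UTF-8 here.
print(try_escape(TABLE_NAME, "latin-1"))
# Encoding and decoding with the same codec round-trips cleanly.
print(try_escape(TABLE_NAME, "utf-8"))
```

Encoding and decoding with the same codec is the property the patch restores by having the adapter use `utf-8`.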
An error occurs when running the command `gpcheckcat -C pg_class`.
The reported error is "[ERROR] executing: Cross consistency check for pg_class\n  Execution error: name 'db' is not defined".
This is caused by the use of an undefined variable 'db'.
This commit fixes the issue by removing that usage.

Authored-by: vrhappy <songlong88@126.com>
@tuhaihe tuhaihe mentioned this pull request Apr 28, 2025
12 tasks
@tuhaihe tuhaihe closed this Jul 17, 2025

5 participants