Mysql embeddings #35393

claudevdm · 2025-06-22T23:47:01Z

This PR adds a MySQL vector writer implementation following the same pattern as the Postgres vector writer.

sdks/python/apache_beam/ml/rag/ingestion/mysql.py contains the base MySQL vector writer logic, which can be used with any MySQL instance.
sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py contains MySQL dialect utilities for configuring the vector writer, including ColumnSpecs, which specify how a Chunk is mapped to a MySQL schema, and ConflictResolution, which specifies how inserts that violate uniqueness constraints are handled.
sdks/python/apache_beam/ml/rag/ingestion/cloudsql.py is a wrapper around the base MySQL vector writer that allows connecting to a CloudSQL instance via the CloudSQL socket factory.
sdks/python/apache_beam/ml/rag/ingestion/cloudsql_it_test.py contains tests that cover the base MySQL logic. I did not add a separate base MySQL test, to avoid having to spin up MySQL database containers and bloating the test infrastructure.
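
To make the conflict-handling idea above concrete, here is a minimal sketch of how a conflict setting could translate into MySQL insert statements. It is not the code from this PR: `ConflictPolicy`, `build_insert`, and the table and column names are hypothetical, and only the MySQL syntax (`INSERT IGNORE`, `ON DUPLICATE KEY UPDATE`) is standard.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ConflictPolicy:
    """Hypothetical stand-in for a conflict setting; not the class from this PR."""
    action: str  # "UPDATE" or "IGNORE"
    update_fields: Optional[Sequence[str]] = None


def build_insert(
    table: str,
    columns: Sequence[str],
    policy: Optional[ConflictPolicy] = None,
) -> str:
    """Builds a parameterized MySQL INSERT using JDBC-style '?' placeholders."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    if policy is not None and policy.action == "IGNORE":
        # INSERT IGNORE skips rows that hit a duplicate key. It also
        # downgrades some other errors to warnings, so use it with care.
        return f"INSERT IGNORE INTO {table} ({column_list}) VALUES ({placeholders})"
    sql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    if policy is None:
        # No policy: a duplicate key makes the insert fail.
        return sql
    if policy.action == "UPDATE":
        # Overwrite the chosen fields (default: all columns) with the new values.
        fields = policy.update_fields or columns
        assignments = ", ".join(f"{c} = VALUES({c})" for c in fields)
        return f"{sql} ON DUPLICATE KEY UPDATE {assignments}"
    raise ValueError(f"Unknown conflict action: {policy.action}")


if __name__ == "__main__":
    cols = ["id", "embedding", "content"]
    print(build_insert("docs", cols))
    print(build_insert("docs", cols, ConflictPolicy("IGNORE")))
    print(build_insert("docs", cols, ConflictPolicy("UPDATE", ["embedding", "content"])))
```

Note that `VALUES(col)` inside `ON DUPLICATE KEY UPDATE` is deprecated in recent MySQL 8.0 releases in favor of row aliases; the sketch uses it only because it is the most widely supported form.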

I tried to strike a balance on maintainability: the CloudSQL-specific dialects do not mix Postgres and MySQL language utilities, while the JDBC common utilities and CloudSQL common utilities that do not change depending on the dialect are shared.
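
As a rough illustration of that split (a sketch, not the code in this PR), a shared helper can build the CloudSQL JDBC URL while only the URL scheme and socket factory class vary per dialect. The function name `cloudsql_jdbc_url`, the `_DIALECTS` table, and the example instance and database names are hypothetical; the socket factory class names and the `cloudSqlInstance` / `socketFactory` parameters are the ones documented for the Cloud SQL JDBC socket factory.

```python
# Dialect-specific pieces: JDBC URL scheme and socket factory class.
_DIALECTS = {
    "mysql": ("jdbc:mysql", "com.google.cloud.sql.mysql.SocketFactory"),
    "postgres": ("jdbc:postgresql", "com.google.cloud.sql.postgres.SocketFactory"),
}


def cloudsql_jdbc_url(dialect: str, instance_connection_name: str, database: str) -> str:
    """Dialect-independent URL assembly (hypothetical helper for illustration)."""
    if dialect not in _DIALECTS:
        raise ValueError(f"Unsupported dialect: {dialect}")
    scheme, socket_factory = _DIALECTS[dialect]
    return (
        f"{scheme}:///{database}"
        f"?cloudSqlInstance={instance_connection_name}"
        f"&socketFactory={socket_factory}"
    )


if __name__ == "__main__":
    # Hypothetical instance connection name and database.
    print(cloudsql_jdbc_url("mysql", "my-project:us-central1:my-instance", "embeddings"))
    print(cloudsql_jdbc_url("postgres", "my-project:us-central1:my-instance", "embeddings"))
```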


claudevdm · 2025-06-27T19:16:23Z

R: @damccorm

Sorry for the long PR. I am happy to split it into multiple PRs, maybe along these lines?

Add default schema + tests
Add columnspecs (non default schema) + tests
Add conflict resolution + tests

github-actions · 2025-06-27T19:17:28Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

damccorm

Thanks - mostly looks good (and very in line with other pieces, which is good)

sdks/python/apache_beam/ml/rag/ingestion/mysql.py

sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py

sdks/python/apache_beam/ml/rag/ingestion/test_utils.py

claudevdm · 2025-07-01T20:00:29Z

The test failures are related to the test containers version change.

sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py

Co-authored-by: Danny McCormick <dannymccormick@google.com>

Abacn · 2025-07-11T02:25:53Z

This has increased the number of tests in https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Xlang_Gcp_Dataflow.yml?query=event%3Aschedule from 45 to 62, causing test runs to often exceed the 3h timeout.

claudevdm · 2025-07-11T16:40:48Z

https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Xlang_Gcp_Dataflow.yml?query=event%3Aschedule

@Abacn where can I see the list of tests that actually run? I set all of these new tests to be skipped on the Dataflow runner. In the publish test section, before this change I see:

67 tests found
45 skipped tests found
So that means 22 tests ran?

After this change
84 tests found
62 skipped tests found
Also 22 tests ran?

claudevdm · 2025-07-11T16:41:40Z

Actually looking back at the test history, it was timing out a lot before this change too.

github-actions bot added python java extensions labels Jun 22, 2025

claudevdm force-pushed the mysql-embeddings branch from 4b52206 to f68ee53 Compare June 23, 2025 03:02

claudevdm added 2 commits June 27, 2025 15:03

Add MySQL vector writer.

a1995e1

Trigger tests again.

a829328

claudevdm force-pushed the mysql-embeddings branch from 653675c to a829328 Compare June 27, 2025 19:14

claudevdm marked this pull request as ready for review June 27, 2025 19:16

claudevdm requested a review from damccorm June 27, 2025 19:17

damccorm reviewed Jun 27, 2025

sdks/python/apache_beam/ml/rag/ingestion/mysql.py (outdated)

sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py (outdated)

sdks/python/apache_beam/ml/rag/ingestion/test_utils.py (outdated)

claudevdm and others added 3 commits July 1, 2025 12:09

Comments.

5f602db

Fix lints etc.

566bba9

Comment.

15a4d3d

claudevdm and others added 2 commits July 2, 2025 09:52

Fix typo

ed23b53

Lint fix.

ab82c60

damccorm reviewed Jul 2, 2025

sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py (outdated)

Update sdks/python/apache_beam/ml/rag/ingestion/mysql_common.py

97b7381

Co-authored-by: Danny McCormick <dannymccormick@google.com>

damccorm approved these changes Jul 7, 2025

damccorm merged commit 28fd2b2 into apache:master Jul 7, 2025
112 of 115 checks passed

damccorm mentioned this pull request Sep 23, 2025

[Tracking]: Beam 3.0.0 - Milestone 1 Key Features #36173

Open
