
Conversation


@EtienneBerubeShopify EtienneBerubeShopify commented Oct 13, 2021

What am I trying to achieve

Currently, if the connection between LHM and the MySQL server is lost, LHM will retry, but with the closed connection. This PR fixes that and also addresses the underlying potential data-loss problems when LHM is used with ProxySQL.

Why?

A reconnection can happen whenever the connection between LHM --> MySQL server or LHM --> ProxySQL is lost. However, in some failover scenarios the writer changes, and with the right amount of replication lag this can cause data loss.

However, in cases where the connection is lost but the writer stays the same, it is perfectly safe to reconnect and retry the last SQL statement.

How

The solution works as follows: when an LHM is created, it saves the current writer's hostname (select @@global.hostname). Before every chunk insert, it checks that the hostname is still the same as the initial one.

  • If it is the same, it will continue.
  • If it is different, it will abort.
  • In the event of a connection loss, LHM will start the reconnection procedure:
    • If it can reconnect, it checks the hostname again to ensure the writer is the same. If it is, it retries the SQL statement.
    • If it can't reconnect, or if the hostname doesn't match the initial one, it aborts to avoid data loss.

Note: if the hostname is localhost, it will check the server's id instead (i.e. select @@global.server_id).

Complete algorithm:

(flowchart image)
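
A minimal sketch of the consistency check described above, for illustration only; the class and method names here are assumptions, not the exact code in lib/lhm/sql_retry.rb:

  # Illustrative sketch of the writer-consistency check (names are assumptions).
  class WriterConsistencyCheck
    def initialize(connection)
      @connection = connection
      # Remember the writer we started the migration against.
      @initial_host = current_host
    end

    # True if we are still talking to the same writer as when the LHM started.
    def same_writer?
      current_host == @initial_host
    end

    private

    # Identify the writer by hostname; fall back to the server id when the
    # server reports itself as "localhost" (e.g. CloudSQL).
    def current_host
      host = @connection.select_value("SELECT @@global.hostname")
      return host unless host == "localhost"
      @connection.select_value("SELECT @@global.server_id")
    end
  end

Before each chunk insert (and again after a reconnect), same_writer? would be consulted; a false result aborts the migration.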

What changed?

Most of the new logic resides in lib/lhm/sql_retry.rb. There are changes in other files such as chunker.rb and chunk_insert.rb, but they don't add any features. However, since the new logic changes the retry behaviour quite drastically, a lot of the tests had to be adapted, which means many test files were modified in some way.

What to check?

  • Code cleanliness
  • Possible edge cases


xliang6 commented Oct 13, 2021

Great work on putting this PR together! 🎉 Thanks for being resourceful and talking with other teams and working on different components to fix the LHM reconnect issue!

A reconnection can happen whenever the connection between LHM --> MySQL server or LHM --> ProxySQL is lost.
To be precise, the reconnect happens when there is a connection error between LHM -> ProxySQL or ProxySQL -> MySQL. Please refer to the original issues in your PR description so the full context is not lost: https://github.com/Shopify/db-engineering/issues/90 and https://github.com/Shopify/db-engineering/issues/98.

One concern I have is that since we are making these changes in the Shopify LHM gem, what will happen if we get reconnected during an LHM for a CloudSQL app that uses ProxySQL?


EtienneBerubeShopify commented Oct 13, 2021

One concern I have is that since we are making these changes in the Shopify LHM gem, what will happen if we get reconnected during an LHM for a CloudSQL app that uses ProxySQL?

Great question! If we query select @@global.hostname on a CloudSQL instance we get localhost, which is the reason I added the server_id check in the first place. So it should work, but CloudSQL could fail over on Google's side without LHM realizing it, and if that happens I'm unsure how to catch it...


xliang6 commented Oct 13, 2021

Have we enabled the ProxySQL dedicated hostgroup feature for CloudSQL apps? One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?


EtienneBerubeShopify commented Oct 13, 2021

Have we enabled the ProxySQL dedicated hostgroup feature for CloudSQL apps?

I don't know what the topology looks like when it comes to CloudSQL. Some simple changes to the ProxySQL config would make the feature work. I remember people from connection management talking about it, but they weren't sure whether it was going to be added.

One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?

That would be a good idea for a future PR. A fairly simple way of approaching it would be to add a feature flag that could be enabled or disabled depending on the app.
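
For illustration, using the reconnect_with_consistent_host option added in this PR, such a flag could be wired per app roughly like this (the predicate name is hypothetical):

  # Hypothetical per-app wiring: only enable the consistent-host reconnect for
  # apps whose ProxySQL topology is known (e.g. KateSQL/core), and leave it off
  # for CloudSQL until the behaviour is verified there.
  connection.options = { reconnect_with_consistent_host: proxysql_backed_app? }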


xliang6 commented Oct 14, 2021

One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?

That would be a good idea for a future PR. A fairly simple way of approaching it would be to add a feature flag that could be enabled or disabled depending on the app.

With this PR, do we abort an LHM if it encounters a connection error for a CloudSQL app? If so, I'm okay with delaying this check till later.

@EtienneBerubeShopify

With this PR, do we abort an LHM if it encounters a connection error for a CloudSQL app? If so, I'm okay with delaying this check till later.

Not for now: doing select @@global.hostname on CloudSQL will return "localhost", so instead we check whether the server_id matches. I'm unsure how reliable this is, since Google could fail over CloudSQL instances on their own without it being a problem for us, which would change the server id and cause an "Abort".

@shuhaowu

I'm unsure how reliable this can be since Google could failover CloudSQLs on their own without it being a problem for us, which would change the server id, causing an "Abort".

Conversely, what if they re-use server id like they reuse hostname?


EtienneBerubeShopify commented Oct 19, 2021

Conversely, what if they re-use server id like they reuse hostname?

This is a VERY good question. I did run a few failovers myself with a test app and found that they change servers for every failover (i.e. they don't "dedicate" a MySQL server to your app). So the server id did change, which means we can catch "Google-level" failovers, but I didn't find anything online assessing the true uniqueness of server_ids. MySQL does state that the number must be between 1 and 2^32, which is big but not insanely big.

However, the goal is to catch and abort on a server_id change. Whenever a failover happens the server_id will change, since I don't think they would fail over to the same instance. So even if they do re-use a server_id that was previously used somewhere else, LHM should still catch the difference and abort.

EDIT:

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

@tiwilliam

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

Have we looked into using server_uuid instead of hostname and server_id?

sys.admin_ro.1@mysql> SELECT @@server_uuid;
+--------------------------------------+
| @@server_uuid                        |
+--------------------------------------+
| 941e5058-30b6-11ec-9f0a-ee57386cefaa |
+--------------------------------------+
1 row in set (0.00 sec)

Beginning with MySQL 5.6, the server generates a true UUID in addition to the server_id value supplied by the user. This is available as the global, read-only server_uuid system variable.

https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid
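
If server_uuid were adopted, the identity lookup could be swapped along these lines (a sketch under that assumption, not code from this PR):

  # Sketch: use the replication-unique server_uuid (MySQL >= 5.6) as the writer
  # identity, while keeping the hostname around for human-readable logging.
  def writer_identity(connection)
    {
      uuid: connection.select_value("SELECT @@global.server_uuid"),
      hostname: connection.select_value("SELECT @@global.hostname")
    }
  end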


tiwilliam commented Oct 20, 2021

Since this work is attempting to make LHM ProxySQL-aware, I think it should also be aware of whether it's actually connected to a ProxySQL. This could help clean up the code a bit and avoid the server ID check when we are connected to MySQL directly. I had a read of the ProxySQL code and they do hijack a global to expose this:

> select @@version_comment limit 1;
+-------------------+
| @@version_comment |
+-------------------+
| (ProxySQL)        |
+-------------------+
1 row in set (0.000 sec)

Notice that this is case-sensitive though; the uppercase form is passed through to the backend MySQL server instead:

> SELECT @@version_comment LIMIT 1;
+--------------------------------------------------------+
| @@version_comment                                      |
+--------------------------------------------------------+
| Percona Server (GPL), Release '21', Revision '2a37e4e' |
+--------------------------------------------------------+
1 row in set (0.032 sec)
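
A sketch of what such a detection could look like (an assumed helper, not part of this PR); note that the query text must stay lowercase for ProxySQL to intercept it:

  # Hypothetical check: ProxySQL hijacks the lowercase, case-sensitive
  # "select @@version_comment limit 1" query and answers "(ProxySQL)", while a
  # direct MySQL connection returns the real server's version comment.
  def connected_to_proxysql?(connection)
    comment = connection.select_value("select @@version_comment limit 1")
    comment.to_s.include?("ProxySQL")
  end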

@EtienneBerubeShopify

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

Have we looked into using server_uuid instead of hostname and server_id?

sys.admin_ro.1@mysql> SELECT @@server_uuid;
+--------------------------------------+
| @@server_uuid                        |
+--------------------------------------+
| 941e5058-30b6-11ec-9f0a-ee57386cefaa |
+--------------------------------------+
1 row in set (0.00 sec)

Beginning with MySQL 5.6, the server generates a true UUID in addition to the server_id value supplied by the user. This is available as the global, read-only server_uuid system variable.

https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid

Hmm, I REALLY like the idea of uniqueness. I think I will add server_uuid after I check with CloudSQL that it works as intended. I'll probably keep the hostname as well for logging, since it's much more readable than a UUID when debugging what happened (or on which host the migration started).

@EtienneBerubeShopify EtienneBerubeShopify force-pushed the etienne-feature-proxisql-retry branch 5 times, most recently from 670903d to f0020bb Compare November 8, 2021 20:41

@shuhaowu shuhaowu left a comment


Usually I would provide a summary of my review, but unfortunately this one is too big for me to read through in one sitting, so I don't remember all the details. That said, there are a few points that stood out to me:

  1. There's no action item associated with this for now, but it needs to be mentioned: the way the code is structured now is a SQL retry block nested within a SQL retry block, and the control flow is based on exceptions raised from those blocks. As a result, the code weaves in and out of multiple files, making it relatively difficult to understand. The good part is that at least we have reasonably descriptive comments to help the reader out. The root cause is that the retriable library is too simplistic and we've outgrown it; I believe the code would be significantly simpler if we just wrote our own retry routines. That said, doing that now would be an even bigger rabbit hole, so let's not worry about it for now.
  2. Downstream apps already use ActiveRecord::QueryLogs to set the maintenance:lhm marginalia comments. I see we added our own way to tag the queries. How do these interact?

I think I have a number of other questions in the comments. For those, I don't quite know the answers as I didn't have enough time to fully think them through. As long as you've thought them through and can answer them, I think those can be resolved.

I also have some suggestions about renaming and whatnot. Up to you whether you want to take them.

For the next steps of this PR: I would like to merge it ASAP if possible. I'll take a look at this PR hopefully one more time after all the comments are addressed. Hopefully that will be good enough to merge and we can continue working in smaller changes.

assert logs.last.include?("Lost connection to MySQL server at 'reading initial communication packet")
end

it "Will abort if failover happens (mimicked with proxySQL)" do


How is this test different from "will abort if new writer is not same host"? The other test appears to use mocks to change the host, but this one does it for real? Isn't this test better than the other test and therefore the other one doesn't need to be there?


They are similar, yes; they take two approaches to ensuring that the code behaves as intended. I'm not against having more tests rather than fewer, but I can understand that it could be seen as boilerplate and unnecessary. I kept it thinking it wouldn't cause any harm, but it can be removed. What do you think?


Keeping is fine.

@EtienneBerubeShopify EtienneBerubeShopify force-pushed the etienne-feature-proxisql-retry branch from 0ce11af to b3da1ea Compare November 11, 2021 19:39

@shuhaowu shuhaowu left a comment


Let's NOT merge this for now due to the infrastructure freeze, as I want to keep the master branch at the latest released version in case we have to do any emergency fixes. If we merge this to master, doing emergency fixes might be more challenging.

raise 'Please call Lhm.setup' unless defined?(ActiveRecord)
@@connection = Connection.new(connection: ActiveRecord::Base.connection, options: connection_options || {})
else
@@connection.options = connection_options unless connection_options.nil?

@shuhaowu shuhaowu Nov 16, 2021


Wait, I thought the idea of this method was to just return @@connection if it exists and error out if it doesn't? This kind of overwrite feels like it might be fragile.


This is what I thought too, but there was an assignment in there already. I didn't want to break the developer-facing interface, so I kept it as is.

Original code:

  def connection
    @@connection ||=
      begin
        raise 'Please call Lhm.setup' unless defined?(ActiveRecord)
        ActiveRecord::Base.connection
      end
  end

The decision came from this thread that we had. [source]

Comment on lines +25 to +27
def options=(options)
# If any other flags are added. Add the "processing" here
@sql_retry.reconnect_with_consistent_host = options[:reconnect_with_consistent_host] || false


So if we do connection.options = options, it will only update a single member variable in @sql_retry? Why not just do connection.reconnect_with_consistent_host = true/false instead of using an options hash?


@EtienneBerubeShopify EtienneBerubeShopify Nov 16, 2021


I did it this way because options could contain more than one value. I also added it so that there could be more feature flags in the future.

Moreover, I thought that the "logic" (i.e. extracting the desired option) made more sense in the connection class than in Lhm.setup.

However, if you don't see any new flags coming in the future, I can always make it more explicit.

Comment on lines +38 to +67
def update(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:update, query, retry_options)
else
exec(:update, query)
end
end

def select_value(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_value, query, retry_options)
else
exec(:select_value, query)
end
end

def select_values(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_values, query, retry_options)
else
exec(:select_values, query)
end
end

def select_one(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_one, query, retry_options)
else
exec(:select_one, query)
end


Fine for now; it occurred to me that AR allows more than just query strings to be passed in here (and possibly also bind arguments?), not that we use this feature...


I see what you mean. Right now we don't accept AR calls from "outside" this gem. So for now this is solid.

private

def exec(method, sql)
connection.public_send(method, Lhm::ProxySQLHelper.tagged(sql))


For the QueryLog, what I mean is if that is set, will the query be double tagged? Is that OK?

For method_missing, we don't need it now, but as we override more and more methods, we might consider using it. Let's defer that for the future though, as maybe at some point we can switch over to mysql2 instead of the AR connection.

@EtienneBerubeShopify

For the QueryLog, what I mean is if that is set, will the query be double tagged? Is that OK?
For method_missing, we don't need it now, but as we override more and more methods, we might consider using it. Let's defer that for the future though, as maybe at some point we can switch over to mysql2 instead of the AR connection.

For some reason, I could not reply. So yes, as soon as this is merged, I'll open a PR in core to remove the QueryLogs tags (since the tagging is now handled here). As for a potential switch, I agree it would be ideal to have something a bit more complete. However, we'll still have to find a way to determine which delegated methods require tagging. For instance, if we receive a call like #current_database(database_name), blindly tagging the argument will cause problems. So we'll need a list somewhere of which methods can/cannot be tagged.

Also, the only queries that REALLY require the tags are the inserts. Normal queries can be performed on a read-only replica without any issues, so there's a pretty major subset of queries that could be left untagged without DRAMATIC consequences. But this can be addressed when we face that problem.
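
For context, a sketch of what a tagging helper along these lines could look like; the annotation string and its placement are assumptions based on the maintenance:lhm marginalia comment mentioned earlier, and the actual Lhm::ProxySQLHelper implementation is not shown in this thread:

  # Hypothetical tagging helper: appends an annotation comment that a ProxySQL
  # query rule can match on to route the statement to the writer hostgroup.
  module ProxySQLTagger
    ANNOTATION = "/*maintenance:lhm*/" # assumption, not necessarily the real tag

    def self.tagged(sql)
      "#{sql} #{ANNOTATION}"
    end
  end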
