
Conversation


@EtienneBerubeShopify EtienneBerubeShopify commented Oct 13, 2021

What am I trying to achieve

Currently, if the connection between LHM and the MySQL server is lost, LHM will retry, but with the closed connection. This PR fixes that and also addresses the underlying potential data-loss problems when LHM is used with ProxySQL.

Why?

A reconnection can happen whenever the connection between LHM --> MySQL server or LHM --> ProxySQL is lost. However, in some failover scenarios the writer changes, and with the right amount of replication lag this can cause data loss.

However, in cases where the connection is lost but the writer stays the same, it is perfectly safe to reconnect and retry the last SQL statement.

How

The solution works as follows: when an LHM is created, it saves the current writer's hostname (select @@global.hostname). Before every chunk insert, it checks that the hostname is still the same as the initial one.

  • If it is the same, it will continue.
  • If it is different, it will abort.
  • In the event of a connection loss, LHM will start the reconnection procedure:
    • If it can reconnect, it checks the hostname again to ensure the writer is the same. If it is, it retries the SQL statement.
    • If it can't reconnect, or if the hostname doesn't match the initial one, it aborts to avoid data loss.

Note: if the hostname is localhost, it will check the server's id instead (i.e. select @@global.server_id).

Complete algorithm:

(flowchart image)
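
A minimal sketch of the consistency check described above, for illustration only; the class and method names here are assumptions, not the exact code in lib/lhm/sql_retry.rb:

  # Illustrative sketch of the writer-consistency check (names are assumptions).
  class WriterConsistencyCheck
    def initialize(connection)
      @connection = connection
      # Remember the writer we started the migration against.
      @initial_host = current_host
    end

    # True if we are still talking to the same writer as when the LHM started.
    def same_writer?
      current_host == @initial_host
    end

    private

    # Identify the writer by hostname; fall back to the server id when the
    # server reports itself as "localhost" (e.g. CloudSQL).
    def current_host
      host = @connection.select_value("SELECT @@global.hostname")
      return host unless host == "localhost"
      @connection.select_value("SELECT @@global.server_id")
    end
  end

Before each chunk insert (and again after a reconnect), same_writer? would be consulted; a false result aborts the migration.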

What changed?

Most of the new logic resides in lib/lhm/sql_retry.rb. There are changes in other files such as chunker.rb and chunk_insert.rb, but they don't add any features. However, since the new logic changes the retry behaviour quite drastically, a lot of the tests had to be adapted, which means many test files were modified in some way.

What to check?

  • Code cleanliness
  • Possible edge cases


xliang6 commented Oct 13, 2021

Great work on putting this PR together! 🎉 Thanks for being resourceful and talking with other teams and working on different components to fix the LHM reconnect issue!

A reconnection can happen whenever the connection between LHM --> MySQL server or LHM --> ProxySQL is lost.
To be precise, the reconnect happens when there is a connection error between LHM -> ProxySQL or ProxySQL -> MySQL. Please refer to the original issues in your PR description so the full context is not lost: https://github.com/Shopify/db-engineering/issues/90 and https://github.com/Shopify/db-engineering/issues/98.

One concern I have is that since we are making these changes in the Shopify LHM gem, what will happen if we get reconnected during an LHM for a CloudSQL app that uses ProxySQL?


EtienneBerubeShopify commented Oct 13, 2021

One concern I have is that since we are making these changes in the Shopify LHM gem, what will happen if we get reconnected during an LHM for a CloudSQL app that uses ProxySQL?

Great question! If we query select @@global.hostname on a CloudSQL instance we get localhost, which is the reason I added the server_id check in the first place. So it should work, but CloudSQL could fail over on Google's side without LHM realizing it, and if that happens I'm unsure how to catch it...


xliang6 commented Oct 13, 2021

Have we enabled the ProxySQL dedicated hostgroup feature for CloudSQL apps? One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?


EtienneBerubeShopify commented Oct 13, 2021

Have we enabled the ProxySQL dedicated hostgroup feature for CloudSQL apps?

I don't know what the topology looks like when it comes to CloudSQL. Some simple changes to the ProxySQL config would make the feature work. I remember people from connection management talking about it, but they weren't sure whether it was going to be added.

One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?

That would be a good idea for a future PR. A fairly simple way of approaching it would be to add a feature flag that could be enabled or disabled depending on the app.
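
For illustration, using the reconnect_with_consistent_host option added in this PR, such a flag could be wired per app roughly like this (the predicate name is hypothetical):

  # Hypothetical per-app wiring: only enable the consistent-host reconnect for
  # apps whose ProxySQL topology is known (e.g. KateSQL/core), and leave it off
  # for CloudSQL until the behaviour is verified there.
  connection.options = { reconnect_with_consistent_host: proxysql_backed_app? }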


xliang6 commented Oct 14, 2021

One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two?

That would be a good idea for a future PR. A fairly simple way of approaching it would be to add a feature flag that could be enabled or disabled depending on the app.

With this PR, do we abort an LHM if it encounters a connection error for a CloudSQL app? If so, I'm okay with delaying this check till later.

@EtienneBerubeShopify

With this PR, do we abort an LHM if it encounters a connection error for a CloudSQL app? If so, I'm okay with delaying this check till later.

Not for now: doing select @@global.hostname on CloudSQL will return "localhost", so instead we check whether the server_id matches. I'm unsure how reliable this is, since Google could fail over CloudSQL instances on their own without it being a problem for us, which would change the server id and cause an "Abort".

@shuhaowu

I'm unsure how reliable this can be since Google could failover CloudSQLs on their own without it being a problem for us, which would change the server id, causing an "Abort".

Conversely, what if they re-use server id like they reuse hostname?


EtienneBerubeShopify commented Oct 19, 2021

Conversely, what if they re-use server id like they reuse hostname?

This is a VERY good question. I did run a few failovers myself with a test app and found that they change servers for every failover (i.e. they don't "dedicate" a MySQL server to your app). So the server id did change, which means we can catch "Google-level" failovers, but I didn't find anything online assessing the true uniqueness of server_ids. MySQL does state that the number must be between 1 and 2^32, which is big but not insanely big.

However, the goal is to catch and abort on a server_id change. Whenever a failover happens the server_id will change, since I don't think they would fail over to the same instance. So even if they do re-use a server_id that was previously used somewhere else, LHM should still catch the difference and abort.

EDIT:

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

@tiwilliam

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

Have we looked into using server_uuid instead of hostname and server_id?

sys.admin_ro.1@mysql> SELECT @@server_uuid;
+--------------------------------------+
| @@server_uuid                        |
+--------------------------------------+
| 941e5058-30b6-11ec-9f0a-ee57386cefaa |
+--------------------------------------+
1 row in set (0.00 sec)

Beginning with MySQL 5.6, the server generates a true UUID in addition to the server_id value supplied by the user. This is available as the global, read-only server_uuid system variable.

https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid
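
If server_uuid were adopted, the identity lookup could be swapped along these lines (a sketch under that assumption, not code from this PR):

  # Sketch: use the replication-unique server_uuid (MySQL >= 5.6) as the writer
  # identity, while keeping the hostname around for human-readable logging.
  def writer_identity(connection)
    {
      uuid: connection.select_value("SELECT @@global.server_uuid"),
      hostname: connection.select_value("SELECT @@global.hostname")
    }
  end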


tiwilliam commented Oct 20, 2021

Since this work is attempting to make LHM ProxySQL-aware, I think it should also be aware of whether it's actually connected to a ProxySQL. This could help clean up the code a bit and avoid the server ID check when we are connected to MySQL directly. I had a read of the ProxySQL code and they do hijack a global to expose this:

> select @@version_comment limit 1;
+-------------------+
| @@version_comment |
+-------------------+
| (ProxySQL)        |
+-------------------+
1 row in set (0.000 sec)

Notice that this is case-sensitive though; the uppercase form is passed through to the backend MySQL server instead:

> SELECT @@version_comment LIMIT 1;
+--------------------------------------------------------+
| @@version_comment                                      |
+--------------------------------------------------------+
| Percona Server (GPL), Release '21', Revision '2a37e4e' |
+--------------------------------------------------------+
1 row in set (0.032 sec)
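
A sketch of what such a detection could look like (an assumed helper, not part of this PR); note that the query text must stay lowercase for ProxySQL to intercept it:

  # Hypothetical check: ProxySQL hijacks the lowercase, case-sensitive
  # "select @@version_comment limit 1" query and answers "(ProxySQL)", while a
  # direct MySQL connection returns the real server's version comment.
  def connected_to_proxysql?(connection)
    comment = connection.select_value("select @@version_comment limit 1")
    comment.to_s.include?("ProxySQL")
  end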

@EtienneBerubeShopify

Checking the MySQL docs about server_id it does state that it must be unique within a replication topology. So any failover with CloudSQL should be between MySQL servers with different ids.

Have we looked into using server_uuid instead of hostname and server_id?

sys.admin_ro.1@mysql> SELECT @@server_uuid;
+--------------------------------------+
| @@server_uuid                        |
+--------------------------------------+
| 941e5058-30b6-11ec-9f0a-ee57386cefaa |
+--------------------------------------+
1 row in set (0.00 sec)

Beginning with MySQL 5.6, the server generates a true UUID in addition to the server_id value supplied by the user. This is available as the global, read-only server_uuid system variable.

https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid

Hmm, I REALLY like the idea of uniqueness. I think I will add server_uuid after I check with CloudSQL that it works as intended. I'll probably keep the hostname as well for logging, since it's much more readable than a UUID when debugging what happened (or on which host the migration started).

@EtienneBerubeShopify EtienneBerubeShopify force-pushed the etienne-feature-proxisql-retry branch 5 times, most recently from 670903d to f0020bb Compare November 8, 2021 20:41

@shuhaowu shuhaowu left a comment


Usually I would provide a summary of my review, but unfortunately this one is too big for me to read through in one sitting, so I don't remember all the details. That said, there are a few points that stood out to me:

  1. There's no action item associated with this for now, but it needs to be mentioned: the way the code is structured now is a SQL retry block nested within a SQL retry block, and the control flow is based on exceptions raised from those blocks. As a result, the code weaves in and out of multiple files, making it relatively difficult to understand. The good part is that at least we have reasonably descriptive comments to help the reader out. The root cause is that the retriable library is too simplistic and we've outgrown it; I believe the code would be significantly simpler if we just wrote our own retry routines. That said, doing that now would be an even bigger rabbit hole, so let's not worry about it for now.
  2. Downstream apps already use ActiveRecord::QueryLogs to set the maintenance:lhm marginalia comments. I see we added our own way to tag the queries. How do these interact?

I think I have a number of other questions in the comments. For those, I don't quite know the answers as I didn't have enough time to fully think them through. As long as you've thought them through and can answer them, I think those can be resolved.

I also have some suggestions about renaming and whatnot. Up to you whether you want to take them.

For the next steps of this PR: I would like to merge it ASAP if possible. I'll take a look at this PR hopefully one more time after all the comments are addressed. Hopefully that will be good enough to merge and we can continue working in smaller changes.

assert logs.last.include?("Lost connection to MySQL server at 'reading initial communication packet")
end

it "Will abort if failover happens (mimicked with proxySQL)" do


How is this test different from "will abort if new writer is not same host"? The other test appears to use mocks to change the host, but this one does it for real? Isn't this test better than the other test and therefore the other one doesn't need to be there?


They are similar, yes; they take two approaches to ensuring that the code behaves as intended. I'm not against having more tests rather than fewer, but I can understand that it could be seen as boilerplate and unnecessary. I kept it thinking it wouldn't cause any harm, but it can be removed. What do you think?


Keeping is fine.

@EtienneBerubeShopify EtienneBerubeShopify force-pushed the etienne-feature-proxisql-retry branch from 0ce11af to b3da1ea Compare November 11, 2021 19:39

@shuhaowu shuhaowu left a comment


Let's NOT merge this for now due to the infrastructure freeze, as I want to keep the master branch at the latest released version in case we have to do any emergency fixes. If we merge this to master, doing emergency fixes might be more challenging.

raise 'Please call Lhm.setup' unless defined?(ActiveRecord)
@@connection = Connection.new(connection: ActiveRecord::Base.connection, options: connection_options || {})
else
@@connection.options = connection_options unless connection_options.nil?

@shuhaowu shuhaowu Nov 16, 2021


Wait, I thought the idea of this method was to just return @@connection if it exists and error out if it doesn't? This kind of overwrite feels like it might be fragile.


This is what I thought too, but there was an assignment in there already. I didn't want to break the developer-facing interface, so I kept it as is.

Original code:

  def connection
    @@connection ||=
      begin
        raise 'Please call Lhm.setup' unless defined?(ActiveRecord)
        ActiveRecord::Base.connection
      end
  end

The decision came from this thread that we had. [source]

Comment on lines +25 to +27
def options=(options)
# If any other flags are added. Add the "processing" here
@sql_retry.reconnect_with_consistent_host = options[:reconnect_with_consistent_host] || false


So if we do connection.options = options, it will only update a single member variable in @sql_retry? Why not just do connection.reconnect_with_consistent_host = true/false instead of using an options hash?


@EtienneBerubeShopify EtienneBerubeShopify Nov 16, 2021


I did it this way because options could contain more than one value. I also added it so that there could be more feature flags in the future.

Moreover, I thought that the "logic" (i.e. extracting the desired option) made more sense in the connection class than in Lhm.setup.

However, if you don't see any new flags coming in the future, I can always make it more explicit.

Comment on lines +38 to +67
def update(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:update, query, retry_options)
else
exec(:update, query)
end
end

def select_value(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_value, query, retry_options)
else
exec(:select_value, query)
end
end

def select_values(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_values, query, retry_options)
else
exec(:select_values, query)
end
end

def select_one(query, should_retry: false, retry_options: {})
if should_retry
exec_with_retries(:select_one, query, retry_options)
else
exec(:select_one, query)
end


Fine for now; it occurred to me that AR allows more than just query strings to be passed in here (and possibly also bind arguments?), not that we use this feature...


I see what you mean. Right now we don't accept AR calls from "outside" this gem. So for now this is solid.

private

def exec(method, sql)
connection.public_send(method, Lhm::ProxySQLHelper.tagged(sql))


For the QueryLog, what I mean is if that is set, will the query be double tagged? Is that OK?

For method_missing, we don't need it now, but as we override more and more methods, we might consider using it. Let's defer that for the future though, as maybe at some point we can switch over to mysql2 instead of the AR connection.

@EtienneBerubeShopify

For the QueryLog, what I mean is if that is set, will the query be double tagged? Is that OK?
For method_missing, we don't need it now, but as we override more and more methods, we might consider using it. Let's defer that for the future though, as maybe at some point we can switch over to mysql2 instead of the AR connection.

For some reason, I could not reply. So yes, as soon as this is merged, I'll open a PR in core to remove the QueryLogs tags (since the tagging is now handled here). As for a potential switch, I agree it would be ideal to have something a bit more complete. However, we'll still have to find a way to determine which delegated methods require tagging. For instance, if we receive a call like #current_database(database_name), blindly tagging the argument will cause problems. So we'll need a list somewhere of which methods can/cannot be tagged.

Also, the only queries that REALLY require the tags are the inserts. Normal queries can be performed on a read-only replica without any issues, so there's a pretty major subset of queries that could be left untagged without DRAMATIC consequences. But this can be addressed when we face that problem.
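
For context, a sketch of what a tagging helper along these lines could look like; the annotation string and its placement are assumptions based on the maintenance:lhm marginalia comment mentioned earlier, and the actual Lhm::ProxySQLHelper implementation is not shown in this thread:

  # Hypothetical tagging helper: appends an annotation comment that a ProxySQL
  # query rule can match on to route the statement to the writer hostgroup.
  module ProxySQLTagger
    ANNOTATION = "/*maintenance:lhm*/" # assumption, not necessarily the real tag

    def self.tagged(sql)
      "#{sql} #{ANNOTATION}"
    end
  end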
