-
Notifications
You must be signed in to change notification settings - Fork 24
Added "ProxySQL-aware" reconnect logic for LHM #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3846361 to
32a68c8
Compare
|
Great work on putting this PR together! 🎉 Thanks for being resourceful and talking with other teams and working on different components to fix the LHM reconnect issue!
One concern I have is that since we are making these changes in the Shopify LHM gem, what will happen if we get reconnected during an LHM for a CloudSQL app that uses ProxySQL? |
Great question! If we query |
|
Have we enabled the ProxySQL dedicated hostgroup feature for CloudSQL apps? One path forward is to detect whether we are talking to KateSQL or core, and only reconnect for those two? |
I don't know what's the topology when it comes to CloudSQL. Some simple changes to the ProxySQL config would make the feature work. I remember people from connection management talking about it, but weren't sure if it was going to be added.
That would be a good idea for a future PR. A fairly simple way of approaching it would be to add a feature flag that could be enabled or disabled depending on the app. |
With this PR, do we abort an LHM if it encounters a connection error for a CloudSQL app? If so, I'm okay with delaying this check till later. |
Not for now, doing |
Conversely, what if they re-use server id like they reuse hostname? |
This is a VERY good question. I did run a few failovers myself with a test app and found that they are changing servers for every failover (i.e. they don't "dedicate" a MySQL server to your app). So the server id did change, which mean we can catch "Google-level" failovers, but I didn't find anything online on assessing the true uniqueness of However, the goal is to catch and abort on server_id change. So whenever a failover happens, the EDIT: Checking the MySQL docs about |
Have we looked into using
https://dev.mysql.com/doc/refman/5.6/en/replication-options.html#sysvar_server_uuid |
|
Since this work is attempting to make LHM ProxySQL aware, I think it should also be aware if it's actually connected to a ProxySQL. This could help to clean-up the code a bit and avoid the server ID check if we are connected to MySQL directly. I had a read in the ProxySQL code and they do hijack a global to expose this: Notice that this is case sensitive though... |
Hmm, I REALLY like the idea of uniqueness. I think I will add the |
670903d to
f0020bb
Compare
…fail fast and which should retry
f0020bb to
aeeaafe
Compare
shuhaowu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually I would provide a summary of my review, but unfortunately this one is too big for me to read through in one sitting, so I don't remember all the details. That said, there are a few points that stood out to me:
- There's no action item associated for this for now, but it needs to be mentioned: the way the code is structured now is with a sql retry block within a sql retry block. The control flow of the code is based on exceptions that's raised from the blocks. As a result, the code weaves in and out of multiple files, making it relatively difficult to understand. The good parts is at least we have reasonably descriptive comments to help the reader out. The root cause of this is not the retriable library is too simplistic and we've outgrown it. I believe the code should be significantly simpler if we just wrote our own retry routines. That said, doing this now will be even a bigger rabbit hole, so let's not worry about it for now.
- Downstream apps already uses
ActiveRecord::QueryLogsto set the maintenance:lhm marginalia comments. I see we added a way to tag the queries. How do these interact?
I think I have a number of other questions in the comments. For those, I don't quite know the answer as I didn't have enough time to fully think it through. As long as you thought it through and can answer it, I think those can be resolved.
I also have some suggestions of renaming and what not. Up to you if you want to take it or not.
For the next steps of this PR: I would like to merge it ASAP if possible. I'll take a look at this PR hopefully one more time after all the comments are addressed. Hopefully that will be good enough to merge and we can continue working in smaller changes.
| assert logs.last.include?("Lost connection to MySQL server at 'reading initial communication packet") | ||
| end | ||
|
|
||
| it "Will abort if failover happens (mimicked with proxySQL)" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this test different from "will abort if new writer is not same host"? The other test appears to use mocks to change the host, but this one does it for real? Isn't this test better than the other test and therefore the other one doesn't need to be there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are similar, yes. They take 2 approaches to ensure that the code behaves as intended. I'm not against having more tests than not, but I can understand that it could be seen as boilerplate and unnecessary. I kept it thinking it would not cause any harm, but it can be removed. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping is fine.
0ce11af to
b3da1ea
Compare
shuhaowu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's NOT merge this for now due to infrastructure freeze as I want to keep the master branch to the latest version, in case we have to do any emergency fixes. If we merge this to master, doing emergency fixes might be more challenging.
| raise 'Please call Lhm.setup' unless defined?(ActiveRecord) | ||
| @@connection = Connection.new(connection: ActiveRecord::Base.connection, options: connection_options || {}) | ||
| else | ||
| @@connection.options = connection_options unless connection_options.nil? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I thought the idea of this method is to just return @@connection if it exist and error if it doesn't? This kind of overwrite feels like it might be fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also what I thought, but there was an assignment in there. I didn't want to break the "developer facing" interface, so I kept it as is.
Original code:
def connection
@@connection ||=
begin
raise 'Please call Lhm.setup' unless defined?(ActiveRecord)
ActiveRecord::Base.connection
end
endThe decision came from this thread that we had. [source]
| def options=(options) | ||
| # If any other flags are added. Add the "processing" here | ||
| @sql_retry.reconnect_with_consistent_host = options[:reconnect_with_consistent_host] || false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we do connection.options = options, it will only update a single member variable in @sql_retry? Why not just do connections.reconnect_with_consistent_host = true/false instead of using a options hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did because options could contain more than one values. I also added this so that there could be more feature flags in the future.
Moreover, I thought that performing the "logic" (i.e. extracting the desired option) would make more sense to be handled in the connection class than in the Lhm.Setup.
However, if you don't see any new features coming in the future, I can always make it more clear.
| def update(query, should_retry: false, retry_options: {}) | ||
| if should_retry | ||
| exec_with_retries(:update, query, retry_options) | ||
| else | ||
| exec(:update, query) | ||
| end | ||
| end | ||
|
|
||
| def select_value(query, should_retry: false, retry_options: {}) | ||
| if should_retry | ||
| exec_with_retries(:select_value, query, retry_options) | ||
| else | ||
| exec(:select_value, query) | ||
| end | ||
| end | ||
|
|
||
| def select_values(query, should_retry: false, retry_options: {}) | ||
| if should_retry | ||
| exec_with_retries(:select_values, query, retry_options) | ||
| else | ||
| exec(:select_values, query) | ||
| end | ||
| end | ||
|
|
||
| def select_one(query, should_retry: false, retry_options: {}) | ||
| if should_retry | ||
| exec_with_retries(:select_one, query, retry_options) | ||
| else | ||
| exec(:select_one, query) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine for now, it occured to me that AR allows for more than just queries to be passed in here (and possibly also bind arguments?), not that we use this feature..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. Right now we don't accept AR calls from "outside" this gem. So for now this is solid.
| private | ||
|
|
||
| def exec(method, sql) | ||
| connection.public_send(method, Lhm::ProxySQLHelper.tagged(sql)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the QueryLog, what I mean is if that is set, will the query be double tagged? Is that OK?
For method_missing, we don't need it now, but as we override more and more methods, we might consider using it. Let's defer that for the future tho as maybe at some point we can switch over to mysql2 instead of the AR connection.
For some reason, I could not reply. So yes, as soon as this is merged, I'll open a PR in core to remove the QueryLogs tags (since it's handled by ourselves). As for a potential switch, I agree, it would be ideal to have something a bit more complete. However, we'll still have to find a way to determine which methods being delegated will require being tagged. For instance, if we receive a call from Also, the only queries who REALLY require the tags are the inserts. As for normal, queries can be performed on a read-only replica without any issues. There's a pretty major subset of the queries who could be ignored without DRAMATIC consequences. |
What am I trying to achieve
Currently, if there is a loss of connection between LHM and the MySQL server, LHM will retry but with the closed connection. This PR addresses the problem + goes deeper with underlying potential data loss problems when used with Proxy SQL.
Why?
A reconnection can happen whenever the connection between LHM --> MySQL server or LHM --> ProxySQL is lost. However, in some scenarios involving failovers, the writer will change and with the right amount of replication lag, this can cause data loss.
However, in cases where the connection is loss and the writer is the same, then it is perfectly safe to reconnect and retry the last SQL statement.
How
The solution implemented goes as follows: When an LHM is created, it will save the current writer's hostname (
select @@global.hostname). Before every chunk insert, it will check that the hostname is the same as the initial one.In the event of a connection loss, LHM will then start the reconnection procedure.
Note: if the hostname is localhost, then it will check for the server's id (i.e.
select @@global.server_id)Complete algorithm:
What changed?
Most of the new logic resides in
lib/lhm/sql_retry.rb. There are other changes in other files likechunker.rbandchunk_insert.rb, but they don't add any features. However, since the new logic change the retry logic quite drastically, a lot of the tests had to be adapted. This means that a lot of the tests files had to modified in some way.What to check?