Session Pool Leaks & Investigation #192
Description
Environment details
- OS: CoreOS
- Node.js version: v8.9.6
- npm version: v6.0.0
- @google-cloud/spanner version: v1.4.1 (kinda - see below)
The issue
As I've mentioned in my other issues (#183, #177), we (@honeyscience) are attempting to apply a fairly large amount of load to our new Spanner-backed APIs but are running into a slew of session pooling issues. The main issues are:
- Time spent waiting for a session is too high
- Session pool grows seemingly without bound (Session pool max not respected #177)
- When the session max is hit you either need to spin up even more nodes ($$$) or wait ~an hour for sessions to be deleted by the server
As per our previous issue (#134) and PR (#135) against the session pooling code, the module is set up in a way that makes it quite prone to race conditions. On top of that, session pool management is not fully contained within the session pool module itself, which has made it harder to understand. Without knowing exactly what the authors were aiming for, and unsure that we could confidently rule out leaks with the current architecture, we decided to replace it rather than try to fix it.
Our first attempt at replacing it was a rudimentary setup: create a session upon request and destroy it after 60 seconds. You can see this here (https://github.com/honeyscience/nodejs-spanner/commit/c76df11457e3b72fa0c1f32088af780f9e64e730#diff-0c4114554752f1c79d35d17a91563062R292).
Although this setup suffered from some write failures and increased latency overall, it worked amazingly well. We were able to run the API overnight with 120 replicas in total, never consuming more than 100 Spanner sessions.
This is of course non-optimal, as you can see from the following chart, where the purple line is session destruction and the blue line is session creation:

The impact of this approach is increased latency and some occasional failures. However, as a proof of concept, we were more than pleased with the results.
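For readers who don't want to dig through the commit, the create-upon-request idea can be sketched roughly as below. This is not the code from the linked commit: `ExpiringSessionSource`, `createSession`, `deleteSession` and `ttlMs` are hypothetical names, and destroying the session unconditionally once the TTL elapses is an assumption about how such a setup might behave.

```javascript
// Sketch only: `createSession` and `deleteSession` are hypothetical callbacks
// standing in for whatever actually talks to Spanner.
class ExpiringSessionSource {
  constructor({ createSession, deleteSession, ttlMs = 60 * 1000 }) {
    this.createSession = createSession;
    this.deleteSession = deleteSession;
    this.ttlMs = ttlMs;
    this.live = new Set(); // sessions created but not yet destroyed
  }

  // Every request pays for a fresh session; nothing is reused.
  async acquire() {
    const session = await this.createSession();
    this.live.add(session);
    // Assumption: destroy after the TTL whether or not the caller is done.
    const timer = setTimeout(() => this.destroy(session), this.ttlMs);
    timer.unref(); // don't keep the process alive just for cleanup
    return session;
  }

  async destroy(session) {
    if (!this.live.delete(session)) return; // already destroyed
    try {
      await this.deleteSession(session);
    } catch (err) {
      // Best effort: a session we fail to delete is left for the server to clean up.
    }
  }
}
```

The session count stays bounded by request rate times TTL, which would fit the low session usage we saw, but every request pays the session creation cost, which would fit the added latency.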
We then decided to write our own pool implementation, leveraging this library for the main pool management functionality: https://github.com/coopernurse/node-pool
The code for this is available here: https://github.com/honeyscience/nodejs-spanner/blob/869871e45a034af2ba9158466912dc4f17e8e32b/src/session-pool.js
It still has some issues (particularly with write sessions) but so far it has been doing fairly well.
The main issue we’ve seen with our current setup is that everything seems to use only write sessions, which are sometimes slow to create or to be released back into the pool. We suspect there is still a code path that does not release sessions back to the pool: with the new setup the pool properly enforces the max session count, but all sessions are marked as borrowed, so every request is queued and eventually times out.
Spanner API requests for the same period:

When the session limit was much higher, we saw failed writes cause spikes in session create requests (upwards of a few thousand per second; see screenshot). Raising the limit did not solve the write failures, though; they still time out. This seems like a similar issue, if not the same one.
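To make the "everything is borrowed, so everything queues and times out" failure mode concrete, here is a minimal sketch of a bounded pool. It is neither our session-pool.js nor node-pool's implementation; `BoundedPool`, `create`, `max` and `acquireTimeoutMs` are made-up names for illustration. The one detail worth noting is that a slot is reserved before awaiting creation, which is one way to keep concurrent requests from overshooting the max.

```javascript
// Sketch only: a minimal bounded pool, not node-pool's or the client's code.
class BoundedPool {
  constructor({ create, max, acquireTimeoutMs }) {
    this.create = create;       // hypothetical async session factory
    this.max = max;
    this.acquireTimeoutMs = acquireTimeoutMs;
    this.idle = [];             // sessions ready to hand out
    this.borrowed = new Set();  // sessions currently checked out
    this.pending = 0;           // creations in flight, counted toward max
    this.waiters = [];          // queued acquire() calls
  }

  get size() {
    return this.idle.length + this.borrowed.size + this.pending;
  }

  async acquire() {
    if (this.idle.length > 0) return this._lend(this.idle.pop());
    if (this.size < this.max) {
      this.pending++; // reserve the slot before awaiting, so max is not overshot
      try {
        return this._lend(await this.create());
      } finally {
        this.pending--;
      }
    }
    // Pool is full and nothing is idle: queue until release() or timeout.
    return new Promise((resolve, reject) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        this.waiters.splice(this.waiters.indexOf(waiter), 1);
        reject(new Error('timed out waiting for a session'));
      }, this.acquireTimeoutMs);
      this.waiters.push(waiter);
    });
  }

  release(session) {
    if (!this.borrowed.delete(session)) return; // ignore unknown or double release
    const waiter = this.waiters.shift();
    if (!waiter) {
      this.idle.push(session);
      return;
    }
    clearTimeout(waiter.timer);
    waiter.resolve(this._lend(session)); // hand the session straight to the next waiter
  }

  _lend(session) {
    this.borrowed.add(session);
    return session;
  }
}
```

In this sketch a single code path that never calls `release()` permanently removes one slot; once `max` such sessions have leaked, every later `acquire()` can only time out.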
Looking around, we found this PR, which addressed an issue with write sessions: googleapis/google-cloud-node#2561. It could be relevant to the write session speed issue.
We're currently looking into whether this change (f5897d6) will resolve the issue with sessions getting stuck in the borrowed state.
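One way to hunt for the path that never releases is to wrap the pool so that each checkout records where it was borrowed, and to funnel callers through a helper that releases on every exit path. The sketch below is illustrative only: `TrackedPool`, `withSession` and `suspects` are hypothetical names, and `pool` is assumed to be any object exposing an async `acquire()` and a `release(session)`.

```javascript
// Sketch only: `pool` is any object with async acquire() and release(session);
// the names here are illustrative, not the client library's API.
class TrackedPool {
  constructor(pool) {
    this.pool = pool;
    this.checkouts = new Map(); // session -> { since, stack }
  }

  async acquire() {
    const session = await this.pool.acquire();
    // Record who borrowed the session so a leak can be traced to its call site.
    this.checkouts.set(session, {
      since: Date.now(),
      stack: new Error('borrowed here').stack,
    });
    return session;
  }

  release(session) {
    this.checkouts.delete(session);
    this.pool.release(session);
  }

  // Run `fn` with a session and release it on every path, including throws.
  async withSession(fn) {
    const session = await this.acquire();
    try {
      return await fn(session);
    } finally {
      this.release(session);
    }
  }

  // Sessions held for at least `olderThanMs` are leak suspects.
  suspects(olderThanMs, now = Date.now()) {
    return [...this.checkouts.values()].filter((c) => now - c.since >= olderThanMs);
  }
}
```

Logging `suspects(...)` periodically would print the stack of each long-held checkout, which should point at any caller that acquires without releasing.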
If there is any other information you can give us on how sessions work which could be relevant to our efforts we're all ears.
We’re still digging at this from our side and will report back with any more findings. We're more than happy to work with anyone on your side to get a speedy and solid resolution to this.

