fix: pause request stream on backpressure #936
Conversation
The request stream should be paused if the downstream indicates that it cannot handle any more data at the moment, and resumed once the downstream accepts data again. This reduces memory consumption and prevents potential out-of-memory errors when a result stream is piped into a slow writer. Fixes googleapis#934
|
This change looks good to me @olavloite. Thanks for working on this. I can't say I have the expertise to review it properly. @bcoe could I kindly ask you to take a look to see if this is reasonable?
|
I'm also very confused by these test failures that are happening on all PRs. I can't reproduce them locally.
|
I've been doing some additional testing with and without this fix using the function below. It does a select from a Spanner database, transforms the rows into JSON strings, and writes them to a file. There is also a custom transformer in the pipeline that artificially slows the write progress every 50 rows to simulate a slow flush. Running this script with and without the change in this PR on a mid-sized (87MB) and a huge (2GB) result set gives the following results for maximum memory usage:

Normal result set (87MB)

Huge result set (2GB)
The exact effect of this fix will depend a lot on multiple factors. Larger values for any of these factors will result in higher memory usage in all cases, and without this fix it can cause the entire result set to be loaded into memory.

Without slow flush

It should also be noted that streaming result sets will in most cases be extremely efficient. Running the same test without the simulated slow flushes shows the same results both with and without this fix. It also shows the effect of how Spanner chunks the result set: the memory consumption when streaming the huge result set is a lot lower than when streaming the normal result set (test results are equal with and without this fix).
Test script

```javascript
async function queryWithMemUsage(instanceId, databaseId, projectId) {
  // Imports the Google Cloud client library
  const {Spanner} = require('@google-cloud/spanner');
  // eslint-disable-next-line node/no-unpublished-require
  // const {Spanner} = require('../build/src');
  const fs = require('fs');
  const stream = require('stream');
  const util = require('util');
  // eslint-disable-next-line node/no-unsupported-features/node-builtins
  const pipeline = util.promisify(stream.pipeline);

  // Creates a client
  const spanner = new Spanner({
    projectId: projectId,
  });

  // Gets a reference to a Cloud Spanner instance and database
  const instance = spanner.instance(instanceId);
  const database = instance.database(databaseId);
  const query = {
    sql: `SELECT *
          FROM TableWithAllColumnTypes
          ORDER BY ColInt64`,
  };
  let count = 0;
  let maxMemMeasured = 0;
  const fileStream = fs.createWriteStream('/home/loite/rs.txt');
  const rs = database.runStream(query);
  console.time('process result set');
  // eslint-disable-next-line node/no-unsupported-features/node-builtins
  await pipeline(
    rs,
    new stream.Transform({
      objectMode: true,
      highWaterMark: 100,
      transform(chunk, encoding, callback) {
        count++;
        if (count % 100 === 0) {
          console.log(`Processed ${count} rows so far`);
          // Requires running node with the --expose-gc flag.
          global.gc();
          const used = process.memoryUsage().heapUsed / 1024 / 1024;
          const memUsed = Math.round(used * 100) / 100;
          console.log(`Current mem usage: ${memUsed} MB`);
          maxMemMeasured = Math.max(maxMemMeasured, memUsed);
        }
        this.push(`${JSON.stringify(chunk.toJSON({wrapNumbers: true}))}\n`);
        callback();
      },
    }),
    // Create an artificially slow transformer to simulate network latency.
    new stream.Transform({
      highWaterMark: 100,
      transform(chunk, encoding, callback) {
        // Simulate a slow flush every 50 records.
        if (count % 50 === 0) {
          setTimeout(() => {
            this.push(chunk, encoding);
            callback();
          }, Math.random() * 200 + 100);
        } else {
          this.push(chunk, encoding);
          callback();
        }
      },
    }),
    fileStream
  );
  console.timeEnd('process result set');
  console.log(`Max memory used: ${maxMemMeasured} MB`);
  console.log('Finished writing file');
  await database.close();
}
```
|
@olavloite this is pretty great. We've been carrying around the subtle issue where we have a single data event that needs to be split apart and forwarded to the next stream as multiple data events. That's the core issue we're attacking here, right?

One thing I noticed-- it looks like the transform stream would retry indefinitely if the consumer never becomes ready. We probably want a cap on the maximum number of attempts to avoid that.

I have a module "split-array-stream" which currently has the same problem. We use split-array-stream throughout various libraries, maybe even this one, although it looks like not from this file. I haven't merged this change yet, but would this class either be plug-and-playable here, or be useful in some way to incorporate in PartialResultStream? The description of this PR shows how it can be used: stephenplusplus/split-array-stream#4. Note that it's not merged and released, as it has not yet had a formal review. However, at the time, I had put it through similar tests as you did for this change. Since it seems like we're attacking the same issue, I thought it could be worth checking out. Let me know what you think!
|
@stephenplusplus

Correct. Cloud Spanner returns a stream of PartialResultSets, and a single PartialResultSet can contain data for many rows that need to be forwarded downstream as separate data events.

Good point. I'll add an escape for that possibility.

Regarding split-array-stream: it does, however, seem to be solving much of the same problem that we are having here, but there's one thing that might be different (or that I'm missing): during my testing of the Spanner client library without this PR I was able to make it go out of memory. That seems to have been caused by the request stream that kept pushing data even though the downstream could not keep up.
PartialResultSetStream should stop retrying to push data into the stream after a configurable number of retries have failed.
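A sketch of what such a cap could look like (purely illustrative; `pushWithRetry`, `isReady`, and the limits are made-up names, not the actual implementation):

```javascript
// Retry pushing a row while the downstream is not ready; give up after
// maxAttempts so a permanently stalled consumer cannot make us retry
// forever. The caller should destroy the stream with an error when this
// returns false.
function pushWithRetry(push, row, isReady, maxAttempts) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (isReady()) {
      push(row);
      return true;
    }
  }
  return false;
}

// A consumer that becomes ready on the third check.
let checks = 0;
const pushed = [];
const ok = pushWithRetry(r => pushed.push(r), 'row-1', () => ++checks >= 3, 5);
console.log(ok, pushed); // true [ 'row-1' ]

// A consumer that never becomes ready: we give up instead of spinning.
const gaveUp = pushWithRetry(r => pushed.push(r), 'row-2', () => false, 5);
console.log(gaveUp); // false
```

In a real stream the attempts would be spaced out with a timer rather than a tight loop; the sketch only shows the bounded-attempts logic.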
Codecov Report

```
@@           Coverage Diff            @@
##           master     #936   +/-   ##
=======================================
  Coverage   98.26%   98.26%
=======================================
  Files          21       21
  Lines       20356    20423    +67
  Branches     1084     1096    +12
=======================================
+ Hits        20002    20069    +67
  Misses        351      351
  Partials        3        3
```

Continue to review full report at Codecov.
|
|
@stephenplusplus Would you mind taking a second look at this?
|
@olavloite looks good to me. Just a thought that I know would be inconvenient to implement, and potentially not worth it because of that-- after this change, the stream philosophy of "each stream doesn't concern itself with another stream" is broken when we pass the request stream into the PartialResultStream.

Possibly a way around this would be to have the PRS emit events to indicate it needs a break, and concerned streams could react as a result. Something like:

```diff
 values.forEach(value => {
   res = this._addValue(value) && res;
   if (!res && !this._requestStream.isPaused()) {
-    this._requestStream.pause();
+    this.emit('paused');
   }
 });
```

```diff
 requestsStream
   .pipe(batchAndSplitOnTokenStream)
   // If we get this error, the checkpoint stream has flushed any rows
   // it had queued. We can now destroy the user's stream, as our retry
   // attempts are over.
   .on('error', (err: Error) => userStream.destroy(err))
   .on('checkpoint', (row: google.spanner.v1.PartialResultSet) => {
     lastResumeToken = row.resumeToken;
   })
   .pipe(userStream)
+  .on('paused', () => {
+    requestsStream.pause();
+  })
```

In fact, writing that made me realize that should work by default, shouldn't it? The only reason we don't get built-in backpressure is because we do the one-to-many split of data events. But now that we push each data event singularly, you should be able to pause the PRS stream itself and have the streams before it react properly automatically.
I like the idea of this. It does make it more idiomatic while the end result is the same. We do need two events, though: one to signal that the request stream should be paused and one to signal that it can be resumed.
|
Thanks for doing that!