This repository was archived by the owner on Mar 4, 2026. It is now read-only.

fix: pause request stream on backpressure #936

Merged
olavloite merged 9 commits into googleapis:master from olavloite:runstream-mem-consumption on Jun 3, 2020

Conversation

@olavloite
Contributor

The request stream should be paused if the downstream indicates that it cannot handle any more data at the moment, and resumed once the downstream accepts data again. This reduces memory consumption and prevents potential out-of-memory errors when a result stream is piped into a slow writer.

Fixes #934

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 5, 2020
@olavloite olavloite added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 5, 2020
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 5, 2020
@skuruppu
Contributor

skuruppu commented May 6, 2020

This change looks good to me @olavloite. Thanks for working on this. I can't say I have the expertise to review it properly. @bcoe could I kindly ask you to take a look to see if this is reasonable?

@skuruppu skuruppu requested a review from bcoe May 6, 2020 02:42
@skuruppu
Contributor

skuruppu commented May 6, 2020

I'm also very confused by these test failures that are happening on all PRs. I can't reproduce it locally.

@olavloite
Contributor Author

I've been doing some additional testing with and without this fix using the function below. It does a select from a Spanner database, transforms the rows into JSON strings, and writes them to a file. There is also a custom transformer in the pipeline that artificially slows down writing every 50 rows to simulate a slow flush.

Running this script with and without the change in this PR on a mid-sized (87MB) and a huge (2GB) result set gives the following maximum memory usage:

Normal result set (87MB)

  • Without this fix: 324MB
  • With this fix: 330MB

Huge result set (2GB)

  • Without this fix: out-of-memory error (max heap size was set to 1.5GB)
  • With this fix: 918MB

The exact effect of this fix will depend a lot on multiple factors:

  • The size of the result set that is streamed.
  • The size of the individual partial result sets that Cloud Spanner generates.
  • The delay in the pipeline.

Larger values for any of the above will result in higher memory usage in all cases, and without this fix it can cause the entire result set to be loaded into memory.

Without slow flush

It should also be noted that streaming result sets will in most cases be extremely efficient. Running the same test without the simulated slow flushes shows the same results both with and without this fix. It also shows the effect of how Spanner chunks the result set: the memory consumption when streaming the huge result set is a lot lower than when streaming the normal result set (test results are equal with and without this fix):

  • Normal result set: 299MB
  • Huge result set: 33MB

Test script

async function queryWithMemUsage(instanceId, databaseId, projectId) {
  // Imports the Google Cloud client library
  const {Spanner} = require('@google-cloud/spanner');

  // eslint-disable-next-line node/no-unpublished-require
  // const {Spanner} = require('../build/src');
  const fs = require('fs');
  const stream = require('stream');
  const util = require('util');
  // eslint-disable-next-line node/no-unsupported-features/node-builtins
  const pipeline = util.promisify(stream.pipeline);

  // Creates a client
  const spanner = new Spanner({
    projectId: projectId,
  });

  // Gets a reference to a Cloud Spanner instance and database
  const instance = spanner.instance(instanceId);
  const database = instance.database(databaseId);

  const query = {
    sql: `SELECT *
          FROM TableWithAllColumnTypes
          ORDER BY ColInt64`,
  };

  let count = 0;
  let maxMemMeasured = 0;
  const fileStream = fs.createWriteStream('/home/loite/rs.txt');
  const rs = database.runStream(query);

  console.time('process result set');
  // eslint-disable-next-line node/no-unsupported-features/node-builtins
  await pipeline(
    rs,
    new stream.Transform({
      objectMode: true,
      highWaterMark: 100,
      transform(chunk, encoding, callback) {
        count++;
        if (count % 100 === 0) {
          console.log(`Processed ${count} rows so far`);
          global.gc(); // requires running node with --expose-gc
          const used = process.memoryUsage().heapUsed / 1024 / 1024;
          const memUsed = Math.round(used * 100) / 100;
          console.log(`Current mem usage: ${memUsed} MB`);
          maxMemMeasured = Math.max(maxMemMeasured, memUsed);
        }
        this.push(`${JSON.stringify(chunk.toJSON({wrapNumbers: true}))}\n`);
        callback();
      },
    }),
    // Create an artificially slow transformer to simulate network latency.
    new stream.Transform({
      highWaterMark: 100,
      transform(chunk, encoding, callback) {
        // Simulate a slow flush every 50 records.
        if (count % 50 === 0) {
          setTimeout(() => {
            this.push(chunk, encoding);
            callback();
          }, Math.random() * 200 + 100);
        } else {
          this.push(chunk, encoding);
          callback();
        }
      },
    }),
    fileStream
  );
  console.timeEnd('process result set');
  console.log(`Max memory used: ${maxMemMeasured} MB`);
  console.log('Finished writing file');
  await database.close();
}

@olavloite olavloite added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2020
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2020
@skuruppu skuruppu requested review from stephenplusplus and removed request for bcoe May 13, 2020 10:08
@stephenplusplus
Contributor

@olavloite this is pretty great. We've been carrying around the subtle issue where we have a single data event that needs to be split apart and forwarded to the next stream as multiple data events. That's the core issue we're attacking here, right?

One thing I noticed: it looks like the transform stream would retry indefinitely if the consumer never becomes ready. We probably want a cap on the maximum number of attempts to avoid that.

I have a module, "split-array-stream", which currently has the same problem. We use split-array-stream throughout various libraries, maybe even this one, although apparently not from this file. I haven't merged this change yet, but would this class either be plug-and-playable here, or be useful in some way to incorporate into PartialResultStream? The description of this PR shows how it can be used: stephenplusplus/split-array-stream#4

Note that it's not merged and released, as it has not yet had a formal review. However, at the time, I had put it through similar tests as you did for this change. Since it seems like we're attacking the same issue, I thought it could be worth checking out.

Let me know what you think!

@olavloite
Contributor Author

@stephenplusplus
Thanks for having a look at this.

> We've been carrying around the subtle issue where we have a single data event that needs to be split apart and forwarded to the next stream as multiple data events. That's the core issue we're attacking here, right?

Correct. Cloud Spanner returns a stream of PartialResultSets, each of which contains a set of rows. The PartialResultSets are split into individual rows, which are forwarded to the next stream.

> One thing I noticed: it looks like the transform stream would retry indefinitely if the consumer never becomes ready. We probably want a cap on the maximum number of attempts to avoid that.

Good point. I'll add an escape for that possibility.

Regarding split-array-stream: That's interesting. And also quite a bit to digest to really understand what it does :-) I'll look into it.

It does, however, seem to solve much of the same problem that we have here, though there is one thing that might be different (or that I'm missing): during my testing of the Spanner client library without this PR, I was able to make it run out of memory. That seems to have been caused by the request stream, which kept pushing PartialResultSets into the PartialResultSetStream. To prevent that, this change also explicitly pauses and resumes the request stream as needed. Does split-array-stream also support that, either in this way or some other way?

olavloite added 2 commits May 18, 2020 10:22
PartialResultSetStream should stop retrying to push data into the
stream after a configurable number of retries have failed.
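The retry cap described in the commit above might look something like the following sketch; the names `waitUntilReady`, `maxAttempts`, and `delayMs` are illustrative and not the library's actual API:

```javascript
// Poll a readiness check and give up after a configurable number of
// attempts, so a consumer that never becomes ready cannot make the
// producer retry forever (all names here are hypothetical).
function waitUntilReady(ready, maxAttempts, delayMs, callback) {
  let attempts = 0;
  const check = () => {
    if (ready()) return callback(null);
    if (++attempts >= maxAttempts) {
      return callback(new Error('consumer did not become ready in time'));
    }
    setTimeout(check, delayMs);
  };
  check();
}
```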
@codecov

codecov bot commented May 18, 2020

Codecov Report

Merging #936 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #936   +/-   ##
=======================================
  Coverage   98.26%   98.26%           
=======================================
  Files          21       21           
  Lines       20356    20423   +67     
  Branches     1084     1096   +12     
=======================================
+ Hits        20002    20069   +67     
  Misses        351      351           
  Partials        3        3           
Impacted Files Coverage Δ
src/partial-result-stream.ts 100.00% <100.00%> (ø)
src/transaction.ts 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6aa745a...e8b831b. Read the comment docs.

@olavloite
Contributor Author

@stephenplusplus Would you mind taking a second look at this?

@stephenplusplus
Contributor

@olavloite looks good to me. Just a thought that I know would be inconvenient to implement, and potentially not worth it because of that: after this change, the stream philosophy of "each stream doesn't concern itself with another stream" is broken, because PartialResultStream now needs to know about requestStream.

Possibly a way around this would be to have the PRS emit events to indicate it needs a break, and concerned streams could react as a result.

Something like:

values.forEach(value => {
  res = this._addValue(value) && res;
  if (!res && !this._requestStream.isPaused()) {
-    this._requestStream.pause();
+    this.emit('paused');
  }
});
requestsStream
  .pipe(batchAndSplitOnTokenStream)
  // If we get this error, the checkpoint stream has flushed any rows
  // it had queued. We can now destroy the user's stream, as our retry
  // attempts are over.
  .on('error', (err: Error) => userStream.destroy(err))
  .on('checkpoint', (row: google.spanner.v1.PartialResultSet) => {
    lastResumeToken = row.resumeToken;
  })
  .pipe(userStream)
+  .on('paused', () => {
+    requestsStream.pause();
+  })

In fact, writing that made me realize that should work by default, shouldn't it? The only reason we don't get built-in backpressure is because we do the one-to-many split of data events. But now that we push each data event singularly, you should be able to pause the PRS stream itself and have the streams before it react properly automatically.

@olavloite
Contributor Author

> Just a thought that I know would be inconvenient to implement, and potentially not worth it because of that: after this change, the stream philosophy of "each stream doesn't concern itself with another stream" is broken, because PartialResultStream now needs to know about requestStream.

I like the idea of this. It does make it more idiomatic while the end result is the same. We do need two events, though: paused and resumed. The implementation is straightforward.

@stephenplusplus
Contributor

Thanks for doing that!

@olavloite olavloite requested a review from stephenplusplus May 31, 2020 07:32
@olavloite olavloite merged commit 558692f into googleapis:master Jun 3, 2020
@olavloite olavloite deleted the runstream-mem-consumption branch June 3, 2020 12:43

Labels

cla: yes This human has signed the Contributor License Agreement.


Development

Successfully merging this pull request may close these issues.

Spanner *runStream* cause a lot memory consumption

5 participants