Summary
We spent the past 3 months investigating an issue where "npm install" would fail intermittently.
NPM 4 shows an error like this:
npm ERR! tar.unpack unzip error C:\example-project\temp\npm-tmp\npm-13780-b27f9707\onedrive.pkgs.visualstudio.com\_packaging\team\npm\registry\cldrjs\-\cldrjs-0.4.8.tgz
When we inspected the NPM cache, we found that the downloaded tarball was truncated. This mostly seems to happen in repos that have lots of multi-megabyte dependencies, and in configurations where the NPM registry redirects the HTTP request for the tarball. It was extremely difficult to investigate because it's a race condition that will stop happening if you change any number of things (the computer, the time of day, the proxy settings, etc).
The issue also occurs with the NPM 5 client, but the error message is different (although still ultimately due to a corrupted tarball):
npm ERR! code EINTEGRITY
npm ERR! sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= integrity checksum failed when using sha1: wanted sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= but got sha1-ygU0cyjTuZdxezDk6ofWXCGJbqI=. (867808 bytes)
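To confirm that the cached tarball is actually corrupted, a quick check is to recompute the same "sha1-<base64>" string that npm 5 prints. Here is a minimal sketch (plain Node.js, nothing specific to our environment; the file path argument is just a placeholder):

```js
// check-integrity.js: print an npm-style "sha1-<base64>" digest for a local tarball,
// so it can be compared against the "wanted" / "but got" values in the error above.
const crypto = require('crypto')
const fs = require('fs')

const file = process.argv[2] // e.g. a .tgz copied out of the npm cache
const digest = crypto.createHash('sha1').update(fs.readFileSync(file)).digest('base64')
console.log('sha1-' + digest)
```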
The issue also occurs with the PNPM client, which relies on the Request library as well.
UPDATE: PNPM is no longer affected by this issue. See comments.
In this case, the symptom is an unhandled stream error:
events.js:160
throw er; // Unhandled 'error' event
^
Error: read ECONNRESET
at exports._errnoException (util.js:1018:11)
at TLSWrap.onread (net.js:572:26)
Wireshark and the VSTS server logs show that the server is correctly streaming the tarball. In fact, we can repro the issue using Fiddler on the local machine, and Fiddler downloads the full tarball, even though NPM still fails. The explanation is that the NPM client is aborting the connection due to a bug in the Request library.
Simplest Example to Reproduce
We ultimately debugged this using a physical PC that consistently repros the problem. Unfortunately we cannot find any other environment that consistently encounters the errors, even though 10 or more people have reported these issues to us over the past few months. In some cases it fails consistently on the person's computer for a day or so, then suddenly starts working again. It was really difficult to pin down.
Problem Analysis
The problem happens when the request for a tarball encounters an HTTP redirect to an Azure blob storage URL. It goes like this:
- Request.init() initializes a bunch of data structures
- Request.start() creates a connection, obtains a socket, and wires up event handlers
- Stream events such as "socket" and "request" are fired and processed by the calling application
- An HTTP response is received and passed to Request.onRequestResponse()
At this point, the lib/redirect.js module kicks in with a kind of hacky way of handling this:
- It tries to carefully delete all the state that was set up so far:
delete request.src
delete request.req
delete request._started
if (response.statusCode !== 401 && response.statusCode !== 307) {
  // Remove parameters from the previous response, unless this is the second request
  // for a server that requires digest authentication.
  delete request.body
  delete request._form
  if (request.headers) {
    request.removeHeader('host')
    request.removeHeader('content-type')
    request.removeHeader('content-length')
    if (request.uri.hostname !== request.originalHost.split(':')[0]) {
      // Remove authorization if changing hostnames (but not if just
      // changing ports or protocols). This matches the behavior of curl:
      // https://github.com/bagder/curl/blob/6beb0eee/lib/http.c#L710
      request.removeHeader('authorization')
    }
  }
}

- ...but note that it doesn't unwire any event handlers, and it obviously cannot undo any stream events that were already processed by the caller (see the EventEmitter sketch after this list)
- Then it calls Request.init() again on the same object
- Which causes Request.start() to create a new connection and a new socket, and wire up the same event handlers to these new objects
- Stream events such as "socket" and "request" are fired a second time
- The second HTTP response is received and passed to Request.onRequestResponse(), which starts streaming bytes to the calling application
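To make the hazard concrete, here is a minimal sketch using a bare EventEmitter (an illustration of the pattern only, not request's actual code): because one object is reused for both underlying connections, a "once" handler that the caller wired for the first connection is no longer attached when the stale first socket errors later.

```js
// Illustration only: a single emitter standing in for the reused Request object.
const { EventEmitter } = require('events')

const fakeRequest = new EventEmitter()

// The caller wires a one-shot error handler, believing there is exactly one request.
fakeRequest.once('error', (err) => console.log('handled:', err.message))

// The first underlying socket errors; the once-handler consumes the event and detaches.
fakeRequest.emit('error', new Error('first socket failed'))

// After the "redirect", a second socket is wired to the same emitter. When the stale
// first socket later reports ECONNRESET, no listener is left, so Node throws the
// "Unhandled 'error' event" shown in the PNPM output above.
fakeRequest.emit('error', new Error('read ECONNRESET'))
```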
Even though this design seems super brittle, it does work most of the time. In most cases, the tarball is successfully streamed and unzipped.
But every once in a while, onRequestError() will get called with an ECONNRESET error that crashes the install. Here are the details:
- The error occurs after the above sequence, and always on the second socket for that Request instance
- Inside onRequestError(), we find that self.req.socket.destroyed is true. It seems to be an old socket that was supposed to be returned to the connection pool.
- The Wireshark timings show that this ECONNRESET error is for an unused HTTP connection that has been idle for a while and is now timing out due to keepalive
- But we never see a socket in that state at the point where it is first obtained -- it seems like something is destroying the socket later
- For the PNPM repro, the application's error handler is missing because it's a "once" handler that has already fired. We tried working around this by attaching a second handler, but it just crashes later because npm-registry-client tries to push the error back down onto the response's stream, creating an infinite loop
Possible Solution
I believe the ideal solution would be to rethink the design of lib/redirect.js, to avoid trying to reconstruct the Request instance, and to avoid mixing together stream events from two different HTTP requests. That's probably a lot of work.
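As a rough sketch of that direction (my own illustration, not a patch against request's internals; the helper name is hypothetical and it only follows a single redirect), the redirect target could be fetched with a completely separate request whose response is piped into the one stream the caller sees, so no sockets or event handlers are ever shared between the two hops:

```js
const https = require('https')
const { PassThrough } = require('stream')

// Hypothetical helper: follow at most one redirect with a brand-new request,
// instead of re-initializing the original request object in place.
function getFollowingOneRedirect (url) {
  const out = new PassThrough() // the only stream the caller ever touches

  https.get(url, (res) => {
    const location = res.headers.location
    if (res.statusCode >= 300 && res.statusCode < 400 && location) {
      res.resume() // discard the redirect body
      // New request, new socket, new handlers -- nothing is shared with hop #1.
      https.get(location, (res2) => res2.pipe(out))
        .on('error', (err) => out.emit('error', err))
    } else {
      res.pipe(out)
    }
  }).on('error', (err) => out.emit('error', err))

  return out
}

// Usage with a placeholder URL:
// getFollowingOneRedirect('https://registry.example.com/cldrjs/-/cldrjs-0.4.8.tgz')
//   .pipe(require('fs').createWriteStream('cldrjs-0.4.8.tgz'))
```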
For PNPM (which we are using now), we found a workaround which simply ignores the error and prevents it from tripping up the installation:
Request.prototype.onRequestError = function (error) {
  var self = this
  if (self._aborted) {
    return
  }
  if (self.req && self.req._reusedSocket && error.code === 'ECONNRESET' &&
      self.agent.addRequestNoreuse) {
    self.agent = { addRequest: self.agent.addRequestNoreuse.bind(self.agent) }
    self.start()
    self.req.end()
    return
  }
  if (self.timeout && self.timeoutTimer) {
    clearTimeout(self.timeoutTimer)
    self.timeoutTimer = null
  }
  // PROPOSED WORKAROUND:
  if (self.req && self.req.socket && self.req.socket.destroyed) {
    return
  }
  self.emit('error', error)
}

It's interesting that a few lines above, someone else had written some logic that seems to be handling a similar issue for the ForeverAgent case. However, that code was disabled forever ago.
This workaround unfortunately does not fix NPM, even though the self.req.socket.destroyed condition does get executed many times during the install.
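For reference, one way to try the same idea without editing files inside node_modules is to wrap the prototype method from the consuming application (a sketch, assuming the patch runs before any requests are issued; request exposes its Request constructor as request.Request). Note that this simplified version short-circuits before the ForeverAgent retry and timer cleanup in the real method:

```js
// Wrap Request.prototype.onRequestError so that errors arriving on an already-
// destroyed socket are ignored, mirroring the proposed workaround above.
const { Request } = require('request')
const originalOnRequestError = Request.prototype.onRequestError

Request.prototype.onRequestError = function (error) {
  if (!this._aborted && this.req && this.req.socket && this.req.socket.destroyed) {
    return // stale socket left over from before the redirect
  }
  return originalOnRequestError.call(this, error)
}
```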
Context
I'm not sure why the race condition suddenly became so frequent this summer. We've tried upgrading and downgrading NodeJS, NPM, request, and npm-registry-client, with no effect. 8 different experts from various teams (including Azure and Visual Studio Team Services) have worked together on this investigation.
It's been a huge headache for us. I just received another problem report from someone as I was typing up this issue. :-P It would be awesome if it could get some attention. You can find lots and lots of examples online of people struggling with these symptoms. As in our experience, people figure out a black-magic "solution" that merely shuffles the race condition around and seems to work, but then another person hits the same issue.
Your Environment
| software | version |
|---|---|
| request | 2.83.0 |
| node | 6.10.2 |
| npm | 5.5.1 |
| Operating System | Windows 10 |