Summary
We spent the past 3 months investigating an issue where "npm install" would fail intermittently.
NPM 4 shows an error like this:
npm ERR! tar.unpack unzip error C:\example-project\temp\npm-tmp\npm-13780-b27f9707\onedrive.pkgs.visualstudio.com\_packaging\team\npm\registry\cldrjs\-\cldrjs-0.4.8.tgz
When we inspected the NPM cache, we found that the downloaded tarball was truncated. This mostly seems to happen in repos that have lots of multi-megabyte dependencies, and in configurations where the NPM registry redirects the HTTP request for the tarball. It was extremely difficult to investigate because it's a race condition that will stop happening if you change any number of things (the computer, the time of day, the proxy settings, etc).
The issue also occurs with the NPM 5 client, but the error message is different (although still ultimately due to a corrupted tarball):
npm ERR! code EINTEGRITY
npm ERR! sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= integrity checksum failed when using sha1: wanted sha1-20E+D1MrUwyeZvYOmVLkQnQ+o1Y= but got sha1-ygU0cyjTuZdxezDk6ofWXCGJbqI=. (867808 bytes)
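To confirm that the cached tarball is actually corrupted, a quick check is to recompute the same "sha1-<base64>" string that npm 5 prints. Here is a minimal sketch (plain Node.js, nothing specific to our environment; the file path argument is just a placeholder):

```js
// check-integrity.js: print an npm-style "sha1-<base64>" digest for a local tarball,
// so it can be compared against the "wanted" / "but got" values in the error above.
const crypto = require('crypto')
const fs = require('fs')

const file = process.argv[2] // e.g. a .tgz copied out of the npm cache
const digest = crypto.createHash('sha1').update(fs.readFileSync(file)).digest('base64')
console.log('sha1-' + digest)
```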
The issue also occurs with the PNPM client, which relies on the Request library as well.
UPDATE: PNPM is no longer affected by this issue. See comments.
In this case, the symptom is an unhandled stream error:
events.js:160
throw er; // Unhandled 'error' event
^
Error: read ECONNRESET
at exports._errnoException (util.js:1018:11)
at TLSWrap.onread (net.js:572:26)
Wireshark and the VSTS server logs show that the server is correctly streaming the tarball. In fact, we can repro the issue using Fiddler on the local machine, and Fiddler downloads the full tarball, even though NPM still fails. The explanation is that the NPM client is aborting the connection due to a bug in the Request library.
Simplest Example to Reproduce
We ultimately debugged this using a physical PC that consistently repros the problem. Unfortunately we cannot find any other environment that consistently encounters the errors, even though 10 or more people have reported these issues to us over the past few months. In some cases it fails consistently on the person's computer for a day or so, then suddenly starts working again. It was really difficult to pin down.
Problem Analysis
The problem happens when the request for a tarball encounters an HTTP redirect to an Azure blob storage URL. It goes like this:
- Request.init() initializes a bunch of data structures
- Request.start() creates a connection, obtains a socket, and wires up event handlers
- Stream events such as "socket" and "request" are fired and processed by the calling application
- An HTTP response is received and passed to Request.onRequestResponse()
At this point, the lib/redirect.js module kicks in with a kind of hacky way of handling this:
- It tries to carefully delete all the state that was set up so far:
delete request.src
delete request.req
delete request._started
if (response.statusCode !== 401 && response.statusCode !== 307) {
  // Remove parameters from the previous response, unless this is the second request
  // for a server that requires digest authentication.
  delete request.body
  delete request._form
  if (request.headers) {
    request.removeHeader('host')
    request.removeHeader('content-type')
    request.removeHeader('content-length')
    if (request.uri.hostname !== request.originalHost.split(':')[0]) {
      // Remove authorization if changing hostnames (but not if just
      // changing ports or protocols). This matches the behavior of curl:
      // https://github.com/bagder/curl/blob/6beb0eee/lib/http.c#L710
      request.removeHeader('authorization')
    }
  }
}

- ...but note that it doesn't unwire any event handlers, and it obviously cannot undo any stream events that were already processed by the caller (see the EventEmitter sketch after this list)
- Then it calls Request.init() again on the same object
- Which causes Request.start() to create a new connection and a new socket, and wire up the same event handlers to these new objects
- Stream events such as "socket" and "request" are fired a second time
- The second HTTP response is received and passed to Request.onRequestResponse(), which starts streaming bytes to the calling application
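To make the hazard concrete, here is a minimal sketch using a bare EventEmitter (an illustration of the pattern only, not request's actual code): because one object is reused for both underlying connections, a "once" handler that the caller wired for the first connection is no longer attached when the stale first socket errors later.

```js
// Illustration only: a single emitter standing in for the reused Request object.
const { EventEmitter } = require('events')

const fakeRequest = new EventEmitter()

// The caller wires a one-shot error handler, believing there is exactly one request.
fakeRequest.once('error', (err) => console.log('handled:', err.message))

// The first underlying socket errors; the once-handler consumes the event and detaches.
fakeRequest.emit('error', new Error('first socket failed'))

// After the "redirect", a second socket is wired to the same emitter. When the stale
// first socket later reports ECONNRESET, no listener is left, so Node throws the
// "Unhandled 'error' event" shown in the PNPM output above.
fakeRequest.emit('error', new Error('read ECONNRESET'))
```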
Even though this design seems super brittle, it does work most of the time. In most cases, the tarball is successfully streamed and unzipped.
But every once in a while, onRequestError() will get called with an ECONNRESET error that crashes the install. Here are the details:
- The error occurs after the above sequence, and always on the second socket for that Request instance
- Inside onRequestError(), we find that self.req.socket.destroyed is true. It seems to be an old socket that was supposed to be returned to the connection pool.
- The Wireshark timings show that this ECONNRESET error is for an unused HTTP connection that has been idle for a while and is now timing out due to keepalive
- But we never see a socket in that state at the point where it is first obtained -- it seems like something is destroying the socket later
- For the PNPM repro, the application's error handler is missing because it's a "once" handler that has already fired. We tried working around this by attaching a second handler, but it just crashes later because npm-registry-client tries to push the error back down onto the response's stream, creating an infinite loop
Possible Solution
I believe the ideal solution would be to rethink the design of lib/redirect.js, to avoid trying to reconstruct the Request instance, and to avoid mixing together stream events from two different HTTP requests. That's probably a lot of work.
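As a rough sketch of that direction (my own illustration, not a patch against request's internals; the helper name is hypothetical and it only follows a single redirect), the redirect target could be fetched with a completely separate request whose response is piped into the one stream the caller sees, so no sockets or event handlers are ever shared between the two hops:

```js
const https = require('https')
const { PassThrough } = require('stream')

// Hypothetical helper: follow at most one redirect with a brand-new request,
// instead of re-initializing the original request object in place.
function getFollowingOneRedirect (url) {
  const out = new PassThrough() // the only stream the caller ever touches

  https.get(url, (res) => {
    const location = res.headers.location
    if (res.statusCode >= 300 && res.statusCode < 400 && location) {
      res.resume() // discard the redirect body
      // New request, new socket, new handlers -- nothing is shared with hop #1.
      https.get(location, (res2) => res2.pipe(out))
        .on('error', (err) => out.emit('error', err))
    } else {
      res.pipe(out)
    }
  }).on('error', (err) => out.emit('error', err))

  return out
}

// Usage with a placeholder URL:
// getFollowingOneRedirect('https://registry.example.com/cldrjs/-/cldrjs-0.4.8.tgz')
//   .pipe(require('fs').createWriteStream('cldrjs-0.4.8.tgz'))
```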
For PNPM (which we are using now), we found a workaround which simply ignores the error and prevents it from tripping up the installation:
Request.prototype.onRequestError = function (error) {
  var self = this
  if (self._aborted) {
    return
  }
  if (self.req && self.req._reusedSocket && error.code === 'ECONNRESET' &&
      self.agent.addRequestNoreuse) {
    self.agent = { addRequest: self.agent.addRequestNoreuse.bind(self.agent) }
    self.start()
    self.req.end()
    return
  }
  if (self.timeout && self.timeoutTimer) {
    clearTimeout(self.timeoutTimer)
    self.timeoutTimer = null
  }
  // PROPOSED WORKAROUND:
  if (self.req && self.req.socket && self.req.socket.destroyed) {
    return
  }
  self.emit('error', error)
}

It's interesting that a few lines above, someone else had written some logic that seems to be handling a similar issue for the ForeverAgent case. However, that code was disabled forever ago.
This workaround unfortunately does not fix NPM, even though the self.req.socket.destroyed condition does get executed many times during the install.
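For reference, one way to try the same idea without editing files inside node_modules is to wrap the prototype method from the consuming application (a sketch, assuming the patch runs before any requests are issued; request exposes its Request constructor as request.Request). Note that this simplified version short-circuits before the ForeverAgent retry and timer cleanup in the real method:

```js
// Wrap Request.prototype.onRequestError so that errors arriving on an already-
// destroyed socket are ignored, mirroring the proposed workaround above.
const { Request } = require('request')
const originalOnRequestError = Request.prototype.onRequestError

Request.prototype.onRequestError = function (error) {
  if (!this._aborted && this.req && this.req.socket && this.req.socket.destroyed) {
    return // stale socket left over from before the redirect
  }
  return originalOnRequestError.call(this, error)
}
```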
Context
I'm not sure why the race condition suddenly became so frequent this summer. We've tried upgrading and downgrading NodeJS, NPM, request, and npm-registry-client, with no effect. 8 different experts from various teams (including Azure and Visual Studio Team Services) have worked together on this investigation.
It's been a huge headache for us. I just received another problem report from someone as I was typing up this issue. :-P It would be awesome if it could get some attention. You can find lots and lots of examples online of people struggling with these symptoms. As in our experience, people figure out a black-magic "solution" that merely shuffles the race condition around and seems to work, but then another person hits the same issue.
Your Environment
| software | version |
|---|---|
| request | 2.83.0 |
| node | 6.10.2 |
| npm | 5.5.1 |
| Operating System | Windows 10 |