Changed the way that stuck NFS mounts are handled. #997

juliusv merged 1 commit into prometheus:master
Conversation
At first glance, this doesn't look thread-safe. The exporter needs to be able to handle concurrent requests for metrics.

Could you elaborate on why it doesn't look thread-safe? I kept that in mind while putting this together, and as far as my understanding goes, channels are thread-safe.

Normally we do these kinds of things with a mutex; I'm not an expert in channels. /cc @juliusv What do you think about this?
I had originally done it with mutexes, but I wanted to avoid any blocking if possible, which I think can be done using channels instead. If it matches standard practice better, I can create a new PR with mutexes instead.

I will take a look at this soon. For now, please add a DCO sign-off.
Yeah, this doesn't seem quite right to me. Besides the philosophical questions of introducing such state, there can be two scrapes at the same time, and if they both try to stat the FS at roughly the same moment, only one of them will return metrics for it, even though nothing is "stuck". You'd have to explicitly track statfs calls that are really stuck, i.e. taking longer than some timeout value, and only block those. But even then, I don't think it's good practice to fail silently and just drop metrics. There should be at least some metric that indicates that a given FS is experiencing collection errors.

Sorry about the DCO sign-off; I made a quick change on GitHub and didn't do it there. What would be an appropriate timeout value to use? Also, the metric isn't dropped completely silently; it's reported as a device error.
As far as mutex vs channels, would this approach be better? https://github.com/mknapphrt/node_exporter/blob/stuckmountmutex/collector/filesystem_linux.go |
Not sure, maybe 30s?
Ah sorry, I missed that. Great.
Yeah, I don't think we need channels here. But we should first define, at a high level, how to treat "stuck" FSes before talking about the exact implementation. So let's say one statfs call takes >30s; it would then globally mark that filesystem as stuck, and others would avoid stat-ing it. When would it be marked unstuck, if ever? When the stuck statfs call eventually returns?
There's a race here between the second and third step: if you detect a timeout but the statfs call finishes at just that moment and marks the mount unstuck, you must ensure you don't then still mark it as stuck, because then it would never be marked unstuck again. That would also have to be coordinated / locked.
@juliusv Would you recommend making a different PR with those specs or just modifying this one?
You can just force-push your branch to do any fixups you need, no need for a separate PR.

Would it be best to just hardcode the 30 second timeout, or add it as a flag?
I'd not add flags unless there are divergent user needs requiring them. The
goal should be that exporters work out of the box for everyone.
collector/filesystem_linux.go
Rather than giving it a generic type name, name a mutex after what it is protecting, like stuckMountsMtx.
collector/filesystem_linux.go
"at the same time the timeout procs"... hmm somehow that sentence fragment doesn't parse for me?
collector/filesystem_linux.go
I'd call this stuckMountWatcher or something that makes its purpose clearer.
collector/filesystem_linux.go
I'm not sure we need this second lock channel. Since we are already holding the mutex here, can't we just check again under the mutex that the success channel hasn't been closed yet?
(Though I think that requires moving the closing of the success channel into the mutex-protected section in GetStats() too)
I think you're right; for some reason I had convinced myself of a case where it wouldn't work like that, but I don't remember what it was, and I don't see a way it wouldn't work now.
collector/filesystem_linux.go
style nit: add space after "//"
Looks great to me now, besides the last nits.

@mknapphrt Ah sorry, the DCO check is still not passing because some of the commits don't have a sign-off line. Could you squash it all into one commit, with a sign-off line?
…turn, it will stop being queried until it returns. Fixed spelling mistakes. Update transport_generic.go Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck. Removed unnecessary lock channel and clarified some var names. Fixed style nits. Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
How would I go about checking why buildkite failed?

Buildkite is failing because it's out of disk space on a couple of the platform builds.

Yeah, I believe we can ignore the buildkite error here. 👍 Thanks!
Next up we might want some metrics here to provide a stuck status. |
|
@SuperQ This is already tracked in a |
|
Sounds fine. |
This PR is meant to handle some of the issues mentioned in #868 and #244. It attempts to stop monitoring an NFS mount if the mount doesn't return from the stat call. For each mount, a channel is created that acts like a lock for the mount point. If, when scraped, the channel can't accept, it means the previous call to stat the mount never returned, so the mount point is skipped over. Once that "stuck" mount point recovers, it writes to the channel and monitoring resumes for that mount point.
I know one of the biggest issues with this kind of approach is that the exporter is supposed to be stateless, and this will introduce some state. But I feel the benefits of being able to monitor the mounts are worth it. I'm not a Prometheus exporter expert, though, so any opinions would be appreciated.
Signed-off-by: Mark Knapp mknapp@hudson-trading.com