Changed the way that stuck NFS mounts are handled. #997

juliusv merged 1 commit into prometheus:master
Conversation
At first glance, this doesn't look thread-safe. The exporter needs to be able to handle concurrent requests for metrics.

Could you elaborate on why it doesn't look thread-safe? I kept that in mind while putting this together, and as far as my understanding goes, channels are thread-safe.

Normally we do these kinds of things with a mutex; I'm not an expert in channels. /cc @juliusv What do you think about this?
I had originally done it with mutexes, but I wanted to avoid any blocking if possible, which I think can be done using channels instead. If it matches standard practice better, I can create a new PR with mutexes instead.

I will take a look at this soon. For now, please add a DCO sign-off.
Yeah, this doesn't seem quite right to me. Besides the philosophical questions of introducing such state, there can be two scrapes at the same time, and if they both try to stat the FS at roughly the same moment, only one of them will return metrics for it, even though nothing is "stuck". You'd have to explicitly track statfs calls that are really stuck, i.e. taking longer than some timeout value, and only block those. But even then, I don't think it's good practice to fail silently and just drop metrics. There should be at least some metric that indicates that a given FS is experiencing collection errors.

Sorry about the DCO sign-off; I made a quick change on GitHub and didn't do it there. What would be an appropriate timeout value to use? Also, the metric isn't dropped completely silently; it's reported as a device error.
As far as mutex vs channels, would this approach be better? https://github.com/mknapphrt/node_exporter/blob/stuckmountmutex/collector/filesystem_linux.go |
Not sure, maybe 30s?
Ah sorry, I missed that. Great.
Yeah, I don't think we need channels here. But we should first define, at a high level, how to treat "stuck" FSes before talking about the exact implementation. So let's say one statfs call takes >30s; it would then globally mark that filesystem as stuck, and others would avoid stat-ing it. When would it be marked unstuck, if ever? When the stuck statfs call eventually returns?
There's a race here between the second and third step: if you detect a timeout but the statfs call finishes at just that moment and marks the mount unstuck, you must ensure you don't then still mark it as stuck, because then it would never be marked unstuck again. That would also have to be coordinated / locked.
@juliusv Would you recommend making a different PR with those specs or just modifying this one?
You can just force-push your branch to do any fixups you need, no need for a separate PR.

Would it be best to just hardcode the 30 second timeout, or add it as a flag?
I'd not add flags unless there are divergent user needs requiring them. The
goal should be that exporters work out of the box for everyone.
collector/filesystem_linux.go
Rather than giving it a generic type name, name a mutex after what it is protecting, like stuckMountsMtx.
collector/filesystem_linux.go
"at the same time the timeout procs"... hmm somehow that sentence fragment doesn't parse for me?
collector/filesystem_linux.go
I'd call this stuckMountWatcher or something that makes its purpose clearer.
collector/filesystem_linux.go
I'm not sure we need this second lock channel. Since we are already holding the mutex here, can't we just check again under the mutex that the success channel hasn't been closed yet?
(Though I think that requires moving the closing of the success channel into the mutex-protected section in GetStats() too)
I think you're right; for some reason I had convinced myself of a case where it wouldn't work like that, but I don't remember what it was, and I don't see a way it wouldn't work now.
collector/filesystem_linux.go
style nit: add space after "//"
Looks great to me now, besides the last nits.

@mknapphrt Ah sorry, the DCO check is still not passing because some of the commits don't have a sign-off line. Could you squash it all into one commit, with a sign-off line?
…turn, it will stop being queried until it returns. Fixed spelling mistakes. Update transport_generic.go Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck. Removed unnecessary lock channel and clarified some var names. Fixed style nits. Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
How would I go about checking why buildkite failed?

Buildkite is failing because it's out of disk space on a couple of the platform builds.

Yeah, I believe we can ignore the buildkite error here. 👍 Thanks!
Next up we might want some metrics here to provide a stuck status. |
|
@SuperQ This is already tracked in a |
|
Sounds fine. |
This PR is meant to handle some of the issues mentioned in #868 and #244. It attempts to stop monitoring an NFS mount if the mount doesn't return from the stat call. For each mount, a channel is created that acts like a lock for the mount point. If, when scraped, the channel can't accept, it means the previous call to stat the mount never returned, so the mount point is skipped over. Once that "stuck" mount point recovers, it writes to the channel and monitoring resumes for that mount point.
I know one of the biggest issues with this kind of approach is that the exporter is supposed to be stateless, and this will introduce some state. But I feel the benefits of being able to monitor the mounts are worth it. I'm not a Prometheus exporter expert, though, so any opinions would be appreciated.
Signed-off-by: Mark Knapp mknapp@hudson-trading.com