Release v0.14.0-rc.1. #423

Merged
SuperQ merged 1 commit into master from superq/v0.14.0-rc.1
Jan 16, 2017

@SuperQ
Member

@SuperQ SuperQ commented Jan 15, 2017

  • Update CHANGELOG
  • Update VERSION

Changes:
NOTE: We are deprecating several collectors in this release.

  • gmond - Out of scope.
  • megacli - Requires forking, moved to textfile collection.
  • ntp - Out of scope.

@mdlayher
Contributor

Wifi collector is Linux only for now, by the way.

Stoked!

@mdlayher
Contributor

Also, thoughts on enabling more collectors by default?

I can speak for the wifi and mountstats collectors, at least, being useful to have enabled by default.

If the machine isn't using WiFi or NFS, neither will report any metrics.

@SuperQ
Member Author

SuperQ commented Jan 15, 2017

I don't object, as long as they behave well when nothing is enabled on the node.

@discordianfish
Member

There is still #216 open, which we wanted to get in. I'm flying out tomorrow, so I won't have time this week. If you think we should get this out now, we can postpone it IMO.

Member

@discordianfish discordianfish left a comment

LGTM

Member

@discordianfish discordianfish left a comment

err LGTM

@SuperQ
Member Author

SuperQ commented Jan 16, 2017

Yea, I wanted to see that get in, but there's been no progress on #390 in 2 weeks. I'm not sure it's worth waiting for.

@SuperQ SuperQ merged commit 5a07f41 into master Jan 16, 2017
@SuperQ SuperQ deleted the superq/v0.14.0-rc.1 branch January 16, 2017 15:55
@jcberthon

Would it be possible to share with us why ntp is considered out of scope? Is it because it relies on the ntpd software being installed and therefore should get a specific exporter? If yes, will the current code be reused to provide an ntpd_exporter?

@matthiasr
Contributor

@jcberthon yes, that is the reason. Right now nothing is being removed, and we'd like to only do so once alternatives are available. Would you like to take the code and throw together a standalone exporter?

@SuperQ
Member Author

SuperQ commented Feb 9, 2017

The reason we decided ntp was out of scope is that it functions as a blackbox probe. The collector does a real-time NTP probe against an external server. This could generate very high traffic if someone were to point a large number of servers at pool.ntp.org or similar. It uses a Go implementation of the NTP protocol, which is totally fine, but we didn't feel it was a good fit for keeping in the node_exporter.

While I personally found this probe very useful as an additional check against ntpd or other time-syncing client software running on servers, it did produce a lot of jitter and a lot of extra packets to our NTP server pools in production. A typical NTP client only sends one probe every ~15 minutes per server, not every 15-30 sec like a node_exporter being scraped.

It's also useful for nodes that are not running NTP clients for whatever reason.
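
To put rough numbers on the probe rates mentioned above, here is a back-of-the-envelope sketch. The fleet size of 1000 nodes and the helper name are illustrative assumptions, not figures from production; only the intervals come from the comparison above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def probes_per_day(nodes: int, interval_seconds: float) -> float:
    """Total NTP probes per day if every node sends one probe per interval."""
    return nodes * SECONDS_PER_DAY / interval_seconds


nodes = 1000  # hypothetical fleet size, chosen only for illustration

client = probes_per_day(nodes, 15 * 60)  # typical NTP client: one probe every ~15 min
scrape_30s = probes_per_day(nodes, 30)   # collector probing on every 30 s scrape
scrape_15s = probes_per_day(nodes, 15)   # collector probing on every 15 s scrape

print(f"NTP client:  {client:,.0f} probes/day")
print(f"30 s scrape: {scrape_30s:,.0f} probes/day ({scrape_30s / client:.0f}x)")
print(f"15 s scrape: {scrape_15s:,.0f} probes/day ({scrape_15s / client:.0f}x)")
```

Under these assumptions the scrape-driven probe sends 30 to 60 times as many packets as a regular NTP client would.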

There are a few ways we can replace this functionality.

  • The code would be easy to adapt to a stand-alone blackbox prober.
  • We would like to add a node_system_clock_milliseconds metric to the node_exporter, and a function in Prometheus to compare the metric to the scrape time of the sample. This would give us the difference between the node's clock and the Prometheus server's clock to within about ±1 ms, without having to actually probe anything external to the node.
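
The comparison in the second bullet could look roughly like the sketch below. Both the node_system_clock_milliseconds metric and the comparison function are only proposals at this point, so the `Sample` structure and the `clock_offset_ms` name are hypothetical stand-ins, not existing Prometheus or node_exporter APIs.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scraped sample of the proposed clock metric (hypothetical shape)."""
    value_ms: float       # node's wall clock at collection, in ms since the epoch
    scrape_time_s: float  # timestamp the Prometheus server attached, in seconds


def clock_offset_ms(sample: Sample) -> float:
    """Node clock minus server clock, in ms. Positive means the node is ahead."""
    return sample.value_ms - sample.scrape_time_s * 1000.0


# Illustrative values only: a node whose clock runs 250 ms ahead of the server.
s = Sample(value_ms=1486650000250.0, scrape_time_s=1486650000.0)
print(clock_offset_ms(s))
```

This sketch ignores the delay between the start of the scrape and the moment the exporter reads its clock, so any real result would include that latency as error.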

As for monitoring ntpd and other ntp clients, this is something we could easily add as a textfile helper tool. This would export the real metrics provided by NTP client software running on nodes. I have already written a couple, but they're currently not open source. I will attempt to re-implement them and publish them sometime soon. Maybe in Python this time instead of shell. 😄

@discordianfish
Member

There is already something like a node_system_clock_milliseconds metric. Not sure what it's named; I think node_time or something.

@SuperQ
Member Author

SuperQ commented Feb 9, 2017

@discordianfish Seems like node_time is seconds resolution, so not really sufficient for monitoring node offsets.

@discordianfish
Member

@SuperQ Right, but how would you figure out the precise timestamp of the scrape? The best I came up with was just time() - node_time, which might be off by up to the scrape interval anyway.
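
A small simulation of why that expression drifts, assuming a 15 s scrape interval and a node whose clock is perfectly in sync (both assumptions are for illustration only, as are the function and variable names). The query's evaluation time keeps moving while the sampled value stays fixed until the next scrape:

```python
def naive_offset_s(eval_time_s: float, node_time_s: float) -> float:
    """What time() - node_time yields: evaluation time minus the sampled value."""
    return eval_time_s - node_time_s


scrape_interval_s = 15       # hypothetical scrape interval
scrape_time_s = 1_000_000    # moment the sample was collected
node_time_s = scrape_time_s  # node clock in perfect sync, so the true offset is 0

# Evaluate the expression once per second until the next scrape would land.
apparent = [
    naive_offset_s(scrape_time_s + age, node_time_s)
    for age in range(scrape_interval_s)
]
print(min(apparent), max(apparent))
```

The apparent offset grows with the age of the sample even though the real offset is zero throughout.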

@SuperQ
Member Author

SuperQ commented Feb 9, 2017

@brian-brazil was talking at FOSDEM about a specific function to compare sample values with their collection timestamps. Not implemented yet.

@discordianfish
Member

Ah, yes something like that would be great.

@jcberthon

@matthiasr "would like" yes, but with 4 very young kids at home in Winter and a full time job, I have very limited time available for that. In addition, I haven't yet installed prometheus (no time) but I was looking into it to see if I could use it to monitor my Raspberry Pi NTP server, hence my initial interest ;-)

@SuperQ
Member Author

SuperQ commented Feb 14, 2017

If you want to monitor an NTP server, you definitely want the NTP metrics helper script, not the ntp collector plugin.

See: #458

@jcberthon

@SuperQ thanks for the hint. However, ntpq -pn gives you a view of the sync status per source. That is useful information, but not enough to know the "quality" of your NTP server, especially because those sources might change (if you use the NTP pool project, for instance).

IMHO, it is better to use ntpq -c rv in order to get the NTP server's kernel status: sync or not, stratum, rootdisp+rootdelay/2 (which for me is the maximum time difference to true UTC, in ms), and offset (also in ms, but I'm still unsure what that is exactly; the ntp documentation is very unclear). I'm far from an NTP specialist, but I think those values give you a better view of the "quality" of your NTP server than what ntpq -pn returns.

I'm also monitoring the frequency and sys_jitter values returned by ntpq -c rv, but I'm not sure how to interpret them correctly.

For monitoring this, I'm using a simple script at the moment which loops around this command:

echo "ntpstats,host=$(hostname) $(ntpq -c "rv 0 offset,sys_jitter" | sed 's/ //g'),$(ntpq -c "rv 0 rootdisp,rootdelay" | sed 's/ //g')" | curl --silent --show-error -i -XPOST 'https://influxdb.lan:8086/write?db=telegraf' -u ${username}:${password} --data-binary  @- >/dev/null

Note: as one can see, I'm using the telegraf line protocol with an influxdb database, and grafana for display. But I want to investigate alternatives, hence my interest in prometheus.
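
For comparison, a Prometheus-side equivalent could parse the same ntpq output and write it out for the node_exporter textfile collector. The following is only a sketch: the metric names, the function names, and the sample output string are all made up for illustration, and the millisecond unit is taken from the description above rather than verified against the ntp documentation.

```python
import re
from typing import Dict

# Matches key=value pairs whose value is a plain decimal number, e.g. "offset=0.123".
# Quoted strings, hex timestamps, and IP addresses are deliberately not matched.
_PAIR = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)(?=,|\s|$)")

# Variables picked out of the "rv" output (the ones discussed above).
WANTED = ("offset", "sys_jitter", "rootdisp", "rootdelay")


def parse_rv(output: str) -> Dict[str, float]:
    """Extract the numeric variables from `ntpq -c rv` style output."""
    return {key: float(value) for key, value in _PAIR.findall(output)}


def to_textfile(values: Dict[str, float]) -> str:
    """Render the values in the Prometheus text format (hypothetical metric names)."""
    lines = [
        f"ntpd_{key}_milliseconds {values[key]}" for key in WANTED if key in values
    ]
    if "rootdisp" in values and "rootdelay" in values:
        # rootdisp + rootdelay/2, the "maximum difference to true UTC" from above.
        max_error = values["rootdisp"] + values["rootdelay"] / 2
        lines.append(f"ntpd_max_error_milliseconds {max_error}")
    return "\n".join(lines) + "\n"


# Made-up sample output, standing in for the two ntpq calls in the one-liner.
sample = "offset=0.25, sys_jitter=0.5\nrootdisp=12.5, rootdelay=3.0"
print(to_textfile(parse_rv(sample)), end="")
```

In a real helper the sample string would come from running ntpq, and the result would be written to a .prom file in the directory the textfile collector reads.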

@SuperQ
Member Author

SuperQ commented Feb 14, 2017

@jcberthon, That's super useful information! I have filed #462 to add additional metrics.

Development

Successfully merging this pull request may close these issues.

Panic with DBus
