PR #2963 seems to have introduced a bug that breaks the CPU collector on illumos by trying to retrieve a named kstat statistic that does not exist: cpu_nsec_wait.
See cpu.c in the illumos source code, which implements these kstats: cpu_ticks_wait exists there, but cpu_nsec_wait does not.
This results in the following error message being printed:
ts=2025-02-02T14:36:26.162Z caller=collector.go:169 level=error msg="collector failed" name=cpu duration_seconds=0.000767328 err="no such file or directory"
And in the /metrics endpoint only the node_cpu_seconds_total{cpu="0",mode="user"} value shows up (no other modes or CPUs). Presumably this is down to the iteration order, with this one value being saved before we error out on the nonexistent kstat?
@discordianfish in #2963 (comment):
Can you (...) provide some more insight in the differences between solaris versions? None of the maintainers have solaris systems at hand afaik, so its kinda harder to support and not break these things
illumos and Oracle Solaris are generally quite compatible given their common ancestry, but they have drifted apart a bit over the years. For example, the exact set of kstats each one implements differs slightly. We can simply query the kstat values from the command line to see which ones are available:
On illumos:
# uname -a && kstat -c misc -m cpu -i 0 | egrep 'cpu_(ticks|nsec)_'
SunOS wookiee 5.11 joyent_20250123T000246Z i86pc i386 i86pc
cpu_nsec_dtrace 0
cpu_nsec_idle 603566651597601
cpu_nsec_intr 5759459488248
cpu_nsec_kernel 295601044993590
cpu_nsec_user 44010238329188
cpu_ticks_idle 603566651
cpu_ticks_kernel 295601044
cpu_ticks_user 44010238
cpu_ticks_wait 0
On Oracle Solaris:
# uname -a && kstat -c misc -m cpu -i 0 | egrep 'cpu_(ticks|nsec)_'
SunOS solaris 5.11 11.4.0.15.0 i86pc i386 i86pc
cpu_nsec_idle 6427553705227
cpu_nsec_intr 6201354153
cpu_nsec_kernel 19453658924
cpu_nsec_stolen 0
cpu_nsec_user 10548394436
cpu_ticks_idle 642755
cpu_ticks_kernel 1945
cpu_ticks_stolen 0
cpu_ticks_user 1054
cpu_ticks_wait 0
So it appears that the cpu_nsec_wait kstat exists in neither illumos nor Oracle Solaris.
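The comparison above can also be done mechanically. Below is a rough Go sketch that takes the two kstat listings pasted above (copied verbatim into string constants) and computes which cpu_nsec_* names both systems share. The helper names statNames and commonWithPrefix are made up for this illustration and are not part of node_exporter.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// Output of `kstat -c misc -m cpu -i 0 | egrep 'cpu_(ticks|nsec)_'`
// as pasted in this issue (illumos first, then Oracle Solaris).
const illumosOut = `cpu_nsec_dtrace 0
cpu_nsec_idle 603566651597601
cpu_nsec_intr 5759459488248
cpu_nsec_kernel 295601044993590
cpu_nsec_user 44010238329188
cpu_ticks_idle 603566651
cpu_ticks_kernel 295601044
cpu_ticks_user 44010238
cpu_ticks_wait 0`

const solarisOut = `cpu_nsec_idle 6427553705227
cpu_nsec_intr 6201354153
cpu_nsec_kernel 19453658924
cpu_nsec_stolen 0
cpu_nsec_user 10548394436
cpu_ticks_idle 642755
cpu_ticks_kernel 1945
cpu_ticks_stolen 0
cpu_ticks_user 1054
cpu_ticks_wait 0`

// statNames collects the statistic name (first field) of every
// "name value" line in the given kstat output.
func statNames(output string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			names[fields[0]] = true
		}
	}
	return names
}

// commonWithPrefix returns, sorted, the names present in both sets
// that start with prefix.
func commonWithPrefix(a, b map[string]bool, prefix string) []string {
	var out []string
	for name := range a {
		if b[name] && strings.HasPrefix(name, prefix) {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	il, sol := statNames(illumosOut), statNames(solarisOut)
	fmt.Println("nsec kstats on both:", commonWithPrefix(il, sol, "cpu_nsec_"))
	fmt.Println("cpu_nsec_wait on illumos:", il["cpu_nsec_wait"])
	fmt.Println("cpu_nsec_wait on Solaris:", sol["cpu_nsec_wait"])
}
```

Going by the two listings, only idle, intr, kernel and user have an nsec counter on both platforms, so those are the only names a shared code path can rely on.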
Seems like @rexagod assumed cpu_nsec_wait existed based on the previous issue description:
@davepacheco in #1837 (comment):
The straightforward solution would be to use the cpu_nsec_{idle,kernel,user,wait} kstats instead of the cpu_ticks_{idle,kernel,user,wait} kstats.
The question remains if/how to implement wait when the nsec counter for it does not exist. I see it's zero on both systems I tested, and indeed in the illumos source code we can see it's always set to zero and has been that way since the fork from Solaris. So perhaps the best way forward is to just remove it or hardcode it to zero.