Refactor perf collector to better handle failures and warn on opens#2190
hodgesds wants to merge 1 commit into prometheus:master
collector/perf_linux.go
Outdated
I may consider using AllCacheProfilers for clearer code but that may enable some profilers that aren't used and cause more multiplexing of perf events on the CPU.
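To make the trade-off concrete, here is a minimal sketch of composing an explicit profiler bitmask versus using an "all" mask. The type and constant names (CacheProfilerSet, L1DataProfiler, AllCacheProfilerSet and so on) are hypothetical and are not the library's actual identifiers; the point is only that an "all" mask opens more events than an explicit subset, which is what can lead to extra multiplexing.

```go
package main

import (
	"fmt"
	"math/bits"
)

// CacheProfilerSet is a hypothetical bitmask type used for illustration only;
// it is not the real type behind AllCacheProfilers.
type CacheProfilerSet uint

const (
	L1DataProfiler CacheProfilerSet = 1 << iota
	L1InstrProfiler
	LLCProfiler
	DataTLBProfiler

	// AllCacheProfilerSet enables every hypothetical kind declared above.
	AllCacheProfilerSet = L1DataProfiler | L1InstrProfiler | LLCProfiler | DataTLBProfiler
)

// Has reports whether every bit in want is set in s.
func (s CacheProfilerSet) Has(want CacheProfilerSet) bool {
	return s&want == want
}

// Count returns how many profiler kinds are enabled, i.e. how many perf
// events would need to be opened for this set.
func (s CacheProfilerSet) Count() int {
	return bits.OnesCount(uint(s))
}

func main() {
	explicit := L1DataProfiler | LLCProfiler
	fmt.Println("explicit events:", explicit.Count())
	fmt.Println("all events:", AllCacheProfilerSet.Count())
}
```

The explicit form is more verbose but only opens the events the collector actually reads.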
94a0c8d to
a18a787
Compare
collector/perf_linux.go
Outdated
This function is just a wrapper around unix.Getrlimit.
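For readers who have not seen this kind of wrapper, a minimal sketch follows. It uses syscall.Getrlimit from the standard library rather than unix.Getrlimit so that it has no external dependency (the two take the same arguments), and the function name openFileLimit is an assumption, not the name used in the diff.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit is a hypothetical helper that returns the soft and hard
// limits on the number of open file descriptors for this process.
func openFileLimit() (uint64, uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, fmt.Errorf("getrlimit RLIMIT_NOFILE: %w", err)
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("open file limit: soft=%d hard=%d\n", soft, hard)
}
```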
360ae91 to
e78e5d7
Compare
collector/perf_linux.go
Outdated
In theory these could come from an object pool so extra allocations aren't required.
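As a rough illustration of the object pool idea, here is a sketch using sync.Pool. The cacheSample struct and the getSample/putSample helpers are hypothetical stand-ins for whatever profile value type the collector fills in on each scrape.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheSample is a hypothetical result struct, standing in for the profile
// values a profiler would populate on each collection.
type cacheSample struct {
	Hits   uint64
	Misses uint64
}

// samplePool hands out reusable cacheSample values so that each collection
// does not necessarily allocate a fresh one.
var samplePool = sync.Pool{
	New: func() interface{} { return &cacheSample{} },
}

// getSample fetches a zeroed sample from the pool.
func getSample() *cacheSample {
	return samplePool.Get().(*cacheSample)
}

// putSample resets the sample and returns it to the pool for reuse.
func putSample(s *cacheSample) {
	*s = cacheSample{}
	samplePool.Put(s)
}

func main() {
	s := getSample()
	s.Hits, s.Misses = 10, 2
	fmt.Println(s.Hits, s.Misses)
	putSample(s)
}
```

Resetting the value before returning it to the pool keeps stale counters from leaking into the next collection.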
After this diff I'd like to add stalled frontend/backend cycles, as they can be used to give a better picture of instruction performance. Here's a post that has some background info as well as some of the ways that the
Is it possible to just attempt opening the profiler handles and fail with a meaningful error? Not sure how I feel about checking open files. All(?) other collectors just attempt opening stuff and fail if they can't.
Potentially. The problem is that some of the profilers could be unavailable due to kernel configuration or even the physical hardware (e.g. the CPU doesn't implement all hardware counters). The other problem is that there is a large number of combinations of profiler handle types, so having one flag to enable them all is a little problematic.
That's reasonable if setting up the profiler handles fails fast. I was almost thinking that check would be better as an info check during startup to let users know if they are using default configs that may need tuning (IIRC
843f661 to
dcd2665
Compare
But then the call can fail and we return a meaningful error to the user, right?
All in all, I might still be missing something, but I don't understand why we can't just open the handles and return a meaningful error if that fails (like 'out of fds' vs 'cpu doesn't support hw counters' vs 'perf not enabled in kernel'). That'd be ideal I think?
Here's a simple example: if running in a virtual machine, you may not have access to the underlying hardware counters but may still have access to software counters from the kernel. We can make it fail fast, but that makes the perf collector basically worthless for use inside a VM. Alternatively, it could make a best effort to set up all the relevant profilers and fail only if none are available. The other tricky part with errors returned from
Yeah, that, plus logging the ones that can't be set up, sounds reasonable. I'd get rid of the file handle check here though and return an error if out of fds. @SuperQ wdyt?
dcd2665 to
0827fab
Compare
I think this needs rebasing to pick up the macOS fix.
…le limits Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
0827fab to
f9ee4de
Compare
…le limits Origin: prometheus#2190 Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
This changes the perf collector to handle partial failures even better by checking the HasProfilers method on the various profiler interfaces. The configuration of the profiler interfaces has changed so that an optional bitmask of profilers can be set. In theory this could allow much finer control over which profilers are enabled, but that is beyond the scope of this change. The profiler interfaces have also changed so that, in theory, object pools could be used to reduce memory allocations.
Tested running locally and then:
Signed-off-by: Daniel Hodges hodges.daniel.scott@gmail.com