Generate CPU Enhanced Metrics#430
shreyamalpani merged 10 commits into `jordan.gonzalez/bottlecap/universal-instrumentation` from `shreya.malpani/cpu-enhanced-metrics`
Conversation
```rust
if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_cpu_utilization_enhanced_metrics(offsets.cpu_offset, offsets.uptime_offset);
}
// ...
break;
```
Why all the way here? It looks unnecessary to do this; what's the difference between doing it here versus where it's created?
When these metrics are sent after the PlatformReport event, they include a lot of idle time after the actual invocation, so the CPU utilization ratio becomes very low relative to the idle time. I tested this with a function: the first invocation showed around 10%, but subsequent invocations reported varying numbers, sometimes less than 1%, because of the idle time between invocations. When I tried sending after the PlatformRuntimeDone event, it consistently reported around 20%, which is similar to the numbers in the go agent.
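To make the timing issue concrete, here is a minimal sketch of a normalized utilization calculation; the struct and field names are illustrative, not the actual bottlecap types:

```rust
// Illustrative sketch (not the actual bottlecap types): utilization is the
// CPU busy-time delta divided by the wall-clock delta across all cores.
struct CpuSnapshot {
    busy_ms: f64,   // user + system CPU time summed over all cores
    uptime_ms: f64, // wall-clock uptime when the snapshot was taken
}

fn utilization_pct(start: &CpuSnapshot, end: &CpuSnapshot, num_cores: f64) -> f64 {
    let busy = end.busy_ms - start.busy_ms;
    let elapsed = (end.uptime_ms - start.uptime_ms) * num_cores;
    // Any idle time included in `elapsed` (e.g. waiting for the next
    // invocation after PlatformRuntimeDone) dilutes the ratio, which is
    // why the end snapshot is taken before that idle window.
    100.0 * busy / elapsed
}
```

Taking the end snapshot at PlatformRuntimeDone keeps the denominator close to the actual invocation window, which matches the ~20% the go agent reports.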
```rust
if let Some(duration) = post_runtime_duration_ms {
    lambda_enhanced_metrics.set_post_runtime_duration_metric(duration);
}
if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_network_enhanced_metrics(offsets.network_offset);
    lambda_enhanced_metrics.set_cpu_time_enhanced_metrics(offsets.cpu_offset);
}
drop(p);
```
Not sure I understand. Why are we not doing the same as before?
I realized that we were returning None for both post_runtime_duration_ms and enhanced_metric_data if we could not calculate post_runtime_duration_ms because context.runtime_duration_ms was 0. I changed it so that even if we can't calculate post_runtime_duration_ms, we still try to send the enhanced metrics.
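A hedged sketch of the fix described above, with made-up names (not the real bottlecap API): compute the two values independently so that one failing no longer drops the other.

```rust
// Hypothetical sketch: a runtime duration of 0 should only suppress the
// post-runtime duration, not the enhanced metric data. Names are
// illustrative, not the actual bottlecap functions.
fn runtime_done_results(
    runtime_duration_ms: f64,
    duration_ms: f64,
    offsets: Option<u64>,
) -> (Option<f64>, Option<u64>) {
    // Post-runtime duration is only meaningful when the runtime duration
    // was actually recorded (non-zero).
    let post_runtime_duration_ms = if runtime_duration_ms > 0.0 {
        Some(duration_ms - runtime_duration_ms)
    } else {
        None // can't calculate, but still return the offsets below
    };
    (post_runtime_duration_ms, offsets)
}
```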
```rust
use libc;
use std::io;
```
What's the justification for this crate? Why the libc clock and not the standard clock?
I think we need libc to accurately get the number of clock ticks per second so that the CPU time differences can be converted to milliseconds; I'm not sure how we could get this using the standard clock.
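The conversion itself is straightforward once the tick rate is known; a small sketch, with the platform-specific sysconf call left out:

```rust
// /proc/stat reports CPU times in clock ticks; converting to milliseconds
// requires ticks-per-second, which on Linux comes from
// libc::sysconf(libc::_SC_CLK_TCK) (commonly 100). The sysconf call is
// omitted here so the conversion itself stays portable.
fn ticks_to_ms(ticks: f64, clk_tck: u64) -> f64 {
    (1000.0 * ticks) / clk_tck as f64
}
```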
```rust
for line in reader.lines() {
    let line = line?;
    let mut values = line.split_whitespace();
```
Can't you just do `line?.split_whitespace()`?
I get an error that the temporary value of `line?` gets dropped while borrowing, and to use a let binding.
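A small self-contained reproduction of that borrow issue; the helper name is made up for illustration:

```rust
use std::io::{self, BufRead, BufReader};

// Illustrative helper (name made up): collects the first token of each line.
fn first_labels(input: &[u8]) -> io::Result<Vec<String>> {
    let reader = BufReader::new(input);
    let mut labels = Vec::new();
    for line in reader.lines() {
        // Writing `line?.split_whitespace()` fails with error E0716: the
        // temporary String returned by `line?` is dropped at the end of the
        // statement while the iterator still borrows from it. A let binding
        // keeps the String alive for the rest of the loop body.
        let line = line?;
        let mut values = line.split_whitespace();
        if let Some(label) = values.next() {
            labels.push(label.to_string());
        }
    }
    Ok(labels)
}
```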
As a tip for next time: blocks of code like this can usually be written in a functional style. Asking ChatGPT, the refactor is instantaneous:
```rust
let cpu_data = reader.lines()
    .filter_map(|line| line.ok())
    .filter_map(|line| {
        let mut values = line.split_whitespace();
        values.next().map(|label| (label, values))
    })
    .fold(CPUData {
        total_user_time_ms: 0.0,
        total_system_time_ms: 0.0,
        total_idle_time_ms: 0.0,
        individual_cpu_idle_times: HashMap::new(),
    }, |mut cpu_data, (label, mut values)| {
        match label {
            "cpu" => {
                let user: Option<f64> = values.next().and_then(|s| s.parse().ok());
                values.next(); // skip "nice"
                let system: Option<f64> = values.next().and_then(|s| s.parse().ok());
                let idle: Option<f64> = values.next().and_then(|s| s.parse().ok());
                if let (Some(user_val), Some(system_val), Some(idle_val)) = (user, system, idle) {
                    cpu_data.total_user_time_ms = (1000.0 * user_val) / clktck;
                    cpu_data.total_system_time_ms = (1000.0 * system_val) / clktck;
                    cpu_data.total_idle_time_ms = (1000.0 * idle_val) / clktck;
                }
            }
            label if label.starts_with("cpu") => {
                let idle: Option<f64> = values.nth(3).and_then(|s| s.parse().ok());
                if let Some(idle_val) = idle {
                    cpu_data.individual_cpu_idle_times.insert(label.to_string(), (1000.0 * idle_val) / clktck);
                }
            }
            _ => {}
        }
        cpu_data
    });
```
The pros are:
- usually more legible (one filter/action at a time)
- more robust (FnOnce ensures things are used just once)
- it's a zero-cost abstraction
- less indentation
```rust
enhanced_metric_data = p.on_platform_runtime_done(
    &request_id,
    metrics.duration_ms,
    config.clone(),
```
All this stuff like configs and the tags provider could be set once at initialization of the invocation_processor, rather than cloned continuously.
True. I'll be doing some refactoring in my next PR, so I can look into that then.
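One possible shape for that refactor, sketched with made-up types (not the actual bottlecap structs): hold the shared config in the processor behind an Arc, so any remaining clones are cheap pointer bumps rather than deep copies.

```rust
use std::sync::Arc;

// Made-up types for illustration; not the actual bottlecap structs.
struct Config {
    site: String,
}

struct InvocationProcessor {
    config: Arc<Config>, // set once when the processor is constructed
}

impl InvocationProcessor {
    fn new(config: Arc<Config>) -> Self {
        Self { config }
    }

    fn on_platform_runtime_done(&self) -> &str {
        // no per-event clone of the whole Config needed
        &self.config.site
    }
}
```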
```rust
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
```
Does this change over time during an invocation? Or can we read it only once at the start?
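_SC_CLK_TCK is a kernel constant for the lifetime of the process, so it could indeed be read once and cached; a sketch using std::sync::OnceLock, with the sysconf call stubbed out:

```rust
use std::sync::OnceLock;

// The tick rate does not change while the process is running, so it can be
// read once and reused. The real read would be the sysconf call from the
// diff above; it is stubbed here with the common Linux default of 100.
static CLK_TCK: OnceLock<u64> = OnceLock::new();

fn read_clk_tck() -> u64 {
    100 // stub standing in for sysconf(_SC_CLK_TCK)
}

fn clk_tck() -> u64 {
    // get_or_init runs read_clk_tck at most once, even across threads
    *CLK_TCK.get_or_init(read_clk_tck)
}
```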
```rust
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
```
I'm not sure if this code is going to cause problems; we are only checking if it's -1, and there are no comments here for unexpected/other behavior. I'm blocking the PR so we can talk about this in person.
duncanista left a comment:
Blocking so we don't merge this and can talk more about it in person; I would like to see if there are other alternatives we could try.
Let's ask Rust experts what we can do around the clock code, and thoroughly test it with an Alpine container too!
```rust
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    match sysconf(SysconfVar::CLK_TCK) {
        Ok(Some(clk_tck)) if clk_tck > 0 => Ok(clk_tck as u64),
        _ => Err(io::Error::new(
            io::ErrorKind::NotFound,
            "Could not find system clock ticks per second",
        )),
    }
}

#[cfg(target_os = "windows")]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    // Windows does not have this concept
    Ok(1)
}
```
Do we still need the same type of constructor here now that you're not using the libc crate?
The nix crate is a safer wrapper around the libc crate, so we would still need a similar structure.
Cool, I just wanted to make sure whether we still need the Windows variant, the clippy cast_sign_loss allow, etc.
Merged commit 839d76f into jordan.gonzalez/bottlecap/universal-instrumentation
Commits:
- send cpu metrics
- clippy fixes
- fixes
- set utilization metrics before flushing & format fixes
- added comment to explain utilization metrics calculation timing
- use nix instead of libc to get system clock
- update LICENSE-3rdparty.yml
- added comments to explain calculations
- clippy
What does this PR do?
This PR introduces eight new enhanced lambda metrics related to CPU usage. Each metric is emitted once per invocation.
Three of the metrics represent CPU time spent running the lambda function:
- `aws.lambda.enhanced.cpu_system_time` - time spent by the CPU running in kernel mode
- `aws.lambda.enhanced.cpu_user_time` - time spent by the CPU running in user mode
- `aws.lambda.enhanced.cpu_total_time` - sum of `aws.lambda.enhanced.cpu_system_time` and `aws.lambda.enhanced.cpu_user_time`

The other five metrics represent CPU utilization by the lambda function:
- `aws.lambda.enhanced.cpu_total_utilization_pct` - total CPU utilization of the function, as a normalized percent (e.g. 35%)
- `aws.lambda.enhanced.cpu_total_utilization` - total CPU utilization of the function, as normalized cores (e.g. 1.2 cores)
- `aws.lambda.enhanced.num_cores` - total number of cores available in the environment (e.g. 4 cores)
- `aws.lambda.enhanced.cpu_min_utilization` - CPU utilization on the least utilized core, as a normalized percent (e.g. 10%)
- `aws.lambda.enhanced.cpu_max_utilization` - CPU utilization on the most utilized core, as a normalized percent (e.g. 80%)

Additional Notes
- The CPU time metrics are sent after the `PlatformReport` event (similar to how the network metrics are being sent) to match the metric values reported by Lambda Insights.
- The CPU utilization metrics are sent after the `PlatformRuntimeDone` event (similar to how these metrics are being sent in the go agent) to avoid accounting for additional idle time after this event.

Testing