Generate CPU Enhanced Metrics#430

Merged
shreyamalpani merged 10 commits into jordan.gonzalez/bottlecap/universal-instrumentation from shreya.malpani/cpu-enhanced-metrics
Nov 6, 2024

Conversation

@shreyamalpani (Contributor)

What does this PR do?

This PR introduces eight new enhanced Lambda metrics related to CPU usage. Each metric is emitted once per invocation.

Three of the metrics represent CPU time spent running the lambda function:

  • aws.lambda.enhanced.cpu_system_time - time spent by the CPU running in kernel mode
  • aws.lambda.enhanced.cpu_user_time - time spent by the CPU running in user mode
  • aws.lambda.enhanced.cpu_total_time - sum of aws.lambda.enhanced.cpu_system_time and aws.lambda.enhanced.cpu_user_time

The other five metrics represent CPU utilization by the lambda function:

  • aws.lambda.enhanced.cpu_total_utilization_pct - total CPU utilization of the function, as a normalized percent (e.g. 35%)
  • aws.lambda.enhanced.cpu_total_utilization - total CPU utilization of the function, as normalized cores (e.g. 1.2 cores)
  • aws.lambda.enhanced.num_cores - total number of cores available in the environment (e.g. 4 cores)
  • aws.lambda.enhanced.cpu_min_utilization - CPU utilization on the least utilized core, as a normalized percent (e.g. 10%)
  • aws.lambda.enhanced.cpu_max_utilization - CPU utilization on the most utilized core, as a normalized percent (e.g. 80%)
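For intuition, the utilization metrics can be thought of as ratios of busy to total CPU time between two counter samples. A minimal sketch of the math (all names and formulas here are illustrative, not the PR's actual code):

```rust
// Illustrative only: hypothetical per-sample CPU counters (in ms), as might be
// read from /proc/stat at the start and end of an interval.
struct CpuSample {
    user_ms: f64,
    system_ms: f64,
    idle_ms: f64,
    num_cores: f64,
}

// Busy fraction of the interval across all cores, as a normalized percent (0-100).
fn total_utilization_pct(start: &CpuSample, end: &CpuSample) -> f64 {
    let busy = (end.user_ms + end.system_ms) - (start.user_ms + start.system_ms);
    let idle = end.idle_ms - start.idle_ms;
    100.0 * busy / (busy + idle)
}

// The same quantity expressed as normalized cores (e.g. 0.8 of 2 available cores).
fn total_utilization_cores(start: &CpuSample, end: &CpuSample) -> f64 {
    let busy = (end.user_ms + end.system_ms) - (start.user_ms + start.system_ms);
    let idle = end.idle_ms - start.idle_ms;
    // (busy + idle) / num_cores is the wall-clock length of the interval.
    busy / ((busy + idle) / start.num_cores)
}

fn main() {
    let start = CpuSample { user_ms: 0.0, system_ms: 0.0, idle_ms: 0.0, num_cores: 2.0 };
    let end = CpuSample { user_ms: 300.0, system_ms: 100.0, idle_ms: 600.0, num_cores: 2.0 };
    println!("{}% / {} cores", total_utilization_pct(&start, &end), total_utilization_cores(&start, &end));
}
```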

Additional Notes

  • The CPU time metrics are sent after the PlatformReport event (similar to how the network metrics are sent) to match the metric values reported by Lambda Insights
  • The CPU utilization metrics are not based on Lambda Insights; they are new metrics that were added to the Go agent. These metrics are sent after the PlatformRuntimeDone event (as they are in the Go agent) to avoid accounting for additional idle time after this event.
  • The CPU utilization metrics follow this metric format: CPU Utilization Metric Representation

Testing

  • Build the Lambda extension
  • Create a Lambda function with the extension built in the previous step
  • Turn on Lambda Insights / add the Lambda Insights extension to your function
  • Invoke the function
  • View the new enhanced metrics in the Metrics Explorer or this dashboard
  • Compare with the Lambda Insights metrics (for the CPU time metrics)
  • Verify the CPU utilization metrics by comparing the numbers for an invocation using the Go agent vs. an invocation using the Lambda extension, checking that they land in roughly the same range for that function

@shreyamalpani shreyamalpani requested a review from a team as a code owner October 28, 2024 21:54
Comment thread bottlecap/src/lifecycle/invocation/processor.rs Outdated
Comment on lines +466 to 471

if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_cpu_utilization_enhanced_metrics(offsets.cpu_offset, offsets.uptime_offset);
}

break;
Contributor

Why all the way here? It looks unnecessary to do this; what's the difference between doing it here and where it's created?

Contributor Author

When these metrics are sent after the PlatformReport event, they include a lot of idle time after the actual invocation, so the CPU utilization ratio becomes very low relative to the idle time. I tested this with a function: the first invocation showed around 10%, but all subsequent invocations reported varied numbers, sometimes less than 1%, because of the idle time between invocations. When I tried sending after the PlatformRuntimeDone event instead, it consistently reported around 20%, which is similar to the numbers in the Go agent.
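The dilution effect described here is easy to see with toy numbers (all values below are made up for illustration):

```rust
// Toy illustration: the same 200 ms of busy CPU time yields very different
// utilization depending on how much idle time the measurement window includes.
fn utilization_pct(busy_ms: f64, idle_ms: f64) -> f64 {
    100.0 * busy_ms / (busy_ms + idle_ms)
}

fn main() {
    // Window closed at PlatformRuntimeDone: little idle time has accrued.
    assert_eq!(utilization_pct(200.0, 800.0), 20.0);
    // Window closed at PlatformReport, after ~20 s of idle between invocations:
    // the ratio collapses to 1%.
    assert_eq!(utilization_pct(200.0, 19_800.0), 1.0);
}
```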

Comment on lines +486 to 493
if let Some(duration) = post_runtime_duration_ms {
    lambda_enhanced_metrics.set_post_runtime_duration_metric(duration);
}
if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_network_enhanced_metrics(offsets.network_offset);
    lambda_enhanced_metrics.set_cpu_time_enhanced_metrics(offsets.cpu_offset);
}
drop(p);
Contributor

Not sure I understand; why are we not doing the same as before?

Contributor Author

I realized that we were returning None for both post_runtime_duration_ms and enhanced_metric_data whenever we could not calculate post_runtime_duration_ms because context.runtime_duration_ms was 0. Changed it so that even if we can't calculate post_runtime_duration_ms, we still try to send the enhanced metrics.
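A small sketch of the decoupled behavior described above, with hypothetical simplified types (the real code works on the extension's metric structs):

```rust
// Sketch only: handle each Option independently, so a missing post-runtime
// duration no longer suppresses the enhanced metrics.
fn metrics_to_emit(post_runtime_duration_ms: Option<f64>, enhanced_metric_data: Option<u64>) -> Vec<String> {
    let mut emitted = Vec::new();
    if let Some(duration) = post_runtime_duration_ms {
        emitted.push(format!("post_runtime_duration:{duration}"));
    }
    if let Some(data) = enhanced_metric_data {
        emitted.push(format!("enhanced_metrics:{data}"));
    }
    emitted
}

fn main() {
    // Even with no post-runtime duration, the enhanced metrics still go out.
    assert_eq!(metrics_to_emit(None, Some(7)), vec!["enhanced_metrics:7"]);
}
```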

Comment thread bottlecap/src/proc/clock.rs Outdated
Comment on lines +1 to +2
use libc;
use std::io;
Contributor

What's the justification for this crate? Why the libc clock and not the standard clock?

Contributor Author

I think we need libc to accurately get the number of clock ticks per second so that the CPU time differences can be converted to milliseconds. I'm not sure how we could get this using the standard clock.
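For context, the conversion the tick rate is needed for looks roughly like this (a sketch with a hypothetical helper, not the PR's exact code):

```rust
// /proc files report CPU times in clock ticks; dividing by the ticks-per-second
// value from sysconf(_SC_CLK_TCK) (commonly 100 on Linux) converts to seconds,
// and multiplying by 1000 gives milliseconds.
fn ticks_to_ms(ticks: f64, clk_tck: f64) -> f64 {
    (1000.0 * ticks) / clk_tck
}

fn main() {
    // 250 ticks at 100 ticks/second is 2.5 seconds, i.e. 2500 ms.
    assert_eq!(ticks_to_ms(250.0, 100.0), 2500.0);
}
```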

Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/mod.rs

for line in reader.lines() {
    let line = line?;
    let mut values = line.split_whitespace();
Contributor

Can't you just do line?.split_whitespace()?

Contributor Author

I get an error that the temporary value of line? is dropped while borrowed, with a suggestion to use a let binding.
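A minimal standalone reproduction of that error and the let-binding fix (first_label is a made-up helper for illustration):

```rust
use std::io::{self, BufRead, Cursor};

// Returns the first whitespace-separated token of the first line, if any.
fn first_label(input: &str) -> io::Result<Option<String>> {
    let reader = Cursor::new(input);
    for line in reader.lines() {
        // let mut values = line?.split_whitespace();
        // ^ error[E0716]: temporary value dropped while borrowed -- the String
        //   produced by `line?` dies at the end of the statement, while
        //   `values` still borrows from it.
        let line = line?; // the let binding keeps the String alive long enough
        let mut values = line.split_whitespace();
        return Ok(values.next().map(str::to_string));
    }
    Ok(None)
}

fn main() {
    assert_eq!(first_label("cpu 100 0 50 300\n").unwrap(), Some("cpu".to_string()));
}
```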

Contributor

As a tip for next time: blocks of code like this can usually be written in a functional style. Asking ChatGPT, the refactor is instantaneous:

let cpu_data = reader.lines()
    .filter_map(|line| line.ok())
    .fold(CPUData {
        total_user_time_ms: 0.0,
        total_system_time_ms: 0.0,
        total_idle_time_ms: 0.0,
        individual_cpu_idle_times: HashMap::new(),
    }, |mut cpu_data, line| {
        // split inside the fold so the borrowed `line` lives long enough
        let mut values = line.split_whitespace();
        let Some(label) = values.next() else { return cpu_data; };
        match label {
            "cpu" => {
                let user: Option<f64> = values.next().and_then(|s| s.parse().ok());
                values.next(); // skip "nice"
                let system: Option<f64> = values.next().and_then(|s| s.parse().ok());
                let idle: Option<f64> = values.next().and_then(|s| s.parse().ok());

                if let (Some(user_val), Some(system_val), Some(idle_val)) = (user, system, idle) {
                    cpu_data.total_user_time_ms = (1000.0 * user_val) / clktck;
                    cpu_data.total_system_time_ms = (1000.0 * system_val) / clktck;
                    cpu_data.total_idle_time_ms = (1000.0 * idle_val) / clktck;
                }
            }
            label if label.starts_with("cpu") => {
                let idle: Option<f64> = values.nth(3).and_then(|s| s.parse().ok());

                if let Some(idle_val) = idle {
                    cpu_data.individual_cpu_idle_times.insert(label.to_string(), (1000.0 * idle_val) / clktck);
                }
            }
            _ => {}
        }
        cpu_data
    });

the pros are:

  • usually more legible (one filter/action at a time)
  • more robust (FnOnce ensures things are used just once)
  • it's a zero-cost abstraction
  • less indentation

Comment thread bottlecap/src/lifecycle/invocation/context.rs
Contributor

@duncanista left a comment

Left some comments

Comment thread bottlecap/Cargo.toml Outdated
Comment thread bottlecap/src/bin/bottlecap/main.rs Outdated
Comment thread bottlecap/src/lifecycle/invocation/context.rs Outdated
enhanced_metric_data = p.on_platform_runtime_done(
    &request_id,
    metrics.duration_ms,
    config.clone(),
Contributor

All this stuff, like configs and the tags provider, could be set once at initialization of the invocation_processor rather than cloned continuously.

Contributor Author

True, I'll be doing some refactoring in my next PR, so I can look into that then

Comment thread bottlecap/src/proc/clock.rs Outdated
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
Contributor

Does this change over time during an invocation? Or can we read it only once at the start?
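For what it's worth, CLK_TCK is a static kernel constant, so it could indeed be read once and cached. A sketch using std::sync::OnceLock, where the init closure stands in for the real sysconf(_SC_CLK_TCK) call (100 is the common Linux default):

```rust
use std::sync::OnceLock;

static CLK_TCK: OnceLock<u64> = OnceLock::new();

// The expensive lookup runs at most once; every later call returns the cached value.
fn clk_tck_cached() -> u64 {
    *CLK_TCK.get_or_init(|| {
        // Stand-in for the real sysconf(_SC_CLK_TCK) lookup.
        100
    })
}

fn main() {
    // Repeated calls all see the same cached value.
    assert_eq!(clk_tck_cached(), clk_tck_cached());
}
```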

Comment thread bottlecap/src/proc/clock.rs Outdated
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
Contributor

I'm not sure whether this code is going to cause problems: we are only checking if it's -1, and there are no comments here for unexpected/other behavior. I'm blocking the PR so we can talk about this in person.

Contributor

@duncanista left a comment

Blocking so we don't merge this and can talk more about it in person; I would like to see if there are other alternatives we could try.

Let's ask Rust experts what we can do around the clock code and thoroughly test it with an alpine container too!

Comment thread bottlecap/src/proc/mod.rs
Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/clock.rs Outdated
Comment on lines +5 to +21
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    match sysconf(SysconfVar::CLK_TCK) {
        Ok(Some(clk_tck)) if clk_tck > 0 => Ok(clk_tck as u64),
        _ => Err(io::Error::new(
            io::ErrorKind::NotFound,
            "Could not find system clock ticks per second",
        )),
    }
}

#[cfg(target_os = "windows")]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    // Windows does not have this concept
    Ok(1)
}
Contributor

Do we still need the same type of constructor here now that you're not using the libc crate?

Contributor Author

The nix crate is a safer wrapper around the libc crate, so we would still need a similar structure.

Contributor

Cool, just wanted to make sure whether we still need the Windows variant, the clippy cast_sign_loss allow, etc.

Comment thread bottlecap/src/metrics/enhanced/lambda.rs
@shreyamalpani shreyamalpani merged commit 839d76f into jordan.gonzalez/bottlecap/universal-instrumentation Nov 6, 2024
@shreyamalpani shreyamalpani deleted the shreya.malpani/cpu-enhanced-metrics branch November 6, 2024 16:26
duncanista pushed a commit that referenced this pull request Nov 15, 2024
* send cpu metrics

* clippy fixes

* fixes

* set utilization metrics before flushing & format fixes

* added comment to explain utilization metrics calculation timing

* use nix instead of libc to get system clock

* update LICENSE-3rdparty.yml

* added comments to explain calculations

* clippy
duncanista pushed a commit that referenced this pull request Nov 19, 2024