Generate CPU Enhanced Metrics#430

Merged
shreyamalpani merged 10 commits into jordan.gonzalez/bottlecap/universal-instrumentation from shreya.malpani/cpu-enhanced-metrics
Nov 6, 2024

Conversation

@shreyamalpani (Contributor)

What does this PR do?

This PR introduces eight new enhanced Lambda metrics related to CPU usage. Each metric is emitted once per invocation.

Three of the metrics represent CPU time spent running the lambda function:

  • aws.lambda.enhanced.cpu_system_time - time spent by the CPU running in kernel mode
  • aws.lambda.enhanced.cpu_user_time - time spent by the CPU running in user mode
  • aws.lambda.enhanced.cpu_total_time - sum of aws.lambda.enhanced.cpu_system_time and aws.lambda.enhanced.cpu_user_time

The other five metrics represent CPU utilization by the lambda function:

  • aws.lambda.enhanced.cpu_total_utilization_pct - total CPU utilization of the function, as a normalized percent (e.g. 35%)
  • aws.lambda.enhanced.cpu_total_utilization - total CPU utilization of the function, as normalized cores (e.g. 1.2 cores)
  • aws.lambda.enhanced.num_cores - total number of cores available in the environment (e.g. 4 cores)
  • aws.lambda.enhanced.cpu_min_utilization - CPU utilization on the least utilized core, as a normalized percent (e.g. 10%)
  • aws.lambda.enhanced.cpu_max_utilization - CPU utilization on the most utilized core, as a normalized percent (e.g. 80%)
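For intuition, the utilization metrics can be thought of as ratios of busy to total CPU time between two counter samples. A minimal sketch of the math (all names and formulas here are illustrative, not the PR's actual code):

```rust
// Illustrative only: hypothetical per-sample CPU counters (in ms), as might be
// read from /proc/stat at the start and end of an interval.
struct CpuSample {
    user_ms: f64,
    system_ms: f64,
    idle_ms: f64,
    num_cores: f64,
}

// Busy fraction of the interval across all cores, as a normalized percent (0-100).
fn total_utilization_pct(start: &CpuSample, end: &CpuSample) -> f64 {
    let busy = (end.user_ms + end.system_ms) - (start.user_ms + start.system_ms);
    let idle = end.idle_ms - start.idle_ms;
    100.0 * busy / (busy + idle)
}

// The same quantity expressed as normalized cores (e.g. 0.8 of 2 available cores).
fn total_utilization_cores(start: &CpuSample, end: &CpuSample) -> f64 {
    let busy = (end.user_ms + end.system_ms) - (start.user_ms + start.system_ms);
    let idle = end.idle_ms - start.idle_ms;
    // (busy + idle) / num_cores is the wall-clock length of the interval.
    busy / ((busy + idle) / start.num_cores)
}

fn main() {
    let start = CpuSample { user_ms: 0.0, system_ms: 0.0, idle_ms: 0.0, num_cores: 2.0 };
    let end = CpuSample { user_ms: 300.0, system_ms: 100.0, idle_ms: 600.0, num_cores: 2.0 };
    println!("{}% / {} cores", total_utilization_pct(&start, &end), total_utilization_cores(&start, &end));
}
```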

Additional Notes

  • The CPU time metrics are sent after the PlatformReport event (similar to how the network metrics are sent) to match the metric values reported by Lambda Insights
  • The CPU utilization metrics are not based on Lambda Insights; they are new metrics that were added to the Go agent. These metrics are sent after the PlatformRuntimeDone event (as they are in the Go agent) to avoid accounting for additional idle time after this event.
  • The CPU utilization metrics follow this metric format: CPU Utilization Metric Representation

Testing

  • Build the Lambda extension
  • Create a Lambda function with the extension built in the previous step
  • Turn on Lambda Insights / add the Lambda Insights extension to your function
  • Invoke the function
  • View the new enhanced metrics in the Metrics Explorer or this dashboard
  • Compare with the Lambda Insights metrics (for the CPU time metrics)
  • Verify the CPU utilization metrics by comparing the numbers for an invocation using the Go agent vs. an invocation using the Lambda extension, checking that they land in roughly the same range for that function

@shreyamalpani shreyamalpani requested a review from a team as a code owner October 28, 2024 21:54
Comment thread bottlecap/src/lifecycle/invocation/processor.rs Outdated
Comment on lines +466 to 471

if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_cpu_utilization_enhanced_metrics(offsets.cpu_offset, offsets.uptime_offset);
}

break;
Contributor

Why all the way here? It looks unnecessary to do this; what's the difference between doing it here and where it's created?

Contributor Author

When these metrics are sent after the PlatformReport event, they include a lot of idle time after the actual invocation, so the CPU utilization ratio becomes very low relative to the idle time. I tested this with a function: the first invocation showed around 10%, but all subsequent invocations reported varied numbers, sometimes less than 1%, because of the idle time between invocations. When I tried sending after the PlatformRuntimeDone event instead, it consistently reported around 20%, which is similar to the numbers in the Go agent.
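The dilution effect described here is easy to see with toy numbers (all values below are made up for illustration):

```rust
// Toy illustration: the same 200 ms of busy CPU time yields very different
// utilization depending on how much idle time the measurement window includes.
fn utilization_pct(busy_ms: f64, idle_ms: f64) -> f64 {
    100.0 * busy_ms / (busy_ms + idle_ms)
}

fn main() {
    // Window closed at PlatformRuntimeDone: little idle time has accrued.
    assert_eq!(utilization_pct(200.0, 800.0), 20.0);
    // Window closed at PlatformReport, after ~20 s of idle between invocations:
    // the ratio collapses to 1%.
    assert_eq!(utilization_pct(200.0, 19_800.0), 1.0);
}
```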

Comment on lines +486 to 493
if let Some(duration) = post_runtime_duration_ms {
    lambda_enhanced_metrics.set_post_runtime_duration_metric(duration);
}
if let Some(offsets) = enhanced_metric_data {
    lambda_enhanced_metrics.set_network_enhanced_metrics(offsets.network_offset);
    lambda_enhanced_metrics.set_cpu_time_enhanced_metrics(offsets.cpu_offset);
}
drop(p);
Contributor

Not sure I understand; why are we not doing the same as before?

Contributor Author

I realized that we were returning None for both post_runtime_duration_ms and enhanced_metric_data whenever we could not calculate post_runtime_duration_ms because context.runtime_duration_ms was 0. Changed it so that even if we can't calculate post_runtime_duration_ms, we still try to send the enhanced metrics.
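A small sketch of the decoupled behavior described above, with hypothetical simplified types (the real code works on the extension's metric structs):

```rust
// Sketch only: handle each Option independently, so a missing post-runtime
// duration no longer suppresses the enhanced metrics.
fn metrics_to_emit(post_runtime_duration_ms: Option<f64>, enhanced_metric_data: Option<u64>) -> Vec<String> {
    let mut emitted = Vec::new();
    if let Some(duration) = post_runtime_duration_ms {
        emitted.push(format!("post_runtime_duration:{duration}"));
    }
    if let Some(data) = enhanced_metric_data {
        emitted.push(format!("enhanced_metrics:{data}"));
    }
    emitted
}

fn main() {
    // Even with no post-runtime duration, the enhanced metrics still go out.
    assert_eq!(metrics_to_emit(None, Some(7)), vec!["enhanced_metrics:7"]);
}
```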

Comment thread bottlecap/src/proc/clock.rs Outdated
Comment on lines +1 to +2
use libc;
use std::io;
Contributor

What's the justification for this crate? Why the libc clock and not the standard clock?

Contributor Author

I think we need libc to accurately get the number of clock ticks per second so that the CPU time differences can be converted to milliseconds. I'm not sure how we could get this using the standard clock.
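For context, the conversion the tick rate is needed for looks roughly like this (a sketch with a hypothetical helper, not the PR's exact code):

```rust
// /proc files report CPU times in clock ticks; dividing by the ticks-per-second
// value from sysconf(_SC_CLK_TCK) (commonly 100 on Linux) converts to seconds,
// and multiplying by 1000 gives milliseconds.
fn ticks_to_ms(ticks: f64, clk_tck: f64) -> f64 {
    (1000.0 * ticks) / clk_tck
}

fn main() {
    // 250 ticks at 100 ticks/second is 2.5 seconds, i.e. 2500 ms.
    assert_eq!(ticks_to_ms(250.0, 100.0), 2500.0);
}
```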

Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/mod.rs

for line in reader.lines() {
    let line = line?;
    let mut values = line.split_whitespace();
Contributor

Can't you just do line?.split_whitespace()?

Contributor Author

I get an error that the temporary value of line? is dropped while borrowed, with a suggestion to use a let binding.
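A minimal standalone reproduction of that error and the let-binding fix (first_label is a made-up helper for illustration):

```rust
use std::io::{self, BufRead, Cursor};

// Returns the first whitespace-separated token of the first line, if any.
fn first_label(input: &str) -> io::Result<Option<String>> {
    let reader = Cursor::new(input);
    for line in reader.lines() {
        // let mut values = line?.split_whitespace();
        // ^ error[E0716]: temporary value dropped while borrowed -- the String
        //   produced by `line?` dies at the end of the statement, while
        //   `values` still borrows from it.
        let line = line?; // the let binding keeps the String alive long enough
        let mut values = line.split_whitespace();
        return Ok(values.next().map(str::to_string));
    }
    Ok(None)
}

fn main() {
    assert_eq!(first_label("cpu 100 0 50 300\n").unwrap(), Some("cpu".to_string()));
}
```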

Contributor

As a tip for next time: blocks of code like this can usually be written in a functional style. Asking ChatGPT, the refactor is instantaneous:

let cpu_data = reader.lines()
    .filter_map(|line| line.ok())
    .fold(CPUData {
        total_user_time_ms: 0.0,
        total_system_time_ms: 0.0,
        total_idle_time_ms: 0.0,
        individual_cpu_idle_times: HashMap::new(),
    }, |mut cpu_data, line| {
        // split inside the fold so the borrowed `line` lives long enough
        let mut values = line.split_whitespace();
        let Some(label) = values.next() else { return cpu_data; };
        match label {
            "cpu" => {
                let user: Option<f64> = values.next().and_then(|s| s.parse().ok());
                values.next(); // skip "nice"
                let system: Option<f64> = values.next().and_then(|s| s.parse().ok());
                let idle: Option<f64> = values.next().and_then(|s| s.parse().ok());

                if let (Some(user_val), Some(system_val), Some(idle_val)) = (user, system, idle) {
                    cpu_data.total_user_time_ms = (1000.0 * user_val) / clktck;
                    cpu_data.total_system_time_ms = (1000.0 * system_val) / clktck;
                    cpu_data.total_idle_time_ms = (1000.0 * idle_val) / clktck;
                }
            }
            label if label.starts_with("cpu") => {
                let idle: Option<f64> = values.nth(3).and_then(|s| s.parse().ok());

                if let Some(idle_val) = idle {
                    cpu_data.individual_cpu_idle_times.insert(label.to_string(), (1000.0 * idle_val) / clktck);
                }
            }
            _ => {}
        }
        cpu_data
    });

the pros are:

  • usually more legible (one filter/action at a time)
  • more robust (FnOnce ensures things are used just once)
  • it's a zero-cost abstraction
  • less indentation

Comment thread bottlecap/src/lifecycle/invocation/context.rs
Contributor

@duncanista left a comment

Left some comments

Comment thread bottlecap/Cargo.toml Outdated
Comment thread bottlecap/src/bin/bottlecap/main.rs Outdated
Comment thread bottlecap/src/lifecycle/invocation/context.rs Outdated
enhanced_metric_data = p.on_platform_runtime_done(
    &request_id,
    metrics.duration_ms,
    config.clone(),
Contributor

All this stuff, like configs and the tags provider, could be set once at initialization of the invocation_processor rather than cloned continuously.

Contributor Author

True, I'll be doing some refactoring in my next PR, so I can look into that then

Comment thread bottlecap/src/proc/clock.rs Outdated
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
Contributor

Does this change over time during an invocation? Or can we read it only once at the start?
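For what it's worth, CLK_TCK is a static kernel constant, so it could indeed be read once and cached. A sketch using std::sync::OnceLock, where the init closure stands in for the real sysconf(_SC_CLK_TCK) call (100 is the common Linux default):

```rust
use std::sync::OnceLock;

static CLK_TCK: OnceLock<u64> = OnceLock::new();

// The expensive lookup runs at most once; every later call returns the cached value.
fn clk_tck_cached() -> u64 {
    *CLK_TCK.get_or_init(|| {
        // Stand-in for the real sysconf(_SC_CLK_TCK) lookup.
        100
    })
}

fn main() {
    // Repeated calls all see the same cached value.
    assert_eq!(clk_tck_cached(), clk_tck_cached());
}
```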

Comment thread bottlecap/src/proc/clock.rs Outdated
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    let clk_tck = unsafe { libc::sysconf(libc::_SC_CLK_TCK) };
Contributor

I'm not sure whether this code is going to cause problems: we are only checking if it's -1, and there are no comments here for unexpected/other behavior. I'm blocking the PR so we can talk about this in person.

Contributor

@duncanista left a comment

Blocking so we don't merge this and can talk more about it in person; I would like to see if there are other alternatives we could try.

Let's ask Rust experts what we can do around the clock code and thoroughly test it with an alpine container too!

Comment thread bottlecap/src/proc/mod.rs
Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/mod.rs Outdated
Comment thread bottlecap/src/proc/clock.rs Outdated
Comment on lines +5 to +21
#[allow(clippy::cast_sign_loss)]
#[cfg(not(target_os = "windows"))]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    match sysconf(SysconfVar::CLK_TCK) {
        Ok(Some(clk_tck)) if clk_tck > 0 => Ok(clk_tck as u64),
        _ => Err(io::Error::new(
            io::ErrorKind::NotFound,
            "Could not find system clock ticks per second",
        )),
    }
}

#[cfg(target_os = "windows")]
pub fn get_clk_tck() -> Result<u64, io::Error> {
    // Windows does not have this concept
    Ok(1)
}
Contributor

Do we still need the same type of constructor here now that you're not using the libc crate?

Contributor Author

The nix crate is a safer wrapper around the libc crate, so we would still need a similar structure.

Contributor

Cool, just wanted to make sure whether we still need the Windows variant, the clippy cast_sign_loss allow, etc.

Comment thread bottlecap/src/metrics/enhanced/lambda.rs
@shreyamalpani shreyamalpani merged commit 839d76f into jordan.gonzalez/bottlecap/universal-instrumentation Nov 6, 2024
@shreyamalpani shreyamalpani deleted the shreya.malpani/cpu-enhanced-metrics branch November 6, 2024 16:26
duncanista pushed a commit that referenced this pull request Nov 15, 2024
* send cpu metrics

* clippy fixes

* fixes

* set utilization metrics before flushing & format fixes

* added comment to explain utilization metrics calculation timing

* use nix instead of libc to get system clock

* update LICENSE-3rdparty.yml

* added comments to explain calculations

* clippy
duncanista pushed a commit that referenced this pull request Nov 19, 2024