[AGENTRUN-558] Rust checks code - Add ability to run Rust-based checks through shared libraries #42351
Static quality checks
✅ Please find below the results from static quality gates.
Successful checks
On-wire sizes (compressed)
Regression Detector Results
Metrics dashboard
Baseline: c7d9740
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.22 | [-1.74, +4.18] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +1.46 | [+0.01, +2.91] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_cpu | % cpu utilization | +1.22 | [-1.74, +4.18] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.64 | [+0.43, +0.86] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.50 | [+0.29, +0.72] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.34 | [+0.25, +0.43] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.31 | [+0.16, +0.46] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.19 | [+0.13, +0.25] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.03, +0.05] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.48, +0.49] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.01 | [-0.20, +0.21] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.00 | [-0.23, +0.24] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.00 | [-0.39, +0.39] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.09, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.13, +0.12] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.41, +0.40] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.14, +0.10] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.04 | [-0.08, -0.00] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.13 | [-0.19, -0.07] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.25 | [-0.31, -0.18] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.26 | [-0.43, -0.10] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.32 | [-0.39, -0.25] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.35 | [-0.39, -0.30] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.13 | [-1.19, -1.06] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
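The three criteria above can be sketched as a small predicate. This is only an illustration of the rule as described, not the detector's actual code: the `Experiment` struct, its field names, and `is_regression` are hypothetical, and the `erratic` flag is assumed to come from the experiment's configuration since it is not shown in the tables.

```rust
/// One row of the detector table, reduced to the fields the rule needs.
/// Hypothetical type: names are chosen for this sketch only.
pub struct Experiment {
    pub delta_mean_pct: f64, // "Δ mean %"
    pub ci_low: f64,         // lower bound of "Δ mean % CI"
    pub ci_high: f64,        // upper bound of "Δ mean % CI"
    pub erratic: bool,       // assumed to come from the experiment configuration
}

/// Returns true only when all three criteria hold.
pub fn is_regression(e: &Experiment, tolerance_pct: f64) -> bool {
    // Criterion 1: the effect size reaches the tolerance.
    let big_enough = e.delta_mean_pct.abs() >= tolerance_pct;
    // Criterion 2: the confidence interval lies strictly on one side of zero.
    let ci_excludes_zero = e.ci_low > 0.0 || e.ci_high < 0.0;
    // Criterion 3: the experiment is not marked erratic.
    big_enough && ci_excludes_zero && !e.erratic
}
```

With the 5.00% tolerance used here, the quality_gate_logs row (+1.46, CI [+0.01, +2.91]) is not flagged: its interval excludes zero, but the effect size is below the tolerance.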
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (target internal profiling)
Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle_all_features | baseline | 11 (x4) | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 11 (x3) | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
…braries (#39676)

### What does this PR do?

⚠️ This new feature is experimental ⚠️

This PR introduces a new way of running checks, through shared libraries. They are Rust-based and loaded at runtime by a new checks loader. You can read the documentation [here](https://datadoghq.atlassian.net/wiki/spaces/ARUN/pages/5479301643/Running+shared+library+checks+in+the+Agent).

The Rust-based checks API is minimal for now: only the submit functions are available. The API can be expanded quite easily by adding new fields to the C structure `aggregator_t`.

This PR also moves the Go submit functions (like `SubmitMetric`) from the collector Python package to a new package (named `aggregator`). That way, both Python and shared library checks can use callbacks from this package. These Go submit functions were in the Python package only because Python checks were the only ones using them, but their scope is larger than just Python checks.

### Motivation

Provide a new way of writing checks to improve Agent performance and to rely a bit less on the Python runtime.

### Describe how you validated your changes

A few unit tests cover the new checks loader and the shared library checks implementation (for the Go part). An e2e test for Linux and Windows loads and runs a simple shared library check that submits one metric.

### Possible Drawbacks / Trade-offs

### Additional Notes

The Rust part of this feature (where shared libraries are compiled from) is on this PR:
- #42351

Co-authored-by: pgimalac <pierre.gimalac@datadoghq.com>
Co-authored-by: maxime.chambre <maxime.chambre@datadoghq.com>
08f686b to 7f1f1c1
001168a to 8e6885e
8e6885e to 90af960
… sure that the content can't be overwritten
…e number of submitted payloads is print at the end
This pull request has been automatically marked as stale because it has not had activity in the past 15 days. It will be closed in 30 days if no further activity occurs. If this pull request is still relevant, adding a comment or pushing new commits will keep it open. Also, you can always reopen the pull request if you missed the window. Thank you for your contributions!
This pull request was automatically closed because it has been stale for 15 days with no activity. If this pull request is still relevant, please reopen it or create a new pull request with updated information. Thanks!
NOTE: This PR contains the draft for `http_check` in Rust. It works well for a few use cases.
What does this PR do?
This PR adds code to write and compile Rust-based checks, with a simple Rust check as an example.
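To make the idea concrete, here is a minimal sketch of what a Rust check compiled as a shared library could look like. It is not the code from this PR: the `Aggregator` struct only mirrors the idea of a C structure of submit callbacks, and its single field, the `run_check` entry point, its signature, its return codes, and the metric name are all assumptions made for illustration.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_double, c_int};

/// Hypothetical C-compatible table of callbacks handed to the check by the
/// loader. The field name and signature are assumptions for this sketch.
#[repr(C)]
pub struct Aggregator {
    pub submit_metric: extern "C" fn(check_id: *const c_char, name: *const c_char, value: c_double),
}

/// Hypothetical entry point a loader could resolve in the shared library.
/// In a real `cdylib` it would also need the `no_mangle` attribute so the
/// symbol keeps this exact name. Returns 0 on success, 1 on invalid input.
pub extern "C" fn run_check(check_id: *const c_char, aggregator: *const Aggregator) -> c_int {
    if check_id.is_null() || aggregator.is_null() {
        return 1;
    }
    // SAFETY: the pointer was checked for null above; the caller is assumed
    // to keep the callback table alive for the duration of this call.
    let aggregator = unsafe { &*aggregator };
    let name = CString::new("example.rust_check.runs").expect("metric name has no NUL byte");
    // Submit a single metric through the callback, as a simple check would.
    (aggregator.submit_metric)(check_id, name.as_ptr(), 1.0);
    0
}
```

Passing callbacks through a `#[repr(C)]` struct keeps the boundary plain C, so under this assumption the host side could extend the API by adding fields without changing the entry point.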
Motivation
Describe how you validated your changes
Additional Notes