[AGENTRUN-866] Add ability to run Rust-based checks through shared libraries #39676
dd-mergequeue[bot] merged 144 commits into main
Conversation
… given, update cgo code in sharedlibrary package to pass the callbacks
Go Package Import Differences
Baseline: bb81be1
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: bb81be1
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.63 | [-4.68, +1.42] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.83 | [+0.62, +1.05] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.79 | [+0.68, +0.90] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.71 | [+0.63, +0.79] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.58 | [+0.52, +0.64] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.24 | [+0.17, +0.32] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.21 | [+0.15, +0.26] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.13 | [-0.03, +0.29] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.10 | [+0.06, +0.14] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.04 | [-0.00, +0.08] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.04 | [-0.35, +0.43] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.11, +0.14] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.07, +0.08] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.00 | [-0.41, +0.41] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.14, +0.13] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.06, +0.04] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.05 | [-0.42, +0.32] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.08 | [-0.23, +0.08] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.11 | [-0.31, +0.10] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.25 | [-0.48, -0.02] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.29 | [-1.77, +1.20] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.63 | [-0.83, -0.43] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.64 | [-0.71, -0.58] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -1.63 | [-4.68, +1.42] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we call a change in performance a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
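As an illustration, the three criteria combine into a single predicate. This is only the decision rule restated in code, not the Regression Detector's actual implementation; the thresholds mirror the numbers quoted above.

```rust
/// Restates the three regression criteria above. Illustrative only.
fn is_regression(delta_mean_pct: f64, ci_low: f64, ci_high: f64, erratic: bool) -> bool {
    let big_enough = delta_mean_pct.abs() >= 5.0; // effect size tolerance
    let ci_excludes_zero = ci_low > 0.0 || ci_high < 0.0; // 90.00% CI does not contain zero
    big_enough && ci_excludes_zero && !erratic
}

fn main() {
    // docker_containers_cpu from the tables above: Δ mean % = -1.63,
    // CI [-4.68, +1.42]. |Δ| < 5 and the CI contains zero, so no regression.
    assert!(!is_regression(-1.63, -4.68, 1.42, false));
    println!("docker_containers_cpu: not a regression");
}
```

For example, even a result of -6% with CI [-7, -5] would still be ignored if the experiment's configuration marks it "erratic".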
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
Static quality checks
✅ Please find below the results from static quality gates.
Successful checks
Info
…passed through json string now
…to prevent confusion between senders when submitting metrics
…de sense in the context of shared libraries
…hambre/shared-library-check-cgo
pgimalac
left a comment
LGTM, mostly nitpicks!
Checking in the dll and so files for the e2e test is not great, though; we should look for a workaround.
… in `aggregator` package util function signatures to be used in `python` package
… check loader as it wasn't used outside the package
/merge
View all feedback in Devflow UI.
…ample check (#45815)

### What does this PR do?
This PR adds the core part of Rust-based checks and an example of a Rust-based shared library check. The core part acts as the glue code between the check implementation in Rust and the Agent (running the check, passing configuration, sending metrics, ...). Each check includes the core part in its implementation.

### Motivation
Provide the source code for shared library checks used by the shared library loader implemented in:
- #39676

### Describe how you validated your changes
- Compile the example check (package `example`) and have it loaded, scheduled and run by the Agent.
- Verify that the metrics, service checks and events are correctly sent to the backend. They are defined in `pkg/collector/sharedlibrary/rustchecks/checks/example/src/check.rs`.

### Additional Notes
This code isn't used in the build or in any jobs; it's here to experiment with Rust-based shared library checks.

Co-authored-by: maxime.chambre <maxime.chambre@datadoghq.com>
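The glue described here could expose C-ABI entry points along these lines. The symbol names (`configure_check`, `run_check`) and the return conventions are assumptions made for illustration, not the actual API under `rustchecks/`; the commits only mention that configuration is passed as a JSON string.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Hypothetical C-ABI entry points a check core might export to the Agent.

#[no_mangle]
pub extern "C" fn configure_check(json_config: *const c_char) -> i32 {
    // The Agent passes the instance configuration as a JSON string.
    let cfg = unsafe { CStr::from_ptr(json_config) }.to_string_lossy();
    // A real core would deserialize `cfg`; this sketch only rejects an
    // empty configuration, returning non-zero on error.
    if cfg.trim().is_empty() { 1 } else { 0 }
}

#[no_mangle]
pub extern "C" fn run_check() -> i32 {
    // A real core would invoke the aggregator callbacks here to submit
    // metrics, service checks and events, returning non-zero on failure.
    0
}

fn main() {
    let cfg = CString::new(r#"{"min_collection_interval": 15}"#).unwrap();
    assert_eq!(configure_check(cfg.as_ptr()), 0);
    assert_eq!(run_check(), 0);
    println!("example check configured and ran");
}
```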
What does this PR do?
This PR introduces a new way of running checks, through shared libraries. They are Rust-based and loaded at runtime by a new checks loader. You can read the documentation here.
The Rust-based checks API is minimal for now; only the submit functions are available. The API can be expanded quite easily by adding new fields to the C structure `aggregator_t`.
Also, this PR moves the Go submit functions (like `SubmitMetric`) from the collector `python` package to a new package (named `aggregator`). That way, both Python and shared library checks can use callbacks from this package. These Go submit functions lived in the `python` package only because Python checks were the only ones using them, but their scope is larger than just Python checks.
Motivation
Provide a new way of writing checks to improve Agent performance and rely a bit less on the Python runtime.
Describe how you validated your changes
A few unit tests covering the new check loader and the shared library check implementation (the Go part).
An e2e test for Linux and Windows to load and run a simple shared library check that submits one metric.
Possible Drawbacks / Trade-offs
Additional Notes
The Rust part of this feature (where shared libraries are compiled from) is on this PR: