Improving Disk Metrics: Distinguishing Real Disks from Pseudo-Filesystems #48766
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77db8723d3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88f8818eb5
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d26066249
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64a81d68ce
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64a81d68ce
Force-pushed from d597365 to 8117521 (Compare)
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8117521ccd
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dba9fb8e69
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f4017d09a
@codex review
Codex Review: Didn't find any major issues. Already looking forward to the next diff.
Files inventory check summary
File checks results against ancestor 2996c913:
Results for datadog-agent_7.79.0~devel.git.452.d14baab.pipeline.106150563-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from static quality gates.
Successful checks
10 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector Results
Metrics dashboard
Baseline: 0d4f770
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -2.03 | [-5.10, +1.03] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +2.57 | [+0.89, +4.25] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +1.08 | [+0.97, +1.20] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.72 | [+0.49, +0.96] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.38 | [+0.35, +0.42] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.22 | [+0.05, +0.38] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.14 | [-0.09, +0.36] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.12 | [-0.07, +0.31] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.04 | [-0.02, +0.10] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.02 | [-0.13, +0.17] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.10, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.22, +0.19] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.02 | [-0.22, +0.18] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.02 | [-0.07, +0.03] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.03 | [-0.09, +0.03] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.07 | [-0.50, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.07 | [-0.17, +0.04] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.09 | [-0.64, +0.46] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.10 | [-0.49, +0.29] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.18 | [-0.26, -0.10] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.35 | [-0.42, -0.28] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.62 | [-0.78, -0.46] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.95 | [-1.12, -0.77] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -2.03 | [-5.10, +1.03] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 719 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 272.06MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 678 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 171.80MiB ≤ 181MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 491.16MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 208.82MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 340.34 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 422.88MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
nathan-b left a comment
I don't see a test for include_all_devices: false with tag_by_physical_storage: true to ensure that no is_physical_storage:false tags appear.
Good catch! The test
Add two new opt-in configuration options for the disk check:
- `tag_by_physical_storage`: adds an `is_physical_storage:true/false` tag to every `system.disk.*` metric
- `collect_physical_metrics`: emits `system.disk.physical_{total,used,free,utilized,in_use}` metrics for physical devices only
Both default to false, preserving backward compatibility (no extra syscalls or behavioral changes when disabled).
Closes #5921
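To make the opt-in behavior concrete, here is a minimal sketch of how the two options could drive emission. The `diskConfig` struct, `emitUsage` helper, and the exact formula for `physical_in_use` are illustrative assumptions, not the PR's code; only the option and metric names come from the change itself.

```go
package diskcheck

import "fmt"

// sender is a minimal stand-in for the agent's metric sender interface.
type sender interface {
	Gauge(metric string, value float64, hostname string, tags []string)
}

// diskConfig holds the two new opt-in options (hypothetical struct).
type diskConfig struct {
	TagByPhysicalStorage   bool // add is_physical_storage:true/false to system.disk.* tags
	CollectPhysicalMetrics bool // emit system.disk.physical_* for physical devices only
}

// emitUsage shows the intended flow: base metrics are always sent unchanged,
// while the tag and the physical_* series appear only when explicitly enabled.
func emitUsage(s sender, cfg diskConfig, device string, physical bool, total, used, free float64) {
	tags := []string{"device:" + device}
	if cfg.TagByPhysicalStorage {
		tags = append(tags, fmt.Sprintf("is_physical_storage:%t", physical))
	}

	// Unchanged base series: emitted exactly as before the change.
	s.Gauge("system.disk.total", total, "", tags)
	s.Gauge("system.disk.used", used, "", tags)
	s.Gauge("system.disk.free", free, "", tags)

	// New physical-only series, gated on the option and on classification.
	if cfg.CollectPhysicalMetrics && physical {
		s.Gauge("system.disk.physical_total", total, "", tags)
		s.Gauge("system.disk.physical_used", used, "", tags)
		s.Gauge("system.disk.physical_free", free, "", tags)
		if total > 0 {
			// physical_in_use shown as a fraction here; the check's exact
			// formula (and physical_utilized) may differ.
			s.Gauge("system.disk.physical_in_use", used/total, "", tags)
		}
	}
}
```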
…l disk classification
Address two review comments on PR #48766:
- Classify bind mounts by device name so partitions sharing a physical device (different mountpoint) are tagged as physical instead of being silently dropped.
- Handle partial success from the all-partitions syscall: when some partitions are returned alongside an error, classify the available results instead of discarding them.
Add three tests covering the new error/edge-case paths.
When getDiskPartitionsWithTimeout(false) returns a partial result with an error, the physicalDevices set is incomplete. Classifying all=true results against it would tag real physical disks as non-physical. Skip the all=true classification branch in this case and report only the physical partitions we successfully retrieved.
The goroutine clears the in-flight guard via defer after sending to the buffered result channel. A sequential caller can receive the result and re-enter getDiskPartitionsWithTimeout before the defer runs, hitting the guard spuriously. Clear the flag eagerly on the success path; the goroutine's deferred Store is redundant but harmless.
…w options
- Move partitionEnumInFlight.Store(false) before the channel send so sequential calls within the same check run never hit the guard. The previous receiver-side Store introduced a race where call #1's defer could clear the flag while call #2 was active.
- Disable tag_by_physical_storage and collect_physical_metrics at config time on Windows, where gopsutil ignores the all parameter and both syscalls return identical results.
- Add new options to conf.yaml.default with platform support notes.
- Add Windows gate test.
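A minimal sketch of the guard-and-timeout pattern these two commits describe, reflecting the final fix of clearing the in-flight flag inside the goroutine before the channel send. The timeout value and identifier names are assumptions for illustration; only the overall shape (atomic guard, buffered result channel, gopsutil call) follows the description above.

```go
package diskcheck

import (
	"errors"
	"sync/atomic"
	"time"

	"github.com/shirou/gopsutil/v3/disk"
)

// partitionEnumInFlight guards against overlapping partition enumerations
// while a previous syscall is still hanging.
var partitionEnumInFlight atomic.Bool

const enumTimeout = 5 * time.Second // illustrative value

type partitionsResult struct {
	parts []disk.PartitionStat
	err   error
}

func getDiskPartitionsWithTimeout(all bool) ([]disk.PartitionStat, error) {
	if !partitionEnumInFlight.CompareAndSwap(false, true) {
		return nil, errors.New("disk partition enumeration already in flight")
	}
	ch := make(chan partitionsResult, 1) // buffered so the goroutine never blocks on send
	go func() {
		parts, err := disk.Partitions(all)
		// Clear the guard BEFORE sending the result: a sequential caller that
		// receives the result and immediately re-enters must never see a
		// stale "in flight" flag (the race described above).
		partitionEnumInFlight.Store(false)
		ch <- partitionsResult{parts: parts, err: err}
	}()
	select {
	case res := <-ch:
		return res.parts, res.err
	case <-time.After(enumTimeout):
		// The goroutine still clears the guard whenever the syscall returns.
		return nil, errors.New("timed out enumerating disk partitions")
	}
}
```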
When classification is enabled and the physical scan returns empty (container-like host) while the all-partitions scan fails completely, return the error instead of silently returning empty slices with nil. This preserves error visibility so the check does not report success while emitting zero metrics.
Remove the physicalScanPartial guard that skipped classification when the physical-only scan returned partial results. Skipping the all=true scan entirely caused base system.disk.* metrics to be dropped for non-physical partitions, which is worse than the transient misclassification it was trying to prevent. Classification now always proceeds; during a partial physical scan some physical partitions may be temporarily tagged as non-physical until the next successful run.
When the physical-only partition scan (all=false) returns partial results, partitions not in that incomplete set were incorrectly tagged is_physical_storage:false. Introduce an "unclassified" category so these partitions still emit base system.disk.* metrics but receive no is_physical_storage tag until classification data is reliable.
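A sketch of the resulting three-way classification, keyed by device name so that bind mounts sharing a physical device are still recognized as physical. The identifiers, including the `physicalScanPartial` flag, are illustrative names rather than the check's exact ones.

```go
package diskcheck

import "github.com/shirou/gopsutil/v3/disk"

type classification int

const (
	physicalStorage classification = iota // present in the all=false (physical-only) result set
	nonPhysical                           // definitely not physical: tagged is_physical_storage:false
	unclassified                          // physical scan was partial: emit base metrics, no tag
)

// classifyPartitions maps every device from the all=true scan to a category
// using the device names seen in the all=false (physical-only) scan.
func classifyPartitions(physicalParts, allParts []disk.PartitionStat, physicalScanPartial bool) map[string]classification {
	physicalDevices := make(map[string]struct{}, len(physicalParts))
	for _, p := range physicalParts {
		// Keyed by device name, so a bind mount sharing a physical device
		// (different mountpoint) still counts as physical.
		physicalDevices[p.Device] = struct{}{}
	}

	out := make(map[string]classification, len(allParts))
	for _, p := range allParts {
		switch {
		case hasDevice(physicalDevices, p.Device):
			out[p.Device] = physicalStorage
		case physicalScanPartial:
			// Absence from an incomplete set proves nothing; defer tagging
			// until the next successful physical scan.
			out[p.Device] = unclassified
		default:
			out[p.Device] = nonPhysical
		}
	}
	return out
}

func hasDevice(set map[string]struct{}, device string) bool {
	_, ok := set[device]
	return ok
}
```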
…all_devices:false
The existing test already covered the scenario but lacked an explicit assertion that no is_physical_storage:false tags appear when include_all_devices is disabled with tag_by_physical_storage enabled.
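A hypothetical sketch of the property that assertion checks. The real test uses the agent's mock sender (see the AssertMetricTaggedWith note further down); this version stubs both the sender and the check entry point purely to show the shape of the check.

```go
package diskcheck

import "testing"

// recordingSender captures every tag set passed to Gauge.
type recordingSender struct {
	tagSets [][]string
}

func (r *recordingSender) Gauge(metric string, value float64, hostname string, tags []string) {
	r.tagSets = append(r.tagSets, tags)
}

// runDiskCheck stands in for running the configured check against the stub
// sender; it is a placeholder, not the check's real entry point.
var runDiskCheck = func(s *recordingSender, tagByPhysicalStorage, includeAllDevices bool) {}

func TestNoFalsePhysicalTagWhenAllDevicesExcluded(t *testing.T) {
	s := &recordingSender{}
	// include_all_devices: false together with tag_by_physical_storage: true
	runDiskCheck(s, true, false)

	// With include_all_devices disabled, only physical partitions are reported,
	// so is_physical_storage:false must never appear on any emitted metric.
	for _, tags := range s.tagSets {
		for _, tag := range tags {
			if tag == "is_physical_storage:false" {
				t.Fatalf("unexpected tag %q with include_all_devices: false", tag)
			}
		}
	}
}
```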
Force-pushed from 2c59844 to 59a018f (Compare)
/merge
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
The expected merge time in
Use AssertMetricTaggedWith with device: tag instead of AssertMetric with exact device_name: tag matching. baseDeviceName() differs between Linux (filepath.Base) and Windows (trim backslashes), so device_name:sda1 vs device_name:/dev/sda1 caused CI failure on Windows.
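The platform difference referenced in that commit, sketched for illustration (the real helper lives in the check's sources):

```go
package diskcheck

import (
	"path/filepath"
	"runtime"
	"strings"
)

// baseDeviceName normalizes a raw device path into the device_name tag value.
// On Linux "/dev/sda1" becomes "sda1" via filepath.Base; on Windows only the
// surrounding backslashes are trimmed, so the same fixture yields different
// tag values per platform -- hence the switch to asserting on the device: tag.
func baseDeviceName(device string) string {
	if runtime.GOOS == "windows" {
		return strings.Trim(device, `\`)
	}
	return filepath.Base(device)
}
```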
All contributors have signed the CLA ✍️ ✅
Merged 3e0a1a6 into main
What does this PR do?
- Adds an `is_physical_storage` tag to every `system.disk.*` metric if the `tag_by_physical_storage` configuration option (defaults to `false`) is enabled
- Emits `system.disk.physical_total`, `system.disk.physical_used`, `system.disk.physical_free`, `system.disk.physical_utilized`, and `system.disk.physical_in_use` if the `collect_physical_metrics` configuration option (defaults to `false`) is enabled
Both options default to `false`, preserving full backward compatibility: no extra syscalls or behavioral changes when disabled.
Motivation
#5921
Describe how you validated your changes
- `tag_by_physical_storage`: physical and non-physical partitions tagged correctly
- `collect_physical_metrics`: physical metrics emitted only for real devices
- `include_all_devices: false`: only physical partitions reported
- All-partitions scan (`all=true`) complete failure: graceful degradation, only physical reported
- All-partitions scan (`all=true`) partial failure: non-physical partitions still classified from available results
- Physical-only scan (`all=false`) complete failure: check returns error
- Physical-only scan (`all=false`) partial failure: classification skipped to avoid misclassifying real disks as non-physical
Possible Drawbacks / Trade-offs
gopsutil ignores the `all` parameter on Windows and macOS, so classification is automatically disabled on those platforms with a log warning.
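A rough sketch of the config-time gate described in the commit notes and in the trade-offs above, using the standard library logger in place of the agent's and the same hypothetical `diskConfig` struct as the earlier sketch; method name and placement are assumptions.

```go
package diskcheck

import (
	"log"
	"runtime"
)

type diskConfig struct {
	TagByPhysicalStorage   bool
	CollectPhysicalMetrics bool
}

// applyPlatformGates disables classification where it cannot work: gopsutil
// ignores the `all` parameter on Windows and macOS, so the all=false and
// all=true scans return identical partitions and classification is meaningless.
func (c *diskConfig) applyPlatformGates() {
	if runtime.GOOS != "windows" && runtime.GOOS != "darwin" {
		return
	}
	if c.TagByPhysicalStorage || c.CollectPhysicalMetrics {
		log.Printf("warning: tag_by_physical_storage and collect_physical_metrics are not supported on %s; disabling", runtime.GOOS)
	}
	c.TagByPhysicalStorage = false
	c.CollectPhysicalMetrics = false
}
```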