Skip to content

add sriov, power info support and pci id name resolution#1

Merged
jj-asama merged 2 commits intomasterfrom
pcidevice
Sep 19, 2025
Merged

add sriov, power info support and pci id name resolution#1
jj-asama merged 2 commits intomasterfrom
pcidevice

Conversation

@jj-asama
Copy link
Copy Markdown
Member

@jj-asama jj-asama commented Sep 19, 2025

Summary by CodeRabbit

  • New Features

    • Optional PCI ID name resolution for PCI device reporting (toggle via collector option; configurable IDs file).
    • New PCIe metrics: power state, D3cold allowance, and SR-IOV details (drivers_autoprobe, numvfs, totalvfs, vf_total_msix).
  • Tests

    • Added end-to-end test validating PCI name resolution and metrics against expected output.
  • Chores

    • Added PCI IDs and fixture data; expanded system-device fixtures with additional PCI devices and attributes.

Signed-off-by: Jain Johny <jj@asama.ai>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Sep 19, 2025

Walkthrough

Adds PCI device name resolution to the pcidevice collector via a pci.ids loader and flags, extends sysfs fixtures with a new PCI device and SR‑IOV entries, introduces new PCI state metrics, and adds a unit test comparing gathered metrics to expected output.

Changes

Cohort / File(s) Change Summary
PCI Fixtures
collector/fixtures/pci.ids, collector/fixtures/pcidevice-names-output.txt, collector/fixtures/sys.ttar
Added a pci.ids test fixture, expected Prometheus-format metrics exercising name resolution, and an expanded sys.ttar containing a new PCI device 0000:45:00.0 plus SR‑IOV, hwmon, i2c, ptp and driver metadata
PCI Device Collector
collector/pcidevice_linux.go
Added flags collector.pcidevice.idsfile and collector.pcidevice.names, PCI IDs loading/parsing, helper name-resolution functions, new descriptors for power/SR‑IOV metrics, nil-safe value handling, and integrated name labels into the info metric
PCI Device Test
collector/pcidevice_linux_test.go
New unit test (build tag !nopcidevice) that enables name resolution, points at fixture paths and fixtures/pci.ids, registers the collector, and compares gathered metrics to pcidevice-names-output.txt

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant Collector as pcidevice Collector
    participant PCIIds as pci.ids loader
    participant SysFS as fixture sysfs (sys.ttar)
    Note over Test,Collector: Test configures flags to enable names and set idsfile
    Test->>Collector: Start collector (Update)
    Collector->>PCIIds: loadPCIIds(path(s))
    PCIIds-->>Collector: parsed vendors/devices/classes
    Collector->>SysFS: read PCI device sysfs entries
    SysFS-->>Collector: device attributes, links, SR‑IOV info
    Collector->>Collector: resolve names (vendor/device/subsystem/class)
    Collector->>Test: emit Prometheus metrics (with name labels)
    Test->>Test: compare emitted metrics to `pcidevice-names-output.txt`
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibble through pci.ids with care,
Mapping hex to names with flair,
Devices sing their human tune,
From sysfs roots to metrics moon,
A rabbit cheers this hardware fair 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title accurately and concisely summarizes the primary changes in the changeset — adding SR‑IOV and power-related metrics support and enabling PCI ID name resolution — and directly maps to the updates in pcidevice_linux.go, fixtures, and tests. It is short, specific, and meaningful for a reviewer scanning history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pcidevice

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98bcb9f and 3fec521.

📒 Files selected for processing (1)
  • collector/pcidevice_linux.go (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • collector/pcidevice_linux.go

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (10)
collector/pcidevice_linux_test.go (3)

40-41: Remove commented code.

Line 40 contains commented-out code. If this alternative path is not needed, it should be removed rather than left as a comment.

 		"--collector.pcidevice",
 		"--collector.pcidevice.names",
-		//	"--collector.pcidevice.idsfile", "/usr/share/misc/pci.ids",
 		"--collector.pcidevice.idsfile", "fixtures/pci.ids",

72-85: Improve error handling in the Collect method.

The goroutine panics on error, which could crash the entire test suite. Consider proper error propagation instead.

 func (tc *testPCICollector) Collect(ch chan<- prometheus.Metric) {
 	sink := make(chan prometheus.Metric)
+	errCh := make(chan error, 1)
 	go func() {
 		err := tc.pc.Update(sink)
 		if err != nil {
-			panic(fmt.Errorf("failed to update collector: %s", err))
+			errCh <- fmt.Errorf("failed to update collector: %w", err)
+			close(sink)
+			return
 		}
+		errCh <- nil
 		close(sink)
 	}()
 
 	for m := range sink {
 		ch <- m
 	}
+	
+	if err := <-errCh; err != nil {
+		// Log error but don't panic - let the test framework handle it
+		fmt.Printf("Error in collector update: %v\n", err)
+	}
 }

32-44: Isolate kingpin flag parsing in tests — don't modify global CommandLine.

collector/pcidevice_linux_test.go (lines 32–44) calls kingpin.CommandLine.Parse; the same pattern appears in collector/paths_test.go, collector/processes_linux_test.go, collector/textfile_test.go, collector/ipvs_linux_test.go, and collector/filesystem_linux_test.go. Use a test-local kingpin app (kingpin.New) or reset/isolate CommandLine between tests, or avoid t.Parallel for these tests to prevent global-flag interference.

collector/fixtures/pci.ids (1)

26-26: File is missing a trailing newline.

POSIX-compliant text files should end with a newline character.

 17aa  Lenovo
+
collector/fixtures/pcidevice-names-output.txt (1)

74-74: File is missing a trailing newline.

POSIX-compliant text files should end with a newline character.

 node_pcidevice_sriov_vf_total_msix{bus="45",device="00",function="0",segment="0000"} 0
+
collector/pcidevice_linux.go (2)

326-487: Consider extracting PCI ID parsing logic to a separate package.

The loadPCIIds function is quite large (160+ lines) and handles multiple concerns. Consider refactoring this into a separate package or at least breaking it down into smaller helper functions for better maintainability and testability.

Would you like me to help refactor this function into smaller, more focused functions or create a separate PCI IDs parser package?


551-588: Improve error message for unknown class.

The function returns a formatted error string for unknown classes but not for other lookup functions. Consider consistent error handling across all helper functions.

 	// Return the original class ID if not found
-	return "Unknown class (" + classID + ")"
+	return classID // Return ID if name not found, consistent with other functions
 }
collector/fixtures/sys.ttar (3)

5110-5129: Fix sysfs modes for SR‑IOV attributes on PF 0000:01:00.0 (fidelity).

Kernel exposes read‑only SR‑IOV attributes as 0444. Here sriov_totalvfs and sriov_vf_total_msix are marked 0644. Please correct for fixture fidelity. Optionally add virtfn* symlinks if tests need VF enumeration.

Apply within this hunk:

 Path: sys/devices/pci0000:00/0000:00:02.1/0000:01:00.0/sriov_totalvfs
 Lines: 1
 8
-Mode: 644
+Mode: 444
@@
 Path: sys/devices/pci0000:00/0000:00:02.1/0000:01:00.0/sriov_vf_total_msix
 Lines: 1
 16
-Mode: 644
+Mode: 444

5488-5507: Normalize SR‑IOV attribute modes on upstream port 0000:00:02.1.

Same fidelity issue: sriov_totalvfs and sriov_vf_total_msix should be read‑only (0444).

 Path: sys/devices/pci0000:00/0000:00:02.1/sriov_totalvfs
 Lines: 1
 0
-Mode: 644
+Mode: 444
@@
 Path: sys/devices/pci0000:00/0000:00:02.1/sriov_vf_total_msix
 Lines: 1
 0
-Mode: 644
+Mode: 444

5936-6078: Great coverage on 0000:45:00.0; consider trimming/splitting the fixture to keep tests fast and focused.

The NIC subtree is very complete (AER, hwmon, i2c, PTP/ptp0, wakeup, VPD, power states). To reduce IO and parsing cost in unit tests:

  • Split this into smaller overlays (e.g., pci_sriov.ttar, pci_power.ttar, pci_ptp.ttar) and compose in tests that need them.
  • Drop large counters/statistics that aren’t asserted.
  • Keep at least one example per state variant (power/control=on vs auto, runtime_status=active vs suspended) — you already cover both across nodes.

Would you like a PR to add minimal overlays and update the tests to compose them?

Also applies to: 6111-6147, 6172-6213, 6318-6332, 7617-7666, 7714-7857, 7897-7969

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b959d48 and 98bcb9f.

📒 Files selected for processing (5)
  • collector/fixtures/pci.ids (1 hunks)
  • collector/fixtures/pcidevice-names-output.txt (1 hunks)
  • collector/fixtures/sys.ttar (4 hunks)
  • collector/pcidevice_linux.go (5 hunks)
  • collector/pcidevice_linux_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
collector/pcidevice_linux_test.go (2)
collector/pcidevice_linux.go (1)
  • NewPcideviceCollector (124-172)
collector/collector.go (1)
  • Collector (180-183)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (14)
collector/pcidevice_linux_test.go (2)

1-13: LGTM!

The copyright header and build tags are correctly formatted.


87-89: LGTM!

The no-op Describe method is appropriate for test collectors.

collector/fixtures/pci.ids (2)

1-3: LGTM!

Clear header comments explaining the purpose of this test fixture file.


4-11: LGTM!

The PCI class definitions follow the standard pci.ids format correctly with proper indentation for the hierarchy.

collector/fixtures/pcidevice-names-output.txt (2)

1-3: LGTM!

Clear documentation of the test fixture's purpose.


24-31: Good test coverage with varied examples.

The test fixture provides excellent coverage with three diverse PCI devices (AMD bridge, Micron NVMe, Intel network controller), testing different vendor/device combinations and name resolution scenarios.

collector/pcidevice_linux.go (6)

37-43: LGTM!

Good implementation of standard PCI ID file search paths and configuration flags.


68-102: LGTM!

Well-structured Prometheus metric descriptors with clear, informative descriptions for the new PCI device metrics.


130-171: LGTM!

The initialization logic correctly handles conditional label addition based on name resolution settings and properly initializes all metric descriptors.


193-211: LGTM!

Proper formatting of IDs to hex strings and correct conditional name resolution.


215-283: Good defensive programming with nil checks.

Excellent implementation of nil-safe access to device fields with appropriate fallback values. The consistent pattern of checking for nil before dereferencing pointers prevents potential panics.


304-324: LGTM!

Clean implementation of the power state enum to numeric conversion with proper handling of all enum values.

collector/fixtures/sys.ttar (2)

874-876: Good addition: bus->devices symlink for new PCI function.

This ensures 0000:45:00.0 is discoverable via sysfs canonical path and via sys/bus/pci/devices. LGTM.


7901-7934: PF 0000:45:00.0 SR‑IOV set looks good; add virtfn* symlinks if VF enumeration is scraped.

You have sriov_totalvfs/numvfs/stride/offset and sriov_vf_device. If the collector exports VF count only, this is sufficient. If it also walks VFs, add virtfn0..virtfnN symlinks.

Signed-off-by: Jain Johny <jj@asama.ai>
@jj-asama jj-asama merged commit 51db1b3 into master Sep 19, 2025
4 checks passed
jj-asama added a commit that referenced this pull request Oct 21, 2025
* add sriov, power info support and pci id name resolution

Signed-off-by: Jain Johny <jj@asama.ai>

* fix/remove debug lines

Signed-off-by: Jain Johny <jj@asama.ai>

---------

Signed-off-by: Jain Johny <jj@asama.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant