Skip to content

kubeflow notebook collector#168

Merged
sandipanpanda merged 5 commits into
mainfrom
ph/kubeflow-notebooks
Sep 27, 2025
Merged

kubeflow notebook collector#168
sandipanpanda merged 5 commits into
mainfrom
ph/kubeflow-notebooks

Conversation

@Parthiba-Hazra
Copy link
Copy Markdown
Collaborator

@Parthiba-Hazra Parthiba-Hazra commented Sep 26, 2025

Summary by CodeRabbit

  • New Features

    • Collect Kubeflow Notebook resources with exclusion controls and dynamic enable/disable.
    • Add a Node Metadata API endpoint for retrieving node-to-node metadata.
    • Expand supported Kubernetes resource coverage with several additional resource types.
  • Tests

    • Add mocks and handlers to exercise the new Node Metadata API.
  • Chores

    • Upgrade controller tooling and update CRD annotations.
    • Restrict Kubernetes RBAC, granting read-only access for Kubeflow Notebooks and removing broad core permissions.
    • Bump multiple Go module dependencies.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Sep 26, 2025

Walkthrough

Adds Kubeflow Notebook resource support: new ResourceType and dynamic KubeflowNotebookCollector with exclusions/batching, controller wiring (config, registration, restart, enable/disable), RBAC changes, proto enum + NodeMetadata RPC, Makefile/CRD tooling bump, and dependency updates.

Changes

Cohort / File(s) Summary
Build & Tooling Version Bump
Makefile, config/crd/bases/devzero.io_collectionpolicies.yaml
Bumps CONTROLLER_TOOLS_VERSION from v0.16.1 to v0.16.5 and updates CRD Kubebuilder annotation.
RBAC Adjustments
config/rbac/role.yaml
Removes broad core API permissions and adds kubeflow.org notebooks get/list/watch rules.
Collector Core & Types
internal/collector/interface.go, internal/collector/types.go
Adds KubeflowNotebook ResourceType, string and proto mappings, and includes it in AllResourceTypes along with other resource-type additions.
Kubeflow Notebook Collector
internal/collector/kubeflow_notebook_collector.go
New KubeflowNotebookCollector and ExcludedKubeflowNotebook types; constructor, Start/Stop, event handlers, batching, exclusions, availability probe, resource channel, and AddResource.
Controller Integration
internal/controller/collectionpolicy_controller.go
Adds ExcludedKubeflowNotebooks to PolicyConfig; wires registration, selective restart, enable/disable handling, and affected-collector detection for kubeflow_notebook.
Public Proto API & Handlers
proto/metrics_collector.proto, internal/transport/connect_compression_test.go, test/testserver/main.go
Adds enum RESOURCE_TYPE_KUBEFLOW_NOTEBOOK = 49, new NodeMetadata RPC and messages (NodeMetadataRequest, NodeMetadataResponse); test and testserver handlers implemented returning empty NodeMetadataResponse.
Dependencies
go.mod
Upgrades multiple dependencies (otel, grpc, protobuf, golang.org/x/*), adds google.golang.org/genproto pseudo-version and other indirect bumps.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Reconciler as Controller Reconciler
  participant Registry as Collector Registry
  participant KFNC as KubeflowNotebookCollector
  participant K8s as K8s API (dynamic)
  participant Batcher as Batcher
  participant Stream as Resource Channel

  rect rgba(230,245,255,0.6)
  Reconciler->>Registry: registerResourceCollectors(config)
  alt kubeflow_notebook enabled & available
    Registry->>KFNC: New(namespaces, excludes...)
    KFNC->>K8s: Create dynamic informer (kubeflow.org/v1 Notebooks)
    KFNC->>KFNC: Start informer & handlers
    KFNC->>Batcher: Start batching
  else disabled or unavailable
    Registry-->>Reconciler: skip registration
  end
  end

  rect rgba(235,255,235,0.6)
  K8s-->>KFNC: Add/Update/Delete Notebook event
  KFNC->>KFNC: Apply namespace/exclusion filter
  KFNC->>Batcher: enqueue CollectedResource{type=kubeflow_notebook,event}
  Batcher-->>Stream: emit []CollectedResource
  end
Loading
sequenceDiagram
  autonumber
  participant Client as Client
  participant Svc as MetricsCollectorService
  participant Store as Node Metadata Source

  Client->>Svc: NodeMetadata(NodeMetadataRequest)
  Svc->>Store: fetch node metadata for team_id, cluster_id
  Store-->>Svc: map<string, Node>
  Svc-->>Client: NodeMetadataResponse{node_to_meta}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement

Suggested reviewers

  • dray92

Poem

I thump where clusters softly bob,
Notebooks hop from pod to blob,
Informers watch, exclusions keep,
Batches bundle dreams to sleep,
Proto pings — the data hops, hooray! 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly identifies the primary feature introduced by the changeset—the Kubeflow Notebook collector—and is directly aligned with the main additions in the pull request without extraneous detail. It is concise and specific enough for readers to understand the key change at a glance.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ph/kubeflow-notebooks

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eee2425 and 2c22b65.

⛔ Files ignored due to path filters (3)
  • gen/api/v1/common.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/metrics_collector.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • proto/dakr_proto_descriptor.bin is excluded by !**/*.bin
📒 Files selected for processing (2)
  • internal/collector/interface.go (3 hunks)
  • proto/metrics_collector.proto (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/collector/interface.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Package & Push Docker image
  • GitHub Check: Build Docker Image
  • GitHub Check: Run make test
  • GitHub Check: Build Docker Image
  • GitHub Check: Analyze (go)

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/collectionpolicy_controller.go (1)

555-576: Populate ExcludedKubeflowNotebooks from the spec

Line 559 adds ExcludedKubeflowNotebooks to PolicyConfig, but createNewConfig never copies the notebooks listed in the CRD/env spec into this field. As a result the exclusion list is always empty, so the new selective-restart trigger never fires and the Kubeflow notebook collector will keep watching notebooks the user asked us to skip. Please wire the merge logic just like the other exclusion lists.

@@
 	// CSI
 	newConfig.ExcludedCSIDrivers = envSpec.Exclusions.ExcludedCSIDrivers
 	for _, csc := range envSpec.Exclusions.ExcludedCSIStorageCapacities {
 		newConfig.ExcludedCSIStorageCapacities = append(newConfig.ExcludedCSIStorageCapacities, collector.ExcludedCSIStorageCapacity{
 			Namespace: csc.Namespace,
 			Name:      csc.Name,
 		})
 	}
 	newConfig.ExcludedVolumeAttachments = envSpec.Exclusions.ExcludedVolumeAttachments
+
+	// Kubeflow Notebooks
+	for _, nb := range envSpec.Exclusions.ExcludedKubeflowNotebooks {
+		newConfig.ExcludedKubeflowNotebooks = append(newConfig.ExcludedKubeflowNotebooks, collector.ExcludedKubeflowNotebook{
+			Namespace: nb.Namespace,
+			Name:      nb.Name,
+		})
+	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f47a08 and 156d381.

⛔ Files ignored due to path filters (6)
  • gen/api/v1/apiv1connect/metrics_collector.connect.go is excluded by !**/gen/**
  • gen/api/v1/common.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/metrics_collector.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/metrics_collector_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/google/type/money.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • proto/dakr_proto_descriptor.bin is excluded by !**/*.bin
📒 Files selected for processing (8)
  • Makefile (1 hunks)
  • config/crd/bases/devzero.io_collectionpolicies.yaml (1 hunks)
  • config/rbac/role.yaml (2 hunks)
  • internal/collector/interface.go (3 hunks)
  • internal/collector/kubeflow_notebook_collector.go (1 hunks)
  • internal/collector/types.go (1 hunks)
  • internal/controller/collectionpolicy_controller.go (6 hunks)
  • proto/metrics_collector.proto (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
internal/collector/interface.go (1)
gen/api/v1/metrics_collector.pb.go (1)
  • ResourceType_RESOURCE_TYPE_KUBEFLOW_NOTEBOOKS (172-172)
internal/collector/types.go (2)
internal/collector/interface.go (12)
  • Namespace (95-95)
  • CSINode (130-130)
  • Karpenter (131-131)
  • Datadog (132-132)
  • ArgoRollouts (133-133)
  • KedaScaledJob (135-135)
  • KedaScaledObject (136-136)
  • ClusterSnapshot (137-137)
  • CSIDriver (138-138)
  • CSIStorageCapacity (139-139)
  • VolumeAttachment (140-140)
  • KubeflowNotebook (141-141)
gen/api/v1/metrics_collector.pb.go (6)
  • Namespace (1465-1499)
  • Namespace (1514-1514)
  • Namespace (1529-1531)
  • ClusterSnapshot (1909-1920)
  • ClusterSnapshot (1935-1935)
  • ClusterSnapshot (1950-1952)
internal/collector/kubeflow_notebook_collector.go (3)
internal/collector/interface.go (7)
  • CollectedResource (305-321)
  • EventTypeAdd (18-18)
  • EventTypeUpdate (20-20)
  • EventTypeDelete (22-22)
  • EventType (12-12)
  • ResourceType (87-87)
  • KubeflowNotebook (141-141)
internal/collector/batcher.go (2)
  • ResourcesBatcher (23-31)
  • NewResourcesBatcher (34-55)
internal/logger/logger.go (1)
  • Logger (26-29)
internal/controller/collectionpolicy_controller.go (3)
internal/collector/kubeflow_notebook_collector.go (2)
  • ExcludedKubeflowNotebook (39-42)
  • NewKubeflowNotebookCollector (45-87)
internal/collector/batcher.go (2)
  • DefaultMaxBatchSize (16-16)
  • DefaultMaxBatchTime (19-19)
internal/collector/interface.go (1)
  • KubeflowNotebook (141-141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (6)
config/rbac/role.yaml (1)

7-34: Restore core API RBAC permissions

This change drops the manager’s get/list/watch on core resources (pods, nodes, services, PVCs, etc.). Every existing collector still issues those LIST/WATCH calls, so at runtime we’ll start getting 403s (forbidden: User "system:serviceaccount:..." cannot list resource "pods" in API group ""). That breaks virtually all collection flows, not just the new Kubeflow notebook support. Please keep the original core RBAC block and then layer the new Kubeflow rule on top.

Apply this diff to restore the required permissions:

+- apiGroups:
+  - ""
+  resources:
+  - configmaps
+  - endpoints
+  - events
+  - limitranges
+  - namespaces
+  - nodes
+  - persistentvolumeclaims
+  - persistentvolumes
+  - pods
+  - replicationcontrollers
+  - resourcequotas
+  - serviceaccounts
+  - services
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - ""
+  resources:
+  - nodes/metrics
+  - nodes/status
+  - pods/status
+  verbs:
+  - get

Likely an incorrect or invalid review comment.

config/crd/bases/devzero.io_collectionpolicies.yaml (1)

6-6: Version annotation alignment looks good

Controller-gen annotation now matches the toolchain bump in the Makefile, so future regenerations will stay in sync.

Makefile (1)

505-505: Controller-tools bump acknowledged

The version update to v0.16.5 lines up with the CRD annotation change and keeps go install pulls consistent.

internal/collector/types.go (1)

13-15: KubeflowNotebook inclusion in AllResourceTypes looks right

Great to see KubeflowNotebook (and the adjacent late additions) surfaced here—this keeps the collector registry aware of the new resource.

internal/collector/interface.go (1)

141-142: KubeflowNotebook constant/string/proto wiring is consistent

The enum addition, string mapping, and proto mapping are all in lockstep, so downstream translations should just work.

Also applies to: 193-194, 297-298

proto/metrics_collector.proto (1)

8-9: Proto additions cover Kubeflow notebooks and node metadata cleanly

Importing common.proto, adding the notebook resource type, and exposing the NodeMetadata RPC/messages gives the API surface everything it needs.

Also applies to: 122-123, 443-453

Comment on lines +320 to +348
c.logger.Info("Stopping informer", "resource", key)
close(stopCh)
}

c.informers = make(map[string]cache.SharedIndexInformer)
c.informerStopChs = make(map[string]chan struct{})

// Close the main stop channel (signals informers to stop)
select {
case <-c.stopCh:
c.logger.Info("Kubeflow Notebook collector stop channel already closed")
default:
close(c.stopCh)
c.logger.Info("Closed Kubeflow Notebook collector stop channel")
}

// Close the batchChan (input to the batcher).
if c.batchChan != nil {
close(c.batchChan)
c.batchChan = nil
c.logger.Info("Closed Kubeflow Notebook collector batch input channel")
}

// Stop the batcher (waits for completion).
if c.batcher != nil {
c.batcher.stop()
c.logger.Info("Kubeflow Notebook collector batcher stopped")
}
// resourceChan is closed by the batcher's defer func.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Guard batch channel closes during Stop

Stop closes batchChan while informer event handlers may still be executing. If one of those handlers calls handleNotebookEvent after the close, we panic with “send on closed channel.” Please wait for in‑flight handlers to drain before closing the channel; a simple wait group around the handler suffices.

 type KubeflowNotebookCollector struct {
 	dynamicClient     dynamic.Interface
@@
 	telemetryLogger   telemetry_logger.Logger
 	mu                sync.RWMutex
+	wg                sync.WaitGroup
 }
@@
 func (c *KubeflowNotebookCollector) handleNotebookEvent(obj *unstructured.Unstructured, eventType EventType) {
+	c.wg.Add(1)
+	defer c.wg.Done()
+
 	name := obj.GetName()
 	namespace := obj.GetNamespace()
@@
 func (c *KubeflowNotebookCollector) Stop() error {
 	c.logger.Info("Stopping Kubeflow Notebook collector")
@@
 		close(stopCh)
 	}
 
+	// Wait for any in-flight handlers to finish before closing channels.
+	c.wg.Wait()
+
 	c.informers = make(map[string]cache.SharedIndexInformer)
 	c.informerStopChs = make(map[string]chan struct{})
@@
 	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
🤖 Prompt for AI Agents
internal/collector/kubeflow_notebook_collector.go lines 320-348: Stop currently
closes batchChan while informer event handlers may still be running, causing
“send on closed channel” panics; add a sync.WaitGroup field to the collector,
have each informer event handler Add(1) at start and Done() at exit (or defer
Done) around calls to handleNotebookEvent, and in Stop call wg.Wait() (with a
reasonable ordering: stop informers first or signal them to stop, then wait for
the wg) before closing and nil-ing batchChan; ensure Start/Stop initialization
makes the wg available and handlers check for nil/closed channel where
appropriate to avoid races.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 156d381 and bd5d524.

⛔ Files ignored due to path filters (5)
  • dist/backend-install.yaml is excluded by !**/dist/**
  • dist/install.yaml is excluded by !**/dist/**
  • dist/installer_updater.yaml is excluded by !**/dist/**
  • dist/zxporter.yaml is excluded by !**/dist/**
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (1)
  • go.mod (4 hunks)

Comment thread go.mod
go 1.23.8

toolchain go1.24.0
go 1.24.0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix Go directive version format

The go directive must use the major.minor release format. With go 1.24.0, the Go toolchain rejects the module (go: go.mod: go 1.24.0: invalid go version) and prevents builds. Please drop the patch component.

-go 1.24.0
+go 1.24
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
go 1.24.0
go 1.24
🤖 Prompt for AI Agents
In go.mod around line 3, the go directive uses an invalid version format "go
1.24.0"; update it to the major.minor format by removing the patch component so
it reads "go 1.24" to satisfy the Go toolchain and allow builds.

@sandipanpanda sandipanpanda merged commit 08fd673 into main Sep 27, 2025
27 checks passed
@sandipanpanda sandipanpanda deleted the ph/kubeflow-notebooks branch September 27, 2025 01:07
@coderabbitai coderabbitai Bot mentioned this pull request Dec 13, 2025
Parthiba-Hazra added a commit that referenced this pull request May 5, 2026
* kubeflow notebook collector

* make build installer

* emty response NodeMetadata method

* log

* proto changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants