
add learning period label to TS CPs #797

Merged
matthyx merged 1 commit into main from learning-period-label on Apr 27, 2026

Conversation

matthyx (Contributor) commented Apr 27, 2026

Summary by CodeRabbit

  • New Features
    • Added a learning period duration label to container metadata.
  • Chores
    • Updated dependency versions.


coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This change adds a learning period tracking capability to the node-agent. It bumps a dependency version, computes a sniffing duration and stores it in shared container state, exposes this duration as a metadata label, and refactors label validation logic to apply overrides before final sanitization.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Update: go.mod | Bumps github.com/kubescape/k8s-interface from v0.0.206 to v0.0.207. |
| Learning Period Integration: pkg/containerprofilemanager/v1/lifecycle.go, pkg/objectcache/shared_container_data.go | Adds LearningPeriod field to WatchedContainerData struct, introduces formatDuration helper to format duration strings, and assigns computed sniffing duration to shared container state. Refactors label generation to apply overrides before a single final sanitization pass (previously sanitized per label). |
| Learning Period Tests: pkg/objectcache/shared_container_data_test.go | Adds validation for the new kubescape.io/learning-period metadata label in existing test scenarios and introduces a Test_formatDuration unit test with table-driven subtests covering various duration inputs. |
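
For orientation, here is a minimal Go sketch of the mechanism summarized above: compact a duration and expose it as the kubescape.io/learning-period metadata label. The compaction mirrors the string-replacement behaviour discussed in the review comment further down; the surrounding wiring (the sniffing-duration value and the labels map) is illustrative only, not the repository's exact code.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// formatDuration compacts time.Duration.String() output, e.g. "1h0m0s" -> "1h",
// "5m0s" -> "5m" (sketch of the behaviour described below, not verified code).
func formatDuration(d time.Duration) string {
	s := d.String()
	s = strings.Replace(s, "m0s", "m", 1)
	s = strings.Replace(s, "h0m", "h", 1)
	return s
}

func main() {
	// Hypothetical wiring: the computed sniffing duration becomes a metadata label.
	sniffingDuration := 90 * time.Minute
	labels := map[string]string{
		"kubescape.io/learning-period": formatDuration(sniffingDuration),
	}
	fmt.Println(labels) // map[kubescape.io/learning-period:1h30m]
}
```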

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • jnathangreeg

Poem

🐰 A learning period takes its place,
In containers' metadata space,
With durations formatted tight,
The sniffing time shines so bright,
A rabbit observes with delight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'add learning period label to TS CPs' accurately describes the main change: adding a new LearningPeriod label to WatchedContainerData, which is directly reflected in the code modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



matthyx force-pushed the learning-period-label branch from 4943be5 to a715f42 on April 27, 2026 09:27
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.000 | 0.000 | N/A |
| Peak CPU (cores) | 0.000 | 0.000 | N/A |
| Avg Memory (MiB) | 0.000 | 0.000 | N/A |
| Peak Memory (MiB) | 0.000 | 0.000 | N/A |
Dedup Effectiveness

No data available.

@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.159 | 0.154 | -3.2% |
| Peak CPU (cores) | 0.164 | 0.161 | -1.9% |
| Avg Memory (MiB) | 332.361 | 258.823 | -22.1% |
| Peak Memory (MiB) | 338.590 | 264.832 | -21.8% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 1 | 0 | 0.0% |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119455 | 98.6% |
| network | 899 | 77938 | 98.9% |
| open | 34498 | 621834 | 94.7% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 982 | 1889 | 65.8% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 8 |
| dns_counter | 1468 | 1398 |
| exec_counter | 7346 | 7024 |
| network_counter | 96599 | 92334 |
| open_counter | 805808 | 768157 |
| syscall_counter | 3499 | 3502 |

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
matthyx force-pushed the learning-period-label branch from a715f42 to 93a5542 on April 27, 2026 10:23

coderabbitai (bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/objectcache/shared_container_data.go (1)

94-99: formatDuration works but is brittle to representation changes.

The two strings.Replace calls correctly compact time.Duration.String() output for the cases covered by Test_formatDuration (verified for 5m, 1h30m, 45s, 1h30s, 1h, and 0). However, it relies on the exact format of time.Duration.String() (e.g., it would also strip m0s from inside fractional-second strings like "1h0m0.5s" only partially). For the duration ranges this codebase uses (sniffing time in minutes/hours), this is acceptable — flagging only as a heads-up for future maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/shared_container_data.go` around lines 94 - 99,
formatDuration currently massages time.Duration.String() output with brittle
string replacements (m0s/h0m) which can break for fractional seconds or
unexpected representations; update formatDuration to build the output from
duration components instead of post-processing the string: compute hours,
minutes and seconds (using d.Hours()/Minutes()/Seconds() or integer
division/modulo on d.Seconds()/time.Second) and format them conditionally (e.g.,
print hours if >0, minutes if >0, seconds otherwise) so outputs like "1h", "5m",
"1h30m", "45s", "1h30s", and "0s" are produced deterministically without relying
on time.Duration.String(). Ensure the logic lives in formatDuration and
preserves existing test cases.
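
For reference, a minimal sketch of the component-based formatting the prompt suggests, assuming the same expected outputs ("1h", "5m", "1h30m", "45s", "1h30s", "0s") as Test_formatDuration; illustrative only, not the repository's code:

```go
package main

import (
	"fmt"
	"time"
)

// formatDurationFromComponents builds the string from hour/minute/second
// components instead of post-processing time.Duration.String().
func formatDurationFromComponents(d time.Duration) string {
	d = d.Round(time.Second)
	h := int(d / time.Hour)
	m := int(d % time.Hour / time.Minute)
	s := int(d % time.Minute / time.Second)
	out := ""
	if h > 0 {
		out += fmt.Sprintf("%dh", h)
	}
	if m > 0 {
		out += fmt.Sprintf("%dm", m)
	}
	if s > 0 || out == "" {
		out += fmt.Sprintf("%ds", s)
	}
	return out
}

func main() {
	// Prints: 1h, 5m, 1h30m, 45s, 1h30s, 0s
	for _, d := range []time.Duration{time.Hour, 5 * time.Minute, 90 * time.Minute,
		45 * time.Second, time.Hour + 30*time.Second, 0} {
		fmt.Println(formatDurationFromComponents(d))
	}
}
```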

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 989fc585-c048-499a-94f8-387a3c256e64

📥 Commits

Reviewing files that changed from the base of the PR and between 4aefa65 and 93a5542.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • go.mod
  • pkg/containerprofilemanager/v1/lifecycle.go
  • pkg/objectcache/shared_container_data.go
  • pkg/objectcache/shared_container_data_test.go

Comment on lines 104 to 107

```diff
 	// Apply label overrides
 	for k, v := range watchedContainer.LabelOverrides {
-		if v == "" {
-			delete(labels, k)
-		} else if errs := content.IsLabelValue(v); len(errs) != 0 {
-			logger.L().Warning("GetLabels - label override value is not valid, skipping", helpers.String("key", k), helpers.String("value", v))
-			delete(labels, k)
-		} else {
-			labels[k] = v
-		}
+		labels[k] = v
 	}
```

⚠️ Potential issue | 🟡 Minor

Behavior change: invalid LabelOverrides now drop the base label too.

Previously, overrides were validated per-entry and invalid overrides were discarded so the underlying base label remained. With the single-pass sanitization at Lines 127–139, an invalid override value now first overwrites the base value at Line 106 and is then removed entirely at Line 137 — so the resulting label map is missing the key in cases where it would have kept the (valid) base value before.

If overrides are expected to be trusted/pre-validated this is fine; otherwise consider validating overrides before applying:

♻️ Suggested fix

```diff
 	// Apply label overrides
 	for k, v := range watchedContainer.LabelOverrides {
+		if v == "" || len(content.IsLabelValue(v)) != 0 {
+			continue
+		}
 		labels[k] = v
 	}
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-	// Apply label overrides
-	for k, v := range watchedContainer.LabelOverrides {
-		if v == "" {
-			delete(labels, k)
-		} else if errs := content.IsLabelValue(v); len(errs) != 0 {
-			logger.L().Warning("GetLabels - label override value is not valid, skipping", helpers.String("key", k), helpers.String("value", v))
-			delete(labels, k)
-		} else {
-			labels[k] = v
-		}
-		labels[k] = v
-	}
+	// Apply label overrides
+	for k, v := range watchedContainer.LabelOverrides {
+		if v == "" || len(content.IsLabelValue(v)) != 0 {
+			continue
+		}
+		labels[k] = v
+	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/shared_container_data.go` around lines 104 - 107, The current
loop applies watchedContainer.LabelOverrides directly to labels which allows an
invalid override to overwrite the base label and later be dropped by
sanitization; instead, validate each override entry before applying it: iterate
over watchedContainer.LabelOverrides, run the same per-entry
validation/sanitization logic used later (or a dedicated isValidLabelValue
check) and only set labels[k] = v when the override value is valid; this
preserves the original base label for keys whose overrides are invalid and
prevents losing keys during the later sanitization pass.
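
To illustrate the semantics the comment asks for, here is a small, self-contained test sketch: an invalid override is skipped and the base label survives. It uses apimachinery's validation.IsValidLabelValue as a stand-in for the content.IsLabelValue helper referenced above; the helper name and harness are hypothetical, not the repository's code.

```go
package objectcache_test

import (
	"testing"

	"k8s.io/apimachinery/pkg/util/validation"
)

// applyOverrides mirrors the validate-before-apply behaviour suggested above
// (hypothetical helper for illustration).
func applyOverrides(labels, overrides map[string]string) {
	for k, v := range overrides {
		if v == "" || len(validation.IsValidLabelValue(v)) != 0 {
			continue // empty or invalid override: keep the base label
		}
		labels[k] = v
	}
}

func TestInvalidOverrideKeepsBaseLabel(t *testing.T) {
	labels := map[string]string{"app": "nginx"}
	applyOverrides(labels, map[string]string{"app": "not a valid label value!"})
	if labels["app"] != "nginx" {
		t.Fatalf("expected base label to be preserved, got %q", labels["app"])
	}
}
```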

@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.157 | 0.161 | +3.1% |
| Peak CPU (cores) | 0.161 | 0.166 | +3.0% |
| Avg Memory (MiB) | 330.342 | 266.321 | -19.4% |
| Peak Memory (MiB) | 332.016 | 272.000 | -18.1% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119456 | 98.6% |
| network | 900 | 78000 | 98.9% |
| open | 36214 | 619955 | 94.5% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 974 | 1859 | 65.6% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 11 | 8 |
| dns_counter | 1441 | 1407 |
| exec_counter | 7209 | 7079 |
| network_counter | 94837 | 93050 |
| open_counter | 789222 | 776659 |
| syscall_counter | 3539 | 3462 |

matthyx merged commit 6f9697e into main on Apr 27, 2026
28 checks passed
matthyx deleted the learning-period-label branch on April 27, 2026 10:59
entlein added a commit to k8sstormcenter/node-agent that referenced this pull request May 2, 2026
…eCache (kubescape#788) (#36)

* Replace AP and NN cache with CP (kubescape#788)

* feat: foundation for ContainerProfileCache unification (steps 1, 2, 5-early)

Additive-only scaffolding for the upcoming migration from the two
workload-keyed caches (applicationprofilecache + networkneighborhoodcache)
to a single container-keyed ContainerProfileCache. No consumers are
rewired yet; all new code is unused.

- Storage client: GetContainerProfile(namespace, name) on ProfileClient
  interface + *Storage impl + mock.
- ContainerProfileCache interface + stub impl (methods return zero values;
  filled in by step 3/4).
- Prometheus metrics: nodeagent_user_profile_legacy_loads_total{kind,completeness}
  deprecation counter + reconciler SLO metrics (entries gauge, hit/miss
  counter, tick duration histogram, eviction counter) registered up front
  so later steps emit cleanly.

Plan artifacts in .omc/plans/; approved by ralplan Planner/Architect/Critic
consensus (v2, iteration 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCacheImpl + projection + shared-pointer fast-path (steps 3, 3.5, 4)

- CachedContainerProfile entry with Shared/RV/UserAP/UserNNRV fields
- Option A+ fast-path: shared storage pointer when no user overlay
- projection.go ports mergeContainers/mergeNetworkNeighbors from legacy caches
- partial-profile detection with dedup'd WARN log + completeness metric label
- Event-path delete with WithLock+ReleaseLock (Critic #2 lock-gap fix)
- Unit tests T4 (projection) + T6 (callstack parity) + fast-path identity

Step 5 (reconciler) and legacy deletion land in follow-ups.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCache reconciler with evict + refresh (step 5)

- tickLoop drives evict + refresh on one goroutine, refresh gated by atomic
- reconcileOnce evicts entries whose pod is gone or container stopped
- refreshAllEntries snapshots IDs then refreshes outside Range to avoid a
  SafeMap RLock/WLock deadlock (rebuildEntry calls Set)
- isContainerRunning(pod, entry, id): containerID primary, (Name, PodUID)
  fallback for pre-running init containers with empty ContainerID
- ctx.Err() honored inside Range callbacks for graceful shutdown
- T8 end-to-end test: user-AP mutation -> cached projection reflects change

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Consensus deltas applied: #1 (isContainerRunning signature), #3 (ctx.Err),
#4 (extend fast-skip to overlay RVs), #5 (T8 test), #7 (RPC-cost comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: profilehelper CP->legacy-shape shims + ContainerProfileCache aggregator wiring (step 6a)

Adds the ContainerProfileCache reader to the ObjectCache aggregator interface
so profilehelper can read CP and synthesize the legacy ApplicationProfileContainer /
NetworkNeighborhoodContainer shapes for callers that haven't migrated yet.

- pkg/objectcache/objectcache_interface.go: add ContainerProfileCache() to
  aggregator interface + mock (both AP/NN stay for 6a-6c transit)
- pkg/objectcache/v1/objectcache.go: add cp field, 5-arg NewObjectCache,
  ContainerProfileCache() accessor
- pkg/objectcache/v1/mock.go: RuleObjectCacheMock implements CP surface +
  Get/SetContainerProfile test helpers, Start stub
- pkg/rulemanager/profilehelper/profilehelper.go:
  - GetContainerProfile(objectCache, id) returns (*CP, syncChecksum, error)
    — the forward API
  - GetContainerApplicationProfile + GetContainerNetworkNeighborhood rewritten
    as ~30-LOC CP->legacy-shape shims (consensus delta #2). Marked deprecated;
    step 6c deletes them after CEL callers migrate.
- cmd/main.go: construct ContainerProfileCache alongside APC+NNC, pass to
  NewObjectCache; mock-path uses ContainerProfileCacheMock
- test call sites updated for 5-arg NewObjectCache

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: migrate 20 CEL call sites to GetContainerProfile (step 6b)

- applicationprofile/{capability,exec,http,open,syscall}.go: read fields
  directly off cp.Spec instead of the per-container AP shape
- networkneighborhood/network.go: read Ingress/Egress/LabelSelector off
  cp.Spec directly
- pkg/objectcache/v1/mock.go: extend RuleObjectCacheMock so
  SetApplicationProfile / SetNetworkNeighborhood also project into the
  unified ContainerProfile, and GetContainerProfile honours the shared
  container-ID registry (preserves "invalid container ID -> no profile"
  semantics for existing tests)
- profilehelper CP->legacy shims remain in place; step 6c removes them

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete profilehelper shims + migrate rule_manager + creator (step 6c)

- pkg/rulemanager/profilehelper/profilehelper.go: delete
  GetContainerApplicationProfile, GetContainerNetworkNeighborhood,
  GetApplicationProfile, GetNetworkNeighborhood, GetContainerFromApplicationProfile,
  GetContainerFromNetworkNeighborhood — CP-direct API is the only surface now
- pkg/rulemanager/rule_manager.go:
  - :202, :399 call profilehelper.GetContainerProfile instead of the shim
  - HasFinalApplicationProfile reads cp via ContainerProfileCache().GetContainerProfile;
    method name preserved (external API on RuleManagerInterface per plan v2 §2.4)
- pkg/rulemanager/rulepolicy.go: Validate takes *v1beta1.ContainerProfile
  and reads cp.Spec.PolicyByRuleId
- pkg/rulemanager/ruleadapters/creator.go: both AP + NN branches use
  ContainerProfileCache().GetContainerProfileState (unified state source)

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: ObjectCache aggregator CP-only + collapse 2 callbacks to 1 (step 6d)

- pkg/objectcache/objectcache_interface.go: drop ApplicationProfileCache()
  and NetworkNeighborhoodCache() methods — the aggregator is now
  {K8s, ContainerProfile, Dns}
- pkg/objectcache/v1/objectcache.go: 3-arg NewObjectCache(k, cp, dc)
- pkg/containerwatcher/v2/container_watcher_collection.go:63-64: two
  ContainerCallback subscriptions (APC + NNC) collapse to one (CPC)
- cmd/main.go: both branches (runtime-detection + mock) construct only
  ContainerProfileCache + Dns; legacy APC/NNC wiring removed with startup
  log: "ContainerProfileCache active; legacy AP/NN caches removed"
- test call sites updated for 3-arg NewObjectCache

Legacy packages still physically present (imports retained where still
referenced, e.g. callstackcache); step 8 deletes them entirely.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete legacy AP/NN cache packages + move callstackcache (step 8)

- git rm -r pkg/objectcache/applicationprofilecache/ (766 LOC)
- git rm -r pkg/objectcache/networkneighborhoodcache/ (758 LOC)
- git rm pkg/objectcache/applicationprofilecache_interface.go
- git rm pkg/objectcache/networkneighborhoodcache_interface.go
- mv pkg/objectcache/applicationprofilecache/callstackcache/
    -> pkg/objectcache/callstackcache/ (domain-agnostic, shared)
- Update 4 importers: containerprofilecache_interface.go, v1/mock.go,
  containerprofilecache.go, reconciler.go
- RuleObjectCacheMock drops ApplicationProfileCache()/NetworkNeighborhoodCache()
  accessor methods; SetApplicationProfile/SetNetworkNeighborhood remain as
  test helpers that project into the unified CP
- projection.go comments kept as historical source pointers — git history
  preserves the originals

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add T2 init-eviction, T5 packages-deleted, T7 lock-stress (step 9 partial)

- tests/containerprofilecache/packages_deleted_test.go: go/packages
  dep-graph assertion that legacy AP/NN paths are absent
- tests/containerprofilecache/lock_stress_test.go: 100-goroutine
  interleaved seed/read for same container IDs, 5s budget, race-safe
- tests/containerprofilecache/init_eviction_test.go: T2a (event-path
  evict) + T2b (reconciler-path evict for missed RemoveContainer)
- tests/containerprofilecache/helpers_test.go: shared test builders
- pkg/objectcache/containerprofilecache: export ReconcileOnce and
  SeedEntryForTest as out-of-package test hooks
- Makefile: check-legacy-packages target

T1 (golden-alert parity) and T3 (memory benchmark) are release-checklist
items per plan v2 §2.7 — the pre-migration baselines those tests require
can no longer be captured from this branch.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Phase 4 review P1 findings

1. Drop ReleaseLock on delete paths (containerprofilecache.go deleteContainer,
   reconciler.go reconcileOnce). Security review flagged a race where the
   deleted mutex could be orphaned while a concurrent GetLock creates a new
   one, breaking mutual exclusion for the same container ID. Trade-off:
   bounded memory growth of stale lock entries, proportional to container
   churn — acceptable for a node-agent lifetime.

2. Extract emitOverlayMetrics helper (metrics.go) to de-duplicate the
   ~20-line overlay metric/deprecation-warn block between buildEntry
   (addContainer path) and rebuildEntry (refresh path). Keeps the two in
   lockstep — code review flagged silent drift risk.

Not addressed in this commit (plan-accepted tradeoffs, follow-up work):
- Shared-pointer read-only invariant is convention-enforced, not type-
  enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is;
  downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface
  change, out of scope for this migration).

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: retry pending ContainerProfile GETs when CP appears after container-start

Component tests on PR kubescape#788 regressed with "All alerts: []" and 54+
"container X not found in container-profile cache" log entries. Root cause:
addContainer did a one-shot GetContainerProfile at EventTypeAddContainer
time and bailed on 404. But the CP is created asynchronously by
containerprofilemanager ~60s AFTER container-start, so the one-shot GET
almost always missed; the cache entry was never created; rule evaluation
short-circuited as "no profile".

Legacy caches hid this via a periodic ListProfiles scan that picked up
late-arriving profiles on the next tick. The point-lookup model dropped
that scan. This commit adds an equivalent: a pending-container retry path
in the reconciler.

Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures
  (container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting
  for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss
  (err or nil CP) it records a pending entry; the per-container goroutine
  exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container
  locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or
  whose status is not Running get dropped from pending so we don't retry
  forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler
  eviction counter gets a "pending_pod_stopped" reason.

Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP
  arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its
  pod is no longer present in k8s cache.

Full findings and resume doc at
  .omc/plans/containerprofile-cache-component-test-findings.md

Follow-up plan updated at
  .omc/plans/containerprofile-cache-followups.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: cache correctness — right CP slug, partial-on-restart, overlay refs, resurrection guard

PR kubescape#788 component tests continued failing after the pending-retry fix.
Deep investigation uncovered a fundamental slug misuse and three reviewer-
reported correctness gaps. All fixed here.

### Primary bug: wrong slug function

plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is
NOT — implementation at k8s-interface v0.0.206:
  func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
      u := uuid.New()
      hexSuffix := hex.EncodeToString(u[:])
      ...
  }
So containerprofilemanager.saveContainerProfile writes a *time-series* CP
per tick with a fresh UUID suffix, and the storage-side
ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated
profile at the STABLE slug (GetSlug(false), no UUID).

The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd
forever — even with the pending-retry in place. 13 component tests failed
with "All alerts: []" and 38+ "container X not found in container-profile
cache" log entries.

Switched addContainer to GetSlug(false). The refresh path inherits the
corrected name via entry.CPName.

### Reviewer #1: resurrection during refresh

refreshAllEntries snapshots entries without a lock. Between snapshot and
per-entry lock acquisition, deleteContainer or reconcile-evict may have
removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry)
would resurrect the dead container.

Added a load-under-lock guard at the top of refreshOneEntry.

### Reviewer #2: overlay handling regressions (two parts)

(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying
user-AP/NN. Containers with only a user-defined profile (no base CP yet)
got no entry. Restructured: fetch base CP and user-AP/NN independently;
populate if ANY source is available; synthesize an empty base CP when only
the overlay exists so projection has something to merge into.

(b) UserAPRef / UserNNRef were only recorded on successful fetch. A
transient 404 on add would permanently drop the overlay intent — the
refresh path had nothing to re-fetch. Now, when the label is set, the
refs are always recorded, using the label's name and the container's
namespace. Refresh retries the fetch each tick.

### Reviewer #3: partial profiles reused across container restart

tryPopulateEntry blindly used whatever CP existed at the stable slug,
including Partial completions from the previous container incarnation.
Legacy caches explicitly deleted Partial profiles on non-PreRunning
restart so rule evaluation fell through to "no profile" until Full
arrived.

Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we
treat the CP as absent → stay pending → retry each tick. When the CP
becomes Full (or the container stops), the pending state resolves.

The inverse is preserved: PreRunningContainer (agent-restart scenario)
accepts the Partial CP as-is so Test_19's "alert on partial profile"
semantics still work.

### Tests

Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: read workload-level AP/NN as primary data source

The storage server's consolidated ContainerProfile is not exposed via the
public k8s API — ContainerProfiles().Get(stableName) returns 404 even after
consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the
server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the
workload-name are queryable.

The component tests' WaitForApplicationProfileCompletion waits for the
workload-level AP/NN completion — that's what actually exists. The legacy
caches read these directly; we do the same now while the server-side
consolidated-CP plumbing is completed.

Changes:
- addContainer computes both cpName (per-container, forward-compat) and
  workloadName (per-workload, where AP/NN live) via GetSlug(false) and
  GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat),
  workload AP, and workload NN. Treats the workload AP/NN as the primary
  data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized
  when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too — non
  PreRunning containers ignore partial workload profiles until they
  become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN
  returns to the overlay name; existing TestOverlayPath_DeepCopies updated
  accordingly.

Forward-compat: once storage publishes a queryable consolidated CP at
cpName, its fetch becomes primary and the workload AP/NN path becomes a
fallback. No API changes are required to make that transition — just drop
the workload-level fetches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* debug: add tick-loop start log + change-detection log in reconciler

* fix: remove overly-aggressive pending GC that dropped entries before retry

CI run 24781030436 (commit ce32919) proved the reconciler IS ticking with
retryPendingEntries running, but the pending-GC pass in reconcileOnce was
dropping every pending entry on the first tick (pending_before=4 →
pending_after=0 at the FIRST tick, before retryPendingEntries could run).

Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also
checked isContainerRunning. On a busy node, the k8s pod cache and
ContainerStatuses lag the containerwatcher Add event by tens of seconds.
So "pod not found" or "container not yet Running" routinely returned true
for a container that had just been registered, causing GC to remove the
pending entry immediately. Retries then ran against an empty pending map
→ no promotions → alerts fired without profile → test failure.

Change: remove the pending GC entirely. Cleanup for terminated containers
flows through deleteContainer (EventTypeRemoveContainer) which clears
both entries and pending under the per-container lock. Memory growth is
bounded by the node's container churn (containers that never got a
profile during their lifetime).

Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by
TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: merge user-managed AP/NN and refresh workload-level sources

Two component-test regressions in PR kubescape#788:

Fix A (Test_12 / Test_13): the cache now reads the user-managed
ApplicationProfile and NetworkNeighborhood published at
"ug-<workloadName>" and projects them onto the base profile as a
dedicated ladder pass. Legacy caches did this via the
`kubescape.io/managed-by: User` annotation in handleUserManagedProfile;
we read them directly by their well-known name.

Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the
workload-level AP/NN (and user-managed / label-referenced overlays) on
every tick, not just the consolidated CP. This propagates the Status=
"ready" -> "completed" transition into the cached ProfileState, which
flips fail_on_profile from false to true at rule-eval time.

CachedContainerProfile gained WorkloadName plus WorkloadAPRV /
WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh
can fast-skip when every source's RV matches what's cached.
refreshOneEntry's rebuild now runs the same projection ladder as
tryPopulateEntry: base CP (or synthesized) → workload AP+NN →
user-managed (ug-) AP+NN → label-referenced user AP+NN.

Also:
- Tick-loop log only fires when entries OR pending count actually moved
  (previously fired whenever pending_before > 0, producing per-tick
  noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when
  the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B
  happy-path) and TestUserManagedProfileMerged (Fix A happy-path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: reconcileOnce no longer evicts on pod-cache lag, only on Terminated

CI run 24783250693 (commit 32a76c0) showed reconcileOnce evicting live
entries every tick: "entries_before:10, entries_after:0" within 5 seconds
of the agent starting. Same class of bug as the pending-GC fix (c45803f):
the k8s pod cache lags ContainerCallback Add events by tens of seconds,
and evicting on "GetPod returns nil OR !isContainerRunning" churned every
entry before any rules could evaluate.

Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient;
  deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If pod present and container has clearly Terminated: evict (preserves
  init-container eviction for Test_02 and T2 acceptance).
- If pod present and container in Waiting state: retain (new container
  creation, init-container pre-run both legitimately pass through Waiting).

New helper isContainerTerminated mirrors isContainerRunning but gates on
State.Terminated only; "not found in any status" treated as terminated.

Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop workload-level AP/NN fetch; CP-direct reading is authoritative

The workload-level AP/NN fetch added in d27be01 was a workaround for the
eviction/GC bugs (fixed in c45803f and d9ae0ac), not an architectural
need. The consolidated ContainerProfile IS queryable at the GetSlug(false)
name once storage aggregation runs; the cache simply needs to wait on the
pending-retry path.

This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed
  because user-managed profiles are authored independently and are not
  consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac), GC fix (c45803f), resurrection guard

Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the
  corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)

Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus
  (CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay
  instead of workloadAP + ug- overlay

This matches the migration plan's original intent: CP-direct, AP/NN only
as user overlays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: synthetic entry CPName override, PodUID backfill, phase-labeled reconciler histogram

Three review findings from the post-green audit.

### 1 (High) — synthetic entry stored the wrong CPName

When tryPopulateEntry synthesized a CP (consolidated CP still 404), the
synthetic name was workloadName or overlayName, and buildEntry persisted
entry.CPName = cp.Name (i.e. the synthetic name). refreshOneEntry then
queried the synthetic name instead of the real GetSlug(false) name; with
the stored RV also empty, the fast-skip's "absent matches empty" branch
kept the synthetic entry forever and the real consolidated CP could never
replace it.

Fix: after buildEntry, override entry.CPName = cpName (the real
GetSlug(false) result passed into tryPopulateEntry).

### 2 (Medium) — PodUID never backfilled

buildEntry only sets PodUID when the pod is already in k8sObjectCache at
add time. On busy nodes the pod cache lags, so addContainer often runs
before the pod lands and PodUID stays "". isContainerTerminated's
empty-ContainerID fallback matches against (ContainerName, PodUID);
when PodUID == "" and the status also has empty UID, the loop falls
through and returns true (treat as terminated) — evicting a still-live
init container. rebuildEntryFromSources copied prev.PodUID unchanged, so
the error never healed.

Fix: in rebuildEntryFromSources, if prev.PodUID is empty AND the pod is
now in the k8s cache, use the fresh UID.

### 3 (Low) — reconciler duration histogram mixed two phases

tickLoop (evict) and refreshAllEntries (refresh) both emitted
ReportContainerProfileReconcilerDuration into the same plain Histogram,
so nodeagent_containerprofile_reconciler_duration_seconds was a blend of
two very different workloads. Plan v2 §2.9 had specified a HistogramVec
with a "phase" label from the start.

Fix: MetricsManager.ReportContainerProfileReconcilerDuration(phase, d).
Prometheus implementation becomes a HistogramVec with phase label.
tickLoop emits phase="evict", refreshAllEntries emits phase="refresh".
MetricsMock/MetricsNoop signatures updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address all CodeRabbit review comments on PR kubescape#788

- ContainerProfileCacheMock.GetContainerProfileState returns synthetic
  error state instead of nil, matching the real impl's contract
- Remove IgnoreContainer check on EventTypeRemoveContainer to prevent
  stale entries when pod labels change after Add
- Deep-copy userAP/userNN in mergeApplicationProfile and
  mergeNetworkNeighborhood to eliminate aliasing of nested slices
  (Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.)
  into the cached ContainerProfile
- Fix Shared=true bug: buildEntry now takes userManagedApplied bool;
  fast-path only sets Shared=true when no overlay was applied at all,
  matching rebuildEntryFromSources logic in reconciler.go
- isContainerTerminated returns false when all status slices are empty
  (kubelet lag guard for brand-new pods)
- Fix misplaced doc comment above GetContainerProfile in storage layer
- Remove unused (*stubStorage).setCP test helper
- Lock stress test evict path now uses ContainerCallback(Remove) to
  exercise deleteContainer and per-container locking
- RuleObjectCacheMock stores per-container profiles in cpByContainerName;
  GetContainerProfile resolves via InstanceID.GetContainerName();
  GetContainerProfileState returns synthetic error state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: thread context.Context through ProfileClient and add per-call RPC budget

All five ProfileClient methods now accept ctx as their first argument so
callers can enforce cancellation and deadline propagation. Each storage RPC
in the reconciler is wrapped via refreshRPC(ctx, ...) which applies a
configurable per-call timeout (config.StorageRPCBudget, default 5 s) on top
of the parent context, preventing a slow API server from stalling an entire
reconciler burst. Tests cover the fast-skip, rebuild, and context-cancellation
mid-RPC paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: shared-pointer race-fuzz test + WarmContainerLocksForTest helper

Add TestSharedPointerReadersDoNotCorruptCache: 50 concurrent readers
traverse the returned *ContainerProfile slices while a writer goroutine
alternately calls RefreshAllEntriesForTest + SeedEntryForTest to keep
entry rebuilds active. Runs for 500ms under -race, proving the shared-
pointer fast-path never produces a concurrent read/write pair.

Also add TestSharedPointerFastPathPreservesPointerIdentity: after a
refresh against a storage object with a newer RV, the new entry's
Profile pointer IS the storage object (Shared=true, no DeepCopy), which
keeps the T3 memory budget intact.

Fix the pre-existing goradd/maps SafeMap initialisation race in
TestLockStressAddEvictInterleaved by pre-warming containerLocks via the
new WarmContainerLocksForTest helper (the previous pre-warm via
SeedEntryForTest only covered the entries SafeMap, not containerLocks).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: document SetApplicationProfile / SetNetworkNeighborhood field partition in mock

Add a block comment above RuleObjectCacheMock spelling out the non-overlapping
cp.Spec field partition between the two setters and the first-container-wins
rule for r.cp. Without this, future callers risk aliasing NN fields into an
AP-only profile or vice-versa.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: T8 integration mirror, mock setter contract doc, SeedEntryWithOverlayForTest

Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can
set UserAPRef / UserNNRef (which use the internal namespacedName type) without
requiring the type to be exported.

Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/
using only the public + test-helper API: seeds an entry with a stale UserAPRV,
mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the
projection with the new execs and drops the stale ones.

Add top-of-file block comment to RuleObjectCacheMock documenting the non-
overlapping AP-fields / NN-fields partition between SetApplicationProfile and
SetNetworkNeighborhood and the first-container-wins rule for r.cp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Phase 4 code-review findings

- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns
  exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1)
  with a blocking send so the signal is never silently dropped; bump
  rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside
  defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: preserve cached entry when overlay AP/NN fetch fails transiently

Before this fix, a refreshRPC timeout on any overlay GET (user-managed
ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the
overlay variable nil with the error silently discarded. The RV comparison
then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal,
and rebuilt the entry without the overlay — temporarily stripping
user-managed/user-defined profile data from the cache and altering
alerting until the next successful tick.

Fix: capture each overlay's fetch error and, when it is non-nil and the
entry already has a non-empty cached RV for that overlay, return early
and keep the existing entry unchanged. Legitimate deletions (nil with
err==nil) still propagate correctly. Mirrors the existing CP error-
preservation logic at refreshOneEntry:272-288.

Add TestRefreshPreservesEntryOnTransientOverlayError covering all four
overlay fetch paths (user-managed AP, user-managed NN, user-defined AP,
user-defined NN) via a new overlayErrorClient stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeRabbit review issues on PR kubescape#788

- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match
  the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so
  they respect the per-call SLO (default 5s); prevents a hung GET from
  blocking the entire reconciler tick loop when called from
  retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap
  before concurrent test phases, preventing the goradd/maps
  nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and
  poll for async deleteContainer goroutines to drain before asserting
  goroutine count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: distinct RNG seed per stress-test worker

Pass worker index into each goroutine closure and mix it into the
rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that
100 concurrently-launched goroutines don't all receive the same
nanosecond timestamp and end up with identical add/evict sequences.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move test helpers out of production source into testing.go

The six *ForTest / ReconcileOnce helpers were previously mixed into
containerprofilecache.go alongside production logic. Move them to a
dedicated testing.go file in the same package.

export_test.go is the idiomatic alternative but is compiled only when
running tests in the same directory; test packages in other directories
(tests/containerprofilecache/) import the non-test version of the
package and never see _test.go contents. A plain testing.go is the
correct pattern here — it signals "test support" by name and groups all
scaffolding in one place, while remaining importable by any test binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move integration tests into package dir; use export_test.go

export_test.go (package containerprofilecache) is only compiled during
`go test` so test helpers never enter the production binary. This only
works when callers are in the same directory; the prior layout put tests
in tests/containerprofilecache/ (a separate package), forcing helpers
into a plain testing.go that shipped in the binary.

Moving the six test files into pkg/objectcache/containerprofilecache/
as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: nil out overlay pointers when k8s client returns zero-value on 404

The Kubernetes generated client (gentype.Client.Get) pre-allocates a
zero-value struct before the HTTP call and returns it as the result even
on error (e.g. 404 not-found). In refreshOneEntry, the four overlay
fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only
the "transient error with cached RV → keep old entry" branch; the
"first-time 404, no cached RV" branch fell through with a non-nil
empty-ObjectMeta struct still in the pointer, which reached
rebuildEntryFromSources → emitOverlayMetrics and logged spurious
"user-authored legacy profile merged" warnings with empty
namespace/name/resourceVersion fields.

Add an explicit nil-out after each non-returning error branch, mirroring
the pattern already used in tryPopulateEntry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: extract client CA file from kubelet config YAML and enhance service file handling (kubescape#791)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* add learning period label to TS CPs (kubescape#797)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* perf: switch to kubescape/syft v1.32.0-ks.2 + disable file catalogers (kubescape#798)

* perf: disable file-digest/metadata/executable catalogers

These three catalogers iterate every file in the scan tree and dominate
transient allocation, but their outputs are not consumed by the OOM-relevant
SBOM path. Disabling them saves ~200 MB peak RSS on gitlab-ee (main) and
stacks with upstream selective-indexing + binary-prefilter improvements to
~1.12 GB total (vs 1.62 GB baseline, fits 1.5 GB cgroup).
Signed-off-by: Ben <ben@armosec.io>

* deps: switch to kubescape/syft v1.32.0-ks.2 for memory reduction

Routes anchore/syft imports to the kubescape fork via replace directive.
The fork carries selective indexing + binary-cataloger pre-filtering on
top of v1.32.0; combined with the file-cataloger disable in the parent
commit, this reduces gitlab-ee scan peak RSS from 1,621 MB to 1,123 MB.

Refs: NAUT-1283
Signed-off-by: Ben <ben@armosec.io>

* fix: check dep.Replace for actual fork version; add cataloger removals to sidecar

- packageVersion() now returns dep.Replace.Version when present so the fork
  tag (v1.32.0-ks.2) propagates to runtime metadata and version-gating logic
- pkg/sbomscanner/v1/server.go: add the same WithCatalogerSelection/WithRemovals
  as sbom_manager.go so both SBOM paths drop file-digest/metadata/executable
  catalogers and stay in consistent memory behaviour
Signed-off-by: Ben <ben@armosec.io>

* fix: keep syft tool version at required version

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Ben Hirschberg <59160382+slashben@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Entlein <eineintlein@gmail.com>
entlein added a commit to k8sstormcenter/node-agent that referenced this pull request May 16, 2026
* Replace AP and NN cache with CP (#788)

- Shared-pointer read-only invariant is convention-enforced, not type-
  enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is;
  downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface
  change, out of scope for this migration).

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: retry pending ContainerProfile GETs when CP appears after container-start

Component tests on PR #788 regressed with "All alerts: []" and 54+
"container X not found in container-profile cache" log entries. Root cause:
addContainer did a one-shot GetContainerProfile at EventTypeAddContainer
time and bailed on 404. But the CP is created asynchronously by
containerprofilemanager ~60s AFTER container-start, so the one-shot GET
almost always missed; the cache entry was never created; rule evaluation
short-circuited as "no profile".

Legacy caches hid this via a periodic ListProfiles scan that picked up
late-arriving profiles on the next tick. The point-lookup model dropped
that scan. This commit adds an equivalent: a pending-container retry path
in the reconciler.
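
Shape of the retry path, sketched below with hypothetical field names and a plain map standing in for the real SafeMap plus per-container locks — not the actual implementation:

```go
package sketch

import (
	"context"
	"sync"
)

// pendingContainer captures just enough to retry the initial GET
// (hypothetical shape; the real struct also carries sharedData).
type pendingContainer struct {
	containerID string
	cpName      string
}

type reconciler struct {
	mu      sync.Mutex
	pending map[string]pendingContainer
}

// retryPendingEntries re-issues the profile GET for every container still
// waiting for its ContainerProfile; on success the entry is promoted and
// dropped from pending, otherwise it stays for the next tick.
func (r *reconciler) retryPendingEntries(ctx context.Context, populate func(context.Context, pendingContainer) error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, p := range r.pending {
		if err := populate(ctx, p); err != nil {
			continue // profile not in storage yet; keep pending
		}
		delete(r.pending, id) // promoted into the live cache
	}
}
```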

Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures
  (container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting
  for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss
  (err or nil CP) it records a pending entry; the per-container goroutine
  exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container
  locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or
  whose status is not Running get dropped from pending so we don't retry
  forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler
  eviction counter gets a "pending_pod_stopped" reason.

Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP
  arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its
  pod is no longer present in k8s cache.

Full findings and resume doc at
  .omc/plans/containerprofile-cache-component-test-findings.md

Follow-up plan updated at
  .omc/plans/containerprofile-cache-followups.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: cache correctness — right CP slug, partial-on-restart, overlay refs, resurrection guard

PR #788 component tests continued failing after the pending-retry fix.
Deep investigation uncovered a fundamental slug misuse and three reviewer-
reported correctness gaps. All fixed here.

### Primary bug: wrong slug function

plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is
NOT — implementation at k8s-interface v0.0.206:
  func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
      u := uuid.New()
      hexSuffix := hex.EncodeToString(u[:])
      ...
  }
So containerprofilemanager.saveContainerProfile writes a *time-series* CP
per tick with a fresh UUID suffix, and the storage-side
ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated
profile at the STABLE slug (GetSlug(false), no UUID).

The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd
forever — even with the pending-retry in place. 13 component tests failed
with "All alerts: []" and 38+ "container X not found in container-profile
cache" log entries.

Switched addContainer to GetSlug(false). The refresh path inherits the
corrected name via entry.CPName.
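
The naming difference, sketched against a pared-down stand-in for the k8s-interface InstanceID (only the two slug methods are modelled):

```go
package sketch

// instanceID is a stand-in for the k8s-interface InstanceID; only the two
// slug methods relevant here are modelled.
type instanceID interface {
	// GetSlug is deterministic for a given workload/container identity.
	GetSlug(noContainer bool) (string, error)
	// GetOneTimeSlug appends a fresh UUID suffix on every call, so it can
	// never serve as a lookup key for a previously written object.
	GetOneTimeSlug(noContainer bool) (string, error)
}

// profileNameForLookup returns the stable name the consolidated
// ContainerProfile is written under: GetSlug, never GetOneTimeSlug.
func profileNameForLookup(id instanceID) (string, error) {
	return id.GetSlug(false)
}
```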

### Reviewer #1: resurrection during refresh

refreshAllEntries snapshots entries without a lock. Between snapshot and
per-entry lock acquisition, deleteContainer or reconcile-evict may have
removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry)
would resurrect the dead container.

Added a load-under-lock guard at the top of refreshOneEntry.

### Reviewer #2: overlay handling regressions (two parts)

(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying
user-AP/NN. Containers with only a user-defined profile (no base CP yet)
got no entry. Restructured: fetch base CP and user-AP/NN independently;
populate if ANY source is available; synthesize an empty base CP when only
the overlay exists so projection has something to merge into.

(b) UserAPRef / UserNNRef were only recorded on successful fetch. A
transient 404 on add would permanently drop the overlay intent — the
refresh path had nothing to re-fetch. Now, when the label is set, the
refs are always recorded, using the label's name and the container's
namespace. Refresh retries the fetch each tick.

### Reviewer #3: partial profiles reused across container restart

tryPopulateEntry blindly used whatever CP existed at the stable slug,
including Partial completions from the previous container incarnation.
Legacy caches explicitly deleted Partial profiles on non-PreRunning
restart so rule evaluation fell through to "no profile" until Full
arrived.

Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we
treat the CP as absent → stay pending → retry each tick. When the CP
becomes Full (or the container stops), the pending state resolves.

The inverse is preserved: PreRunningContainer (agent-restart scenario)
accepts the Partial CP as-is so Test_19's "alert on partial profile"
semantics still work.
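
The gate, sketched with hypothetical completion constants and field names:

```go
package sketch

const completionPartial = "partial" // hypothetical constant name

type containerProfile struct {
	Completion string
}

type sharedContainerData struct {
	PreRunningContainer bool
}

// acceptProfile reports whether a fetched ContainerProfile may populate the
// cache entry: a Partial profile is only accepted for pre-running containers
// (agent restart); otherwise the container stays pending until Full arrives.
func acceptProfile(cp *containerProfile, shared *sharedContainerData) bool {
	if cp == nil {
		return false
	}
	if cp.Completion == completionPartial && !shared.PreRunningContainer {
		return false // treat as absent; retry next tick
	}
	return true
}
```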

### Tests

Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: read workload-level AP/NN as primary data source

The storage server's consolidated ContainerProfile is not exposed via the
public k8s API — ContainerProfiles().Get(stableName) returns 404 even after
consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the
server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the
workload-name are queryable.

The component tests' WaitForApplicationProfileCompletion waits for the
workload-level AP/NN completion — that's what actually exists. The legacy
caches read these directly; we do the same now while the server-side
consolidated-CP plumbing is completed.

Changes:
- addContainer computes both cpName (per-container, forward-compat) and
  workloadName (per-workload, where AP/NN live) via GetSlug(false) and
  GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat),
  workload AP, and workload NN. Treats the workload AP/NN as the primary
  data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized
  when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too —
  non-PreRunning containers ignore partial workload profiles until they
  become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN
  returns to the overlay name; existing TestOverlayPath_DeepCopies updated
  accordingly.

Forward-compat: once storage publishes a queryable consolidated CP at
cpName, its fetch becomes primary and the workload AP/NN path becomes a
fallback. No API changes are required to make that transition — just drop
the workload-level fetches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* debug: add tick-loop start log + change-detection log in reconciler

* fix: remove overly-aggressive pending GC that dropped entries before retry

CI run 24781030436 (commit ce329196) proved the reconciler IS ticking with
retryPendingEntries running, but the pending-GC pass in reconcileOnce was
dropping every pending entry on the first tick (pending_before=4 →
pending_after=0 at the FIRST tick, before retryPendingEntries could run).

Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also
checked isContainerRunning. On a busy node, the k8s pod cache and
ContainerStatuses lag the containerwatcher Add event by tens of seconds.
So "pod not found" or "container not yet Running" routinely returned true
for a container that had just been registered, causing GC to remove the
pending entry immediately. Retries then ran against an empty pending map
→ no promotions → alerts fired without profile → test failure.

Change: remove the pending GC entirely. Cleanup for terminated containers
flows through deleteContainer (EventTypeRemoveContainer) which clears
both entries and pending under the per-container lock. Memory growth is
bounded by the node's container churn (containers that never got a
profile during their lifetime).

Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by
TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: merge user-managed AP/NN and refresh workload-level sources

Two component-test regressions in PR #788:

Fix A (Test_12 / Test_13): the cache now reads the user-managed
ApplicationProfile and NetworkNeighborhood published at
"ug-<workloadName>" and projects them onto the base profile as a
dedicated ladder pass. Legacy caches did this via the
`kubescape.io/managed-by: User` annotation in handleUserManagedProfile;
we read them directly by their well-known name.

Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the
workload-level AP/NN (and user-managed / label-referenced overlays) on
every tick, not just the consolidated CP. This propagates the Status=
"ready" -> "completed" transition into the cached ProfileState, which
flips fail_on_profile from false to true at rule-eval time.

CachedContainerProfile gained WorkloadName plus WorkloadAPRV /
WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh
can fast-skip when every source's RV matches what's cached.
refreshOneEntry's rebuild now runs the same projection ladder as
tryPopulateEntry: base CP (or synthesized) → workload AP+NN →
user-managed (ug-) AP+NN → label-referenced user AP+NN.

Also:
- Tick-loop log only fires when entries OR pending count actually moved
  (previously fired whenever pending_before > 0, producing per-tick
  noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when
  the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B
  happy-path) and TestUserManagedProfileMerged (Fix A happy-path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: reconcileOnce no longer evicts on pod-cache lag, only on Terminated

CI run 24783250693 (commit 32a76c03) showed reconcileOnce evicting live
entries every tick: "entries_before:10, entries_after:0" within 5 seconds
of the agent starting. Same class of bug as the pending-GC fix (c45803f5):
the k8s pod cache lags ContainerCallback Add events by tens of seconds,
and evicting on "GetPod returns nil OR !isContainerRunning" churned every
entry before any rules could evaluate.

Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient;
  deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If pod present and container is clearly Terminated: evict (preserves
  init-container eviction for Test_02 and T2 acceptance).
- If pod present and container is in Waiting state: retain (new container
  creation, init-container pre-run both legitimately pass through Waiting).

New helper isContainerTerminated mirrors isContainerRunning but gates on
State.Terminated only; "not found in any status" treated as terminated.
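
The new gate, sketched over the core/v1 pod status types; the real helper also falls back to matching on (ContainerName, PodUID) when the ContainerID is empty:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// isContainerTerminated reports whether the container has clearly terminated.
// "Not found in any status slice" is treated as terminated; Waiting and
// Running states keep the cache entry alive.
func isContainerTerminated(pod *corev1.Pod, containerID string) bool {
	statuses := append(append([]corev1.ContainerStatus{},
		pod.Status.ContainerStatuses...),
		pod.Status.InitContainerStatuses...)
	for _, s := range statuses {
		if s.ContainerID != containerID {
			continue
		}
		return s.State.Terminated != nil
	}
	return true // unknown container: treat as terminated
}
```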

Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop workload-level AP/NN fetch; CP-direct reading is authoritative

The workload-level AP/NN fetch added in d27be013 was a workaround for the
eviction/GC bugs (fixed in c45803f5 and d9ae0ac6), not an architectural
need. The consolidated ContainerProfile IS queryable at the GetSlug(false)
name once storage aggregation runs; the cache simply needs to wait on the
pending-retry path.

This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed
  because user-managed profiles are authored independently and are not
  consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac6), GC fix (c45803f5), resurrection guard

Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the
  corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)

Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus
  (CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay
  instead of workloadAP + ug- overlay

This matches the migration plan's original intent: CP-direct, AP/NN only
as user overlays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: synthetic entry CPName override, PodUID backfill, phase-labeled reconciler histogram

Three review findings from the post-green audit.

### 1 (High) — synthetic entry stored the wrong CPName

When tryPopulateEntry synthesized a CP (consolidated CP still 404), the
synthetic name was workloadName or overlayName, and buildEntry persisted
entry.CPName = cp.Name (i.e. the synthetic name). refreshOneEntry then
queried the synthetic name instead of the real GetSlug(false) name; with
the stored RV also empty, the fast-skip's "absent matches empty" branch
kept the synthetic entry forever and the real consolidated CP could never
replace it.

Fix: after buildEntry, override entry.CPName = cpName (the real
GetSlug(false) result passed into tryPopulateEntry).

### 2 (Medium) — PodUID never backfilled

buildEntry only sets PodUID when the pod is already in k8sObjectCache at
add time. On busy nodes the pod cache lags, so addContainer often runs
before the pod lands and PodUID stays "". isContainerTerminated's
empty-ContainerID fallback matches against (ContainerName, PodUID);
when PodUID == "" and the status also has empty UID, the loop falls
through and returns true (treat as terminated) — evicting a still-live
init container. rebuildEntryFromSources copied prev.PodUID unchanged, so
the error never healed.

Fix: in rebuildEntryFromSources, if prev.PodUID is empty AND the pod is
now in the k8s cache, use the fresh UID.

### 3 (Low) — reconciler duration histogram mixed two phases

tickLoop (evict) and refreshAllEntries (refresh) both emitted
ReportContainerProfileReconcilerDuration into the same plain Histogram,
so nodeagent_containerprofile_reconciler_duration_seconds was a blend of
two very different workloads. Plan v2 §2.9 had specified a HistogramVec
with a "phase" label from the start.

Fix: MetricsManager.ReportContainerProfileReconcilerDuration(phase, d).
Prometheus implementation becomes a HistogramVec with phase label.
tickLoop emits phase="evict", refreshAllEntries emits phase="refresh".
MetricsMock/MetricsNoop signatures updated.
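
The phase-labelled metric, sketched with prometheus/client_golang (registration and exact buckets elided; a HistogramVec keyed on "phase" replaces the plain Histogram):

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// reconcilerDuration splits evict and refresh timings into separate series
// via the "phase" label instead of blending them in one plain Histogram.
// MustRegister / promauto wiring is omitted in this sketch.
var reconcilerDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "nodeagent_containerprofile_reconciler_duration_seconds",
		Help:    "Duration of one reconciler pass, split by phase.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"phase"},
)

// ReportContainerProfileReconcilerDuration records one pass for a phase
// ("evict" or "refresh").
func ReportContainerProfileReconcilerDuration(phase string, d time.Duration) {
	reconcilerDuration.WithLabelValues(phase).Observe(d.Seconds())
}
```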

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address all CodeRabbit review comments on PR #788

- ContainerProfileCacheMock.GetContainerProfileState returns synthetic
  error state instead of nil, matching the real impl's contract
- Remove IgnoreContainer check on EventTypeRemoveContainer to prevent
  stale entries when pod labels change after Add
- Deep-copy userAP/userNN in mergeApplicationProfile and
  mergeNetworkNeighborhood to eliminate aliasing of nested slices
  (Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.)
  into the cached ContainerProfile
- Fix Shared=true bug: buildEntry now takes userManagedApplied bool;
  fast-path only sets Shared=true when no overlay was applied at all,
  matching rebuildEntryFromSources logic in reconciler.go
- isContainerTerminated returns false when all status slices are empty
  (kubelet lag guard for brand-new pods)
- Fix misplaced doc comment above GetContainerProfile in storage layer
- Remove unused (*stubStorage).setCP test helper
- Lock stress test evict path now uses ContainerCallback(Remove) to
  exercise deleteContainer and per-container locking
- RuleObjectCacheMock stores per-container profiles in cpByContainerName;
  GetContainerProfile resolves via InstanceID.GetContainerName();
  GetContainerProfileState returns synthetic error state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: thread context.Context through ProfileClient and add per-call RPC budget

All five ProfileClient methods now accept ctx as their first argument so
callers can enforce cancellation and deadline propagation. Each storage RPC
in the reconciler is wrapped via refreshRPC(ctx, ...) which applies a
configurable per-call timeout (config.StorageRPCBudget, default 5 s) on top
of the parent context, preventing a slow API server from stalling an entire
reconciler burst. Tests cover the fast-skip, rebuild, and context-cancellation
mid-RPC paths.
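
The budget wrapper is the standard context.WithTimeout pattern; the generic signature below is assumed, not the actual one:

```go
package sketch

import (
	"context"
	"time"
)

// refreshRPC runs one storage call under its own deadline so a slow API
// server cannot stall the whole reconciler burst; cancellation of the
// parent ctx still propagates.
func refreshRPC[T any](ctx context.Context, budget time.Duration, call func(context.Context) (T, error)) (T, error) {
	rpcCtx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	return call(rpcCtx)
}
```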

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: shared-pointer race-fuzz test + WarmContainerLocksForTest helper

Add TestSharedPointerReadersDoNotCorruptCache: 50 concurrent readers
traverse the returned *ContainerProfile slices while a writer goroutine
alternately calls RefreshAllEntriesForTest + SeedEntryForTest to keep
entry rebuilds active. Runs for 500ms under -race, proving the shared-
pointer fast-path never produces a concurrent read/write pair.

Also add TestSharedPointerFastPathPreservesPointerIdentity: after a
refresh against a storage object with a newer RV, the new entry's
Profile pointer IS the storage object (Shared=true, no DeepCopy), which
keeps the T3 memory budget intact.

Fix the pre-existing goradd/maps SafeMap initialisation race in
TestLockStressAddEvictInterleaved by pre-warming containerLocks via the
new WarmContainerLocksForTest helper (the previous pre-warm via
SeedEntryForTest only covered the entries SafeMap, not containerLocks).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: document SetApplicationProfile / SetNetworkNeighborhood field partition in mock

Add a block comment above RuleObjectCacheMock spelling out the non-overlapping
cp.Spec field partition between the two setters and the first-container-wins
rule for r.cp. Without this, future callers risk aliasing NN fields into an
AP-only profile or vice-versa.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: T8 integration mirror, mock setter contract doc, SeedEntryWithOverlayForTest

Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can
set UserAPRef / UserNNRef (which use the internal namespacedName type) without
requiring the type to be exported.

Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/
using only the public + test-helper API: seeds an entry with a stale UserAPRV,
mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the
projection with the new execs and drops the stale ones.

Add top-of-file block comment to RuleObjectCacheMock documenting the non-
overlapping AP-fields / NN-fields partition between SetApplicationProfile and
SetNetworkNeighborhood and the first-container-wins rule for r.cp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Phase 4 code-review findings

- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns
  exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1)
  with a blocking send so the signal is never silently dropped; bump
  rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside
  defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: preserve cached entry when overlay AP/NN fetch fails transiently

Before this fix, a refreshRPC timeout on any overlay GET (user-managed
ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the
overlay variable nil with the error silently discarded. The RV comparison
then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal,
and rebuilt the entry without the overlay — temporarily stripping
user-managed/user-defined profile data from the cache and altering
alerting until the next successful tick.

Fix: capture each overlay's fetch error and, when it is non-nil and the
entry already has a non-empty cached RV for that overlay, return early
and keep the existing entry unchanged. Legitimate deletions (nil with
err==nil) still propagate correctly. Mirrors the existing CP error-
preservation logic at refreshOneEntry:272-288.
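
The guard, sketched with hypothetical names — a non-nil fetch error plus a non-empty cached RV keeps the old entry, while a nil overlay with err == nil still counts as a deletion:

```go
package sketch

// keepCachedEntry reports whether a transient overlay fetch failure should
// preserve the existing cache entry unchanged rather than rebuild without
// the overlay.
func keepCachedEntry(fetchErr error, cachedRV string) bool {
	return fetchErr != nil && cachedRV != ""
}
```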

Add TestRefreshPreservesEntryOnTransientOverlayError covering all four
overlay fetch paths (user-managed AP, user-managed NN, user-defined AP,
user-defined NN) via a new overlayErrorClient stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeRabbit review issues on PR #788

- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match
  the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so
  they respect the per-call SLO (default 5s); prevents a hung GET from
  blocking the entire reconciler tick loop when called from
  retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap
  before concurrent test phases, preventing the goradd/maps
  nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and
  poll for async deleteContainer goroutines to drain before asserting
  goroutine count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: distinct RNG seed per stress-test worker

Pass worker index into each goroutine closure and mix it into the
rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that
100 concurrently-launched goroutines don't all receive the same
nanosecond timestamp and end up with identical add/evict sequences.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move test helpers out of production source into testing.go

The six *ForTest / ReconcileOnce helpers were previously mixed into
containerprofilecache.go alongside production logic. Move them to a
dedicated testing.go file in the same package.

export_test.go is the idiomatic alternative but is compiled only when
running tests in the same directory; test packages in other directories
(tests/containerprofilecache/) import the non-test version of the
package and never see _test.go contents. A plain testing.go is the
correct pattern here — it signals "test support" by name and groups all
scaffolding in one place, while remaining importable by any test binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move integration tests into package dir; use export_test.go

export_test.go (package containerprofilecache) is only compiled during
`go test` so test helpers never enter the production binary. This only
works when callers are in the same directory; the prior layout put tests
in tests/containerprofilecache/ (a separate package), forcing helpers
into a plain testing.go that shipped in the binary.

Moving the six test files into pkg/objectcache/containerprofilecache/
as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: nil out overlay pointers when k8s client returns zero-value on 404

The Kubernetes generated client (gentype.Client.Get) pre-allocates a
zero-value struct before the HTTP call and returns it as the result even
on error (e.g. 404 not-found). In refreshOneEntry, the four overlay
fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only
the "transient error with cached RV → keep old entry" branch; the
"first-time 404, no cached RV" branch fell through with a non-nil
empty-ObjectMeta struct still in the pointer, which reached
rebuildEntryFromSources → emitOverlayMetrics and logged spurious
"user-authored legacy profile merged" warnings with empty
namespace/name/resourceVersion fields.

Add an explicit nil-out after each non-returning error branch, mirroring
the pattern already used in tryPopulateEntry.
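
The nil-out, sketched with a stand-in profile type:

```go
package sketch

type applicationProfile struct{ Name string } // stand-in for the v1beta1 type

// getOverlay wraps a generated-client Get: on any error (including 404) the
// returned pointer is dropped, so a pre-allocated zero-value struct never
// reaches the projection or the overlay metrics.
func getOverlay(get func() (*applicationProfile, error)) *applicationProfile {
	ap, err := get()
	if err != nil {
		return nil
	}
	return ap
}
```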

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: extract client CA file from kubelet config YAML and enhance service file handling (#791)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* add learning period label to TS CPs (#797)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* perf: switch to kubescape/syft v1.32.0-ks.2 + disable file catalogers (#798)

* perf: disable file-digest/metadata/executable catalogers

These three catalogers iterate every file in the scan tree and dominate
transient allocation, but their outputs are not consumed by the OOM-relevant
SBOM path. Disabling them saves ~200 MB peak RSS on gitlab-ee (main) and
stacks with upstream selective-indexing + binary-prefilter improvements to
~1.12 GB total (vs 1.62 GB baseline, fits 1.5 GB cgroup).

Signed-off-by: Ben <ben@armosec.io>

* deps: switch to kubescape/syft v1.32.0-ks.2 for memory reduction

Routes anchore/syft imports to the kubescape fork via replace directive.
The fork carries selective indexing + binary-cataloger pre-filtering on
top of v1.32.0; combined with the file-cataloger disable in the parent
commit, this reduces gitlab-ee scan peak RSS from 1,621 MB to 1,123 MB.

Refs: NAUT-1283
Signed-off-by: Ben <ben@armosec.io>

* fix: check dep.Replace for actual fork version; add cataloger removals to sidecar

- packageVersion() now returns dep.Replace.Version when present so the fork
  tag (v1.32.0-ks.2) propagates to runtime metadata and version-gating logic
- pkg/sbomscanner/v1/server.go: add the same WithCatalogerSelection/WithRemovals
  as sbom_manager.go so both SBOM paths drop file-digest/metadata/executable
  catalogers and keep their memory behaviour consistent

Signed-off-by: Ben <ben@armosec.io>

* fix: keep syft tool version at required version

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: record exec path symmetric with rule-side resolver (#800)

The application-profile recorder in ReportFileExec derives the path it
stores into the AP from `args[0]`, while the rule-side resolver
(`parse.get_exec_path` in pkg/rulemanager/cel/libraries/parse/parse.go)
falls back to `comm` when `args[0]` is empty. The asymmetry causes
"Unexpected process launched" (R0001) and other ap.was_executed-based
rules to fire on processes that are present in the application profile.

Trigger: fexecve / execveat with AT_EMPTY_PATH. modern libpam (>= 1.5)
invokes its helpers (unix_chkpwd, unix_update, ...) via fexecve to avoid
TOCTOU on the helper path. The kernel implements fexecve as
execveat(fd, "", argv, envp, AT_EMPTY_PATH) — pathname is empty by
design.

Inspektor Gadget's trace_exec puts the syscall pathname into args[0]
and reads argv from index 1 (gadgets/trace_exec/program.bpf.c:146-153).
For fexecve/execveat empty-pathname, this produces args = ["", argv[1]]
in the agent's exec event. The recorder then sets path = args[0] = ""
and the AP entry is unreachable to ap.was_executed("unix_chkpwd")
(which the rule-side resolver computes via the empty-args[0] -> comm
fallback).

Fix: derive the recorder's path the same way the rule-side does — prefer
exepath (the kernel-authoritative exe_file path, immune to argv[0]
spoofing too), then argv[0] when non-empty, then comm.
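
The precedence, sketched (signature assumed):

```go
package sketch

// resolveExecPath mirrors the rule-side resolver: prefer the
// kernel-authoritative exepath, then argv[0] when non-empty, then comm.
func resolveExecPath(exepath string, args []string, comm string) string {
	if exepath != "" {
		return exepath
	}
	if len(args) > 0 && args[0] != "" {
		return args[0]
	}
	return comm
}
```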

Concrete impact in production: 408 of 1976 Bonial I013 incidents on
production scoring-api APs are exactly this case — cron user-context
setup invokes pam_unix -> unix_chkpwd via fexecve, AP records path: ""
with args ["", "root"], rule looks up "unix_chkpwd" via comm fallback,
no match.

The new resolveExecPath helper is also more defensive against argv[0]
spoofing in general — exepath comes from task->mm->exe_file in the BPF
side and cannot be controlled by user code.

Verified locally on a kind cluster with kubescape v0.3.94: a pod that
loops execve (control) and execveat-AT_EMPTY_PATH (bug) reproduces the
production-shape AP entry on the unfixed code path.

Signed-off-by: Ben <ben@armosec.io>

* implement Rule-Aware Profile Projection (#799)

* implement Rule-Aware Profile Projection

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* fix: address CodeRabbit review comments (batch 2)

- profiledata.go: reset receiver in UnmarshalJSON/YAML for ProfileDataRequired
  and FieldRequirement; add PatternObject unknown-field rejection
- function_cache.go: include SyncChecksum in cache key to invalidate on
  profile content changes (not only spec changes); iterate all extraKeyFn callbacks
- rule_manager.go: gate strict-validation rejection behind StrictValidation flag;
  coalesce specNotify bursts before recompile
- exec.go: document wasExecutedWithArgs v1 limitation for rule authors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* docs: document wasExecutedWithArgs v1 path-only matching limitation

Add a CEL Helper Limitations table to the Detection Rules section noting
that wasExecutedWithArgs currently performs path-only matching (equivalent
to wasExecuted) and does not validate the argument list in v1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass-through all profile data when no rules declare profileDataRequired

When InUse=false (no rule declared a requirement for a field), projectField
was returning an empty ProjectedField{}, causing CEL helpers to see no profile
data and fire false-positive alerts for every exec/open/capability/etc.

Fix: treat InUse=false as All=true (pass-through), so existing rules that omit
profileDataRequired continue working with the full raw profile.
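
The producer-side gate, sketched with hypothetical types; the rule-declared branch is reduced to a plain set intersection here, with Patterns routing elided:

```go
package sketch

type fieldRequirement struct {
	InUse  bool
	Values []string
}

type projectedField struct {
	All    bool
	Values []string
}

// projectField keeps the full raw surface when no rule declared a
// requirement for it (InUse=false) instead of returning an empty field.
func projectField(raw []string, req fieldRequirement) projectedField {
	if !req.InUse {
		return projectedField{All: true, Values: raw} // pass-through
	}
	return projectedField{Values: filterToDeclared(raw, req.Values)}
}

func filterToDeclared(raw, declared []string) []string {
	keep := make(map[string]struct{}, len(declared))
	for _, d := range declared {
		keep[d] = struct{}{}
	}
	out := make([]string, 0, len(raw))
	for _, r := range raw {
		if _, ok := keep[r]; ok {
			out = append(out, r)
		}
	}
	return out
}
```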

Update TestApply_NilSpec, TestApply_DynamicNotRetainedWhenNotInUse (renamed),
and TestSpecChange_TriggersReprojection to reflect the new pass-through semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update stale comments and strengthen reprojection test

- projection_apply.go: update Apply doc-comment and dynamic-patterns
  comment to reflect pass-through semantics (InUse=false retains all data)
- reconciler_test.go: add SpecHash assertions to TestSpecChange to prove
  reprojection actually occurred rather than testing pass-through twice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: improve error logging for user-managed resource fetch failures

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: add profileDataRequired field for rule-aware projection requirements

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: enhance profileDataRequired field to allow additional properties for rule-aware projection

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: update profileDataRequired field to preserve unknown fields for rule-aware projection

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: run malicious job from /app to use the rule watched path

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: change working directory for malicious job to /var/log

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: increase timeout for helm upgrade and kubectl wait in component tests; update malicious job to include command and args

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: update malicious job working directory to /tmp and modify command for service account token access

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: update malicious job to read environment variables from /proc/self/environ

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: create marker file in /var/lib/r0002-test for malicious job

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: enable file access anomalies detection (R0002)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* get services from API, removing sidecar requirement (#772)

* get services from API, removing sidecar requirement

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* fix: add timeout and file-based fallback to LoadServiceURLs

- Bound HTTP service discovery to 10 s so a slow/unreachable API
  cannot stall node-agent startup; failure is handled gracefully by
  the existing nil-check at the call site.
- Restore SERVICES env var / /etc/config/services.json fallback
  (using ServiceDiscoveryFileV3) so sidecar deployments retain
  scan-failure reporting without requiring migration to API_URL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests(resources): 20 NetworkNeighborhood fixtures for v0.0.2 wildcard surface

Living documentation for the feat/network-wildcards work. Each fixture
is a complete, kubectl-applicable NetworkNeighborhood document
exercising ONE edge case in the v0.0.2 wildcard surface. Test_34
(forthcoming) consumes them directly; users learning the syntax can
copy-paste them as authoritative examples.

Coverage:
  01 — IPv4 literal in ipAddresses[]
  02 — IPv6 literal (canonicalisation)
  03 — IPv4 CIDR
  04 — IPv6 CIDR
  05 — '*' sentinel for ANY IP (with discouragement annotation)
  06 — 0.0.0.0/0 + ::/0 (RFC-aligned alternative to '*')
  07 — mixed list (literal + CIDR + sentinel)
  08 — backward-compat singular ipAddress
  09 — DNS literal
  10 — DNS leading '*' (RFC 4592)
  11 — DNS mid '⋯' (DynamicIdentifier)
  12 — DNS trailing '*' (one or more, never zero)
  13 — trailing-dot normalisation
  14 — '**' recursive — admission MUST reject
  15 — egress + ingress on same container, direction isolation
  16 — egress: [] NONE (declared zero-egress)
  17 — realistic Stripe API + cluster DNS
  18 — Kubernetes service-FQDN via mid '⋯' (the user's case)
  19 — port + protocol + CIDR composed
  20 — multi-container pod, different rules per container

README.md indexes all fixtures and lists the wildcard token vocabulary.

Each fixture's header comment lists the edge case, expected outcomes,
match path, spec reference, and operational guidance. Ready to be
consumed by node-agent's Test_34_NetworkWildcardSurface (forthcoming)
and by storage's networkmatch unit tests via testdata-style references.

* feat(nn): rewire CEL functions to use storage networkmatch

Replaces byte-equality with the v0.0.2 wildcard-aware matchers from
storage's pkg/registry/file/networkmatch — applied symmetrically to
all six nn.* CEL functions (egress + ingress mirror images):

  nn.was_address_in_egress / _in_ingress
  nn.is_domain_in_egress   / _in_ingress
  nn.was_address_port_protocol_in_egress / _in_ingress

Each function now walks BOTH the deprecated singular field
(IPAddress / DNS, byte-equality, back-compat) AND the new plural
field (IPAddresses / DNSNames, wildcard-aware) on each NetworkNeighbor
entry. A profile that uses only the deprecated form behaves exactly
as before; a profile that uses the new form gains CIDR + wildcard
matching with no rule-side changes required.

Two helpers (neighborMatchesIP / neighborMatchesDNS) factor the
two-list walk so the six call sites stay readable. Compiled-form
caching of the matcher across calls is deferred to a follow-up — the
existing cel functionCache still memoises (containerID, observed)
tuples, so the per-call MatchIP/MatchDNS overhead only fires on
cache misses.
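
The two-list walk, sketched with a generic match callback standing in for networkmatch.MatchIP; at this point the deprecated singular field keeps byte-equality (a later review round routes it through MatchIP as well):

```go
package sketch

// networkNeighbor carries only the fields relevant to address matching.
type networkNeighbor struct {
	IPAddress   string   // deprecated singular form
	IPAddresses []string // v0.0.2 list form, wildcard/CIDR-aware
}

// neighborMatchesIP walks both the deprecated singular field and the new
// plural field; match stands in for the wildcard-aware matcher.
func neighborMatchesIP(n networkNeighbor, observed string, match func(allowed []string, observed string) bool) bool {
	if n.IPAddress != "" && n.IPAddress == observed {
		return true // back-compat byte-equality on the singular field
	}
	return len(n.IPAddresses) > 0 && match(n.IPAddresses, observed)
}
```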

Tests cover:
  - CIDR membership across egress/ingress
  - '*' sentinel for any IP
  - leading-* DNS wildcard (RFC 4592, exactly one label)
  - mid-⋯ DynamicLabel (the kubernetes service-FQDN case)
  - trailing-dot resilience
  - direction isolation (egress and ingress lists are walked
    independently — same address allowed on one direction
    must NOT match the other)
  - back-compat: deprecated singular IPAddress/DNS still works
  - mixed: profile with one entry using singular, another using plural
  - composed match: CIDR + port + protocol on the granular variant

go.mod: temporary local-path replace for kubescape/storage so the
node-agent picks up the in-flight feat/network-wildcards work; user
flips back to fork ref before pushing.

* test(nn): fixture-walk parser + behaviour gate

TestFixturesParse: every YAML under tests/resources/network-wildcards/
parses against the v1beta1 NetworkNeighborhood schema. The fixtures
double as authoritative user-facing syntax documentation, so a fixture
that fails to parse is a documentation bug.

TestFixturesMatchExpectedBehaviour: representative observed→match
triples for each major edge case (literal IP, CIDR, '*' sentinel,
deprecated singular IPAddress, leading-* DNS RFC 4592, mid-⋯
DynamicLabel, direction isolation between egress and ingress) are
exercised through the actual nn.* CEL functions. If a fixture's
header comment says '10.1.2.3 → match' and the matcher disagrees,
ONE of them is wrong; this test pins both.

True end-to-end Test_34_NetworkWildcardSurface (kubectl-applies the
fixtures against a live cluster) belongs in the iximiuz lab; that
job is left for the lab pass once the storage + node-agent images
ship via the fork CI.

* chore: drop k8sstormcenter/storage from go.sum

Local replace points at ../storage so the fork ref isn't fetched.
User reverts both go.mod and go.sum before pushing the branch.

* chore: gitignore .claude + pin storage to fork ref carrying networkmatch

Updates the storage replace to a pseudo-version on the fork that includes
the v0.0.2 wildcard surface (pkg/registry/file/networkmatch/, IPAddresses
schema field, REST validation). Build and tests stay green against the
pinned ref.

The .claude/ entry on .gitignore prevents the agent state directory from
being tracked accidentally.

* fix(nn): address CodeRabbit review on PR #41

Five findings, all legit, all fixed:

- Port range guard (Major): wasAddressPortProtocolInEgress/Ingress now
  reject portInt outside [0, 65535] BEFORE narrowing to int32. Without
  this, a CEL value like 4294967739 wraps to 443 and would falsely
  match a port-443 entry. New TestWasAddressPortProtocolInEgress_
  PortWrapRejected pins the contract.

- neighborMatchesDNS now routes the deprecated singular DNS field
  through MatchDNS (single-element slice) instead of raw string
  equality, so back-compat behaviour gets the same trailing-dot
  stripping + lowercasing as the new DNSNames[]. New
  TestIsDomainInEgress_DeprecatedDNS_TrailingDotParity pins this.

- Direction-isolation fixture test now exercises BOTH
  wasAddressInEgress and wasAddressInIngress for each observation,
  via a new ipBothCheck struct. The prior version only checked egress,
  so a regression that broke ingress matching would have slipped through.

- TestFixturesParse uses yaml.UnmarshalStrict so a typo in any user-
  facing fixture (the YAML files double as documentation) fails the
  test instead of silently parsing.

- README clarifies that fixture 14 is intentionally rejected at
  admission and shouldn't be kubectl-applied — points readers at the
  index entry so they don't try to use it as a template.

Also bumps the storage replace to e1263bf6, which carries storage's CR
fixes (deprecated IPAddress validation, ValidateUpdate now also runs
network-profile validation, field-path assertions in admission tests).

* chore(deps): bump storage SHA to 0910dc3f (CR round 2)

Pulls in storage's CR round-2 fixes: deterministic admission error
ordering across container groups, and field-path assertions on the
ValidateUpdate test.

* chore(deps): bump storage SHA to 02c4438f (CR round 3)

Pulls in storage's deprecated-DNS validation parity fix.

* fix(nn): address CodeRabbit round 2 on PR #41

Two findings, both nitpick-level, both applied:

- Remove the unused 'maps', 'objectcache', 'objectcachev1' imports
  from fixtures_test.go along with the blank-identifier _ = ... lines
  at the bottom that existed only to silence the unused-import error.
  buildLibWithContainer is defined in wildcard_test.go (same package),
  so fixtures_test.go has no real need for those imports.

- Route the deprecated singular IPAddress through networkmatch.MatchIP
  for symmetry with the deprecated singular DNS (which round 1 already
  routed through MatchDNS). Both deprecated fields now get the same
  canonicalisation (IPv6 expanded forms, IPv4-mapped IPv6) as the new
  list fields. New TestWasAddressInEgress_DeprecatedIPAddress_
  IPv6Canonicalisation pins this.

* test(nn): pin wildcard/CIDR semantics on deprecated IPAddress (CR round 3)

CR caught that the round-2 routing of deprecated IPAddress through
MatchIP had a documentation gap: existing tests only proved literal
+ canonical (IPv6) matching, never the wildcard/CIDR semantics that
MatchIP now also enables on the deprecated field.

Adds TestWasAddressInEgress_DeprecatedIPAddress_AcceptsWildcardAndCIDR
which pins the contract: deprecated singular field accepts the SAME
wildcard token vocabulary as the new list form — '*' sentinel,
CIDRs, 0.0.0.0/0 and ::/0 alternatives. Comment on neighborMatchesIP
documents this is intentional unification, not accidental.

* fix: improve logging for rules with missing profileDataRequired (#803)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* perf(nn): amortise CompileIP/CompileDNS via per-container matcher cache

Profile-checksum-invalidated cache of compiled networkmatch.IPMatcher /
DNSMatcher per (containerID, neighborIndex). The previous code path
re-compiled every NetworkNeighbor's entries on each CEL function-cache
miss; this PR builds each matcher at most once per profile-checksum
lifetime and reuses it across subsequent misses.

Design:

  matcherCache (sync.Map) inside nnLibrary, zero-value safe so existing
  test fixtures that construct nnLibrary{} directly continue to work
  without changes.

  Per-container entry tagged with the profile's SyncChecksumMetadataKey
  annotation. On lookup: if checksum matches, reuse; else allocate a
  fresh containerMatchers and store with LoadOrStore (concurrent-safe).

  Per-neighbor matchers are nil-init and lazily compiled on first use,
  so a profile with 10 egress entries that only ever fires through 2 of
  them pays compile cost for only those 2.
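
The cache shape, sketched with hypothetical types; matcher compilation and the per-neighbor lazy build are elided:

```go
package sketch

import "sync"

// neighborMatchers would hold the lazily compiled IP/DNS matchers for one
// NetworkNeighbor entry (elided here).
type neighborMatchers struct{}

// containerMatchers holds the matcher slots for one container under one
// profile checksum.
type containerMatchers struct {
	checksum  string
	neighbors []neighborMatchers
}

type matcherCache struct {
	entries sync.Map // containerID -> *containerMatchers
}

// getOrBuild reuses the cached matchers while the profile checksum is
// unchanged, and builds a fresh set when the checksum moves.
func (c *matcherCache) getOrBuild(containerID, checksum string, neighborCount int) *containerMatchers {
	if v, ok := c.entries.Load(containerID); ok {
		if m := v.(*containerMatchers); m.checksum == checksum {
			return m // profile unchanged: reuse compiled matchers
		}
		fresh := &containerMatchers{checksum: checksum, neighbors: make([]neighborMatchers, neighborCount)}
		c.entries.Store(containerID, fresh) // checksum moved: replace stale entry
		return fresh
	}
	fresh := &containerMatchers{checksum: checksum, neighbors: make([]neighborMatchers, neighborCount)}
	actual, _ := c.entries.LoadOrStore(containerID, fresh)
	return actual.(*containerMatchers)
}
```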

Benchmarks (arm64, -benchtime=1s):

  IP, realistic profile (5 neighbors x 3 entries, observation misses all):
    Cold (per-call recompile): 1733 ns/op   1920 B/op   76 allocs/op
    Hot  (cached matchers)   :  177 ns/op     32 B/op    2 allocs/op
                             ~ -90% time, -98% bytes, -97% allocs

  DNS, realistic profile:
    Cold: 1219 ns/op   1800 B/op   41 allocs/op
    Hot :  318 ns/op    272 B/op    7 allocs/op
                             ~ -74% time, -85% bytes, -83% allocs

  Churning profile (checksum flips every iteration — pathological):
    1527 ns/op   1936 B/op   77 allocs/op
    Matches cold path: cache overhead itself is negligible; the savings
    come strictly from amortising compile across stable-checksum windows.

In production this stacks on top of the existing CEL functionCache
(which already absorbs same-(containerID,observed) cache hits). The
matcher cache catches what slips through: unique-observation cache
misses within a profile-checksum lifetime.

Touched:
  - matcher_cache.go             new file: cache impl
  - matcher_cache_bench_test.go  new file: comparison bench
  - network.go                   use cached matchers in all 6 CEL fns
  - nn.go                        matcherCache field on nnLibrary

* fix(matcher_cache): atomic-pointer lazy init + unconditional staleness replace (CR #42)

Two findings from CodeRabbit round 1, both fixed:

1. Stale-entry shape race in getOrBuild (Major)

   Old code used LoadOrStore on the staleness path and only replaced
   on checksum mismatch — but a shape mismatch (neighbor count change)
   could leak the stale entry to a caller whose profile has a different
   shape, which then index-panics in ipMatcher/dnsMatcher.

   Fix: when staleness is detected (by checksum OR shape), always
   Store unconditionally. Worst-case contention: several goroutines
   build shape-correct fresh entries and one Store wins; all callers
   still see a shape-correct entry. Orphans get GC'd.

2. Unsynchronised lazy-init of per-neighbor matchers (Critical)

   neighborMatchers.ip / .dns were *Matcher with a non-atomic 'if nil
   then build then assign' pattern — a real data race.

   Fix: switched to atomic.Pointer[networkmatch.IPMatcher] (and DNS).
   First-build callers may race on Compile but only one pointer wins
   via CompareAndSwap; everyone returns the winning matcher. Pure
   functions (no shared state) so duplicate Compile work is wasteful
   but not incorrect.
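
The lazy-init pattern, sketched with a stand-in matcher type:

```go
package sketch

import "sync/atomic"

type ipMatcher struct{} // stand-in for the compiled networkmatch matcher

func compileIP(entries []string) *ipMatcher { return &ipMatcher{} } // compilation elided

// neighborMatchers compiles its IP matcher lazily: racing first-build
// callers may each compile, but only one pointer wins the CompareAndSwap
// and every caller returns that winner.
type neighborMatchers struct {
	ip atomic.Pointer[ipMatcher]
}

func (n *neighborMatchers) ipMatcherFor(entries []string) *ipMatcher {
	if m := n.ip.Load(); m != nil {
		return m
	}
	fresh := compileIP(entries)
	if n.ip.CompareAndSwap(nil, fresh) {
		return fresh
	}
	return n.ip.Load() // another goroutine won the race
}
```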

New tests in matcher_cache_test.go pin the contract:
  - TestMatcherCache_ConcurrentFirstBuild: 64 goroutines racing on
    the same slot, run under -race, asserts matchers are populated
    exactly once
  - TestMatcherCache_StaleEntryReplaced: shape-mismatch path returns
    a fresh containerMatchers, not the stale one
  - TestMatcherCache_ChecksumPreservedAcrossCalls: same checksum hits
    cache (no rebuild)

Benchmarks re-run after atomic.Pointer switch — negligible impact
(177 → 186 ns/op, still 8x faster than cold path). All headline
savings preserved.

* test(matcher_cache): add start barrier to concurrency test (CR #42 round 2)

Without the barrier, goroutine launch jitter staggers first-call
arrivals, hiding any unsynchronised-write data race during the
first-build window. With the barrier, all 64 goroutines hit the
contended path simultaneously when close(start) fires — much tighter
race-detector coverage of the atomic.Pointer.CompareAndSwap path.
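
The barrier, sketched:

```go
package sketch

import "sync"

// runWithStartBarrier launches n goroutines that all park on the same channel
// so they hit the contended path at the same instant when close(start) fires.
func runWithStartBarrier(n int, hit func(worker int)) {
	start := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			<-start // wait for the barrier
			hit(worker)
		}(i)
	}
	close(start) // release every worker simultaneously
	wg.Wait()
}
```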

* feat: recover wildcards + exec-args matching on top of upstream projection

Upstream's projection-v1 (PR #799) explicitly dropped two pieces of
behaviour that the fork's earlier wildcard work relied on:

  1. Network-surface Patterns (CIDRs, '*' sentinels, DNS leading-/mid-/
     trailing-wildcards) were never populated because projectField only
     routed entries to Patterns on path surfaces.
  2. wasExecutedWithArgs() degraded to path-only matching — 'args list
     is validated but not matched against', with a comment that
     ExecArgsByPath is 'future work'.

This commit re-introduces both, layered cleanly on top of the
projection rather than working around it:

Network wildcards (spec §5.7, §5.8)
- projectField: third parameter is now an isDynamic classifier rather
  than the isPathSurface bool. Path surfaces pass containsDynamicSegment;
  network surfaces pass isNetworkIPWildcard or isNetworkDNSWildcard
  (each fork-defined here).
- extractEgressAddresses / extractIngressAddresses now also pull the
  v0.0.2 IPAddresses[] list-form alongside the deprecated singular
  IPAddress (storage's networkmatch wildcards land in Patterns).
- CEL helpers (nn.was_address_in_*, nn.is_domain_in_*) now consult
  Values, All, and Patterns via networkmatch.MatchIP / MatchDNS.
- matchIPField canonicalises observed IPs (net.ParseIP) so IPv6 expanded
  forms and ::ffff: IPv4-mapped addresses hit the Values fast path.
- matchDNSField normalises trailing dots on observed and tries both
  forms against Values.

Exec-args matching restored
- ProjectedContainerProfile gains ExecsByPath map[string][]string —
  the per-Path Args slice from cp.Spec.Execs.
- extractExecsByPath populates it in projection_apply.
- wasExecutedWithArgs runs dynamicpathdetector.CompareExecArgs against
  the matched profile entry. Back-compat: a path with no ExecsByPath
  entry matches with no argv constraint (preserves old wasExecuted-
  equivalent behaviour for partial profiles).

Mock parity
- RuleObjectCacheMock.GetProjectedContainerProfile now routes the same
  classifications (network wildcards → Patterns, path dynamics →
  Patterns, ExecsByPath populated). Tests no longer need a real cache.
- ensureProjectedAllInit no longer mis-sets All=true (that's the
  match-any sentinel, not a comprehensiveness hint).

Tamper detection survives
- tamperAlertExporter + tamperEmitted fields re-added to the new
  ContainerProfileCacheImpl so the R1016 wiring keeps working.
- exporters import added.

Storage pin: k8sstormcenter/storage @ b23d85f0 (merge/upstream-
profile-rearch, which carries networkmatch + IPAddresses schema +
upstream's clean-standalone-pods).

Known port/protocol regression (degradation documented in tests):
- was_address_port_protocol_in_egress / _in_ingress still degrade to
  address-only — port/protocol granularity needs an AddressPortsByAddr
  projection field which upstream noted as future work. Updated unit
  tests document the degradation; the only production rules that would
  exercise this didn't use the helper anyway.

Full suite: 46/48 packages green. 2 failing (containerwatcher/v2/tracers,
validator) are pre-existing eBPF kernel-privilege issues that reproduce
on main without root.

* fix: address CodeRabbit round 1 on PR #43

Two findings on code introduced by the wildcards recovery commit; the
other 5 findings touch upstream code I didn't modify in this rebase
and are out of scope.

CRITICAL — drop field.All short-circuit in matchIPField/matchDNSField:

  ProjectedField.All is the producer-side flag set by projectField when
  no rule declared profileDataRequired for the surface (pass-through
  retention mode). In that mode projectField already populates Values
  with every raw entry, so the Values lookup catches the match.
  Treating All=true as a 'match any input' sentinel in the consumer
  would let an unknown IP/DNS match even when absent from the profile
  — a false-positive admission bug.

  Removed both All short-circuits. Values + Patterns lookups cover the
  semantic correctly: pass-through projects everything into Values;
  rule-declared mode filters Values to the declared subset and routes
  wildcards to Patterns. Either way, an unknown observation falls
  through to false.

MAJOR — clone Args slice in extractExecsByPath:

  Apply() is contract-bound to be a pure transform of the source
  profile. extractExecsByPath was aliasing cp.Spec.Execs[i].Args into
  the projected map, so a consumer mutating the projected slice could
  silently corrupt the underlying CRD pointer. Cloned via copy() so
  Apply stays observably pure.
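
The clone, sketched:

```go
package sketch

// cloneArgs copies the profile's Args slice so mutations of the projected
// value can never reach the underlying CRD-backed slice.
func cloneArgs(args []string) []string {
	if args == nil {
		return nil
	}
	out := make([]string, len(args))
	copy(out, args)
	return out
}
```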

Skipped (upstream code not touched in this rebase):
  - actions/setup-go@v4 in component-tests.yaml
  - Silent SBOM URL error in cmd/main.go
  - Blocking channel send in rulebindingmanager/cache.go
  - Pre-existing was_path_opened_with_flags behaviour in open_test.go
  - Nil CEL arg guard in function_cache.go

* fix: restore fork's .github and tests/chart from main (lost during upstream merge)

The earlier merge of upstream/main mistakenly took upstream's versions
of fork-customized files. Restored from origin/main:

  .github/workflows/component-tests.yaml — fork's smart rebuild logic
    (skip image build when only tests/.github change), manual dispatch
    options, signature-verification branch handling, and the
    architectural comment block.

  tests/chart/templates/node-agent/default-rules.yaml — fork's
    'default-rules' (not upstream's 'kubescape-rules') with all the
    fork-specific tags, profileDataRequired shapes, exec-arg-wildcard
    rules (PR #38) and tamperalert/signed-profile rules (PR #22).
    +614 line diff vs the (wrongly-taken) upstream version.

  tests/chart/crds/rules.crd.yaml — fork's CRD shape.
  tests/chart/templates/node-agent/configmap.yaml — fork's configmap.
  tests/chart/values.yaml — fork's chart values.

All Go code builds clean, all packages tests still green. Fork's
component-tests workflow + custom rules survive intact.

* fix(parse): get_exec_path 3-arg overload — symmetric with recording

ROOT CAUSE of R0001 convergence regression on merge/upstream-profile-rearch
(observed when bobctl tune ran against the rebased node-agent image):

The recording-side resolver in pkg/containerprofilemanager/v1/
event_reporting.go:resolveExecPath uses
  1. exepath (kernel-authoritative)
  2. argv[0] when non-empty
  3. comm
That fix landed in upstream PR #800 ('fix exec path symmetric resolver')
on the RECORDING side, but the rule-side helper parse.get_exec_path
was left as a 2-arg function honouring only (args, comm). The comment
on resolveExecPath claims symmetry, but the rule side was missing
exepath entirely — the symmetry was aspirational.

Effect on shell invocations: the kernel reports exepath=/bin/sh,
argv[0]=sh, comm=sh. resolveExecPath writes '/bin/sh' into the
ApplicationProfile. The rule side queries 'sh' (argv[0]). The map
lookup misses → R0001 'Unexpected process launched' fires. The
autotuner adds 'sh' to AllowedProcesses, but the alert keeps firing
because the runtime is still looking for '/bin/sh' under the hood.

Fix in two parts:

1. pkg/rulemanager/cel/libraries/parse/parse.go + parselib.go:
   - 2-arg overload preserved for back-compat.
   - New 3-arg overload parse.get_exec_path(args, comm, exepath) that
     mirrors resolveExecPath's precedence: exepath → argv[0] → comm.

2. tests/chart/templates/node-agent/default-rules.yaml:
   - All 7 rule expressions updated to pass event.exepath as the
     third arg. Rules: R0001 + 6 others (in the same expression
     pattern). Stable migration via sed s/(event.args, event.comm)/
     (event.args, event.comm, event.exepath)/g.

Tests:
  TestGetExecPath_SymmetryWithRecordingSide pins the contra…