[observer] Add logssource component — feed container logs to observer without logs agent#49481
CelianR merged 34 commits into q-branch-observer from
Conversation
… without logs agent

New component at `comp/observer/logssource/` that pipes container logs into the observer when `logs_enabled=false`. Reuses the existing container launcher, processor, and auditor; adds a drain-only pipeline with no network sender.

- Gated: no-ops when `logs_enabled=true` (the real logs agent feeds the observer instead) or when observer/workloadmeta are unavailable
- Discovers containers automatically via workloadmeta (no customer annotations)
- Filters pause containers and agent-own containers
- Uses the real auditor for offset tracking (prevents replay spikes on restart)
- Container ID idempotency prevents a dual tailer on repeated Set events
- Ordered shutdown sequence prevents deadlock between launcher and processor drain
Gitlab CI Configuration Changes
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 361 | 0 | 0 |
Updated: .gitlab/distribution.yml
Changes Summary
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 0 | 2 | 0 |
ℹ️ Diff available in the job log.
Go Package Import Differences
Baseline: e5b320d
logssource.Component has no consumers in the Fx graph, so Fx would never call NewComponent without this. fx.Invoke is the established codebase convention for side-effectful components (connectivitychecker, configsync, privateactionrunner, etc.).
Static quality checks
❌ Please find below the results from static quality gates
Error
Gate failure full details
Static quality gates prevent the PR from merging!
Successful checks
Info
On-wire sizes (compressed)
…dfilter GetContainerPausedFilters() uses the same curated pause image list as the containerd/docker workloadmeta collectors. The image-name substring check is kept as a fallback when FilterStore is absent (noop builds, tests). isAgentContainer() stays as a heuristic — workloadfilter has no built-in concept of "is this the agent itself".
ObserveLog is now called in the logssource drain goroutine rather than inside the processor. The processor no longer has any observer dependency.

- Add PassthroughEncoder: preserves the raw log line through the pipeline so the observer sees actual text, not a JSON transport envelope
- logssource pipeline uses PassthroughEncoder and calls ObserveLog in the drain goroutine
- Remove observerHandle from processor.New, NewPipeline, NewProvider, and all callsites (logs agent, otelcol, security, compliance)
Adds three Info-level log lines to confirm:
- component started (not noop'd)
- container sources discovered via workloadmeta
- first log received through the drain goroutine

Temporary debug instrumentation for live episode validation.
The container launcher is a source-type transformer, not a log reader — it publishes child Type=file sources via WrappedSource. Without a file launcher subscribed to FileType sources, tailers are created but nothing reads the file content. Matches the pattern in agent_core_init.go where both launchers are registered on the same LogSources.
The socket case in makeTailer silently swallowed errors from makeFileTailer, making it impossible to diagnose why file tailing failed on EKS/containerd.
Instrument launchTailers to log entry, file count, and each skip point (ShouldIgnore, fingerprint) so we can see where container log tailing drops off on gensim-eks.
…d owner Container entities in workloadmeta can be populated by the containerd collector before the kubelet collector enriches them with the KubernetesPod owner. If source_provider emits a source at that point, the container launcher's tailerfactory calls getPodAndContainer which fails with "cannot find pod for container X" — a silent failure because the container launcher has no retry and the error only lands in source.Status. Workloadmeta re-notifies subscribers with the merged entity when any source changes (cached_entity.go), so skipping events without a KubernetesPod owner lets us pick the container up on the subsequent kubelet-enriched event. Observed on gensim-eks: 13 of 16 containers failed with this error before the fix, producing no container log observations.
…ssource The observer logssource component gates on logs_enabled=false — when the logs agent is on, our component no-ops to avoid double-tailing. The Helm chart only auto-mounts /var/log/pods and /var/log/containers when logs.enabled is true, so we add the hostPath mounts manually.
The one-shot "first log received" flag only proved a single message made it through — no signal about ongoing flow. Replace with a counter log at 1 and every 10000 messages so operators can see flow progressing, plus a final count when the drain goroutine exits (confirms clean shutdown).
Expand isAgentContainer heuristic to match all agent image short names containing "agent" (covers agent-dev, cluster-agent, etc.). Also raise the delivery count log threshold from 10 to 1000 to reduce noise.
## Summary
- Adds a journald launcher to the logssource component to tail kubelet process logs alongside container logs
- Hardcodes a `kubelet.service` log source so the observer ingests node-level kubelet events (lease failures, scheduling decisions, etc.)
- Mounts `/var/log/journal` as a hostPath in gensim-eks Helm values

The journald launcher is already built into the main agent binary (`systemd` is in `AGENT_TAGS`). On non-systemd builds, `journaldlauncher.NewLauncher` resolves to a no-op, so this compiles safely without the `systemd` tag.

## Why
Container logs only surface application-level symptoms. Kubelet logs provide node-level signals — lease failures, scheduling rejections, resource pressure — that precede container-level anomalies and can be used for root cause attribution.

## Test plan
- [x] Rebuild and deploy to gensim-eks
- [ ] Verify `Start tailing journal` log line appears in agent logs on startup
- [ ] Confirm kubelet logs appear in observer parquets tagged with `source:kubelet`
These were added to diagnose tailing issues on gensim-eks but live in shared code, so they fire for all log pipelines including the regular logs agent — not just logssource. Remove before merge.
…nt into eokye/logs_component
Add targeted instrumentation to understand whether kubelet journal logs reach the observer. In the 002_kubelet_kafka_saturation recording, 0 of 53k logs in parquet had source:kubelet despite the journald tailer starting — need to pinpoint where entries are lost.

- component.go: log once when the kubelet journald source is registered (confirms the build-tag path runs)
- pipeline.go: count source:kubelet messages separately in the drain goroutine, dump the first 5 kubelet messages with full tags+content, increase delivery log frequency 1000 → 100

To be reverted once kubelet tracking is verified end-to-end.
…et source

Two config fixes to address why kubelet journal entries don't reach the parquet in the 002_kubelet_kafka run:

1. ConfigID: "kubelet" — Previously our source had no ConfigID and no Path, so the journald tailer identifier defaulted to "journald:default". If any other journald source is also registered without a ConfigID, the launcher drops duplicates via an "already tailed" warning (launcher.go:86). Setting ConfigID guarantees our tailer gets its own slot and makes the startup log unambiguous (journald:kubelet).

2. TailingMode: "forceBeginning" — The default seek is SeekTail, so the tailer only reads journal entries written after the agent starts. Kubelet has typically been running for hours before the agent comes up, so all historical OOMKill, CrashLoopBackOff, and probe-failure events are invisible. Force beginning seeks to the journal head so we capture the full history.

Also updates the startup log to include ConfigID and TailingMode so the applied config is visible in agent logs.
The kubelet journald tailer was opening successfully but reading 0 entries. Root cause: the agent container has an empty /etc/machine-id. libsystemd's sd_journal_open() uses /etc/machine-id to resolve the per-machine journal directory under /var/log/journal/<machine-id>/. Without a valid machine-id, it can't find the journal files that DO contain kubelet.service entries. Fix: bind-mount the host's /etc/machine-id read-only into the agent pod. This is the canonical pattern for journal-reading containers — it's how any libsystemd consumer locates the host journal. Verified live on the gensim-eks cluster: after mounting machine-id, the tailer drains the full kubelet journal and continues live-tailing (source:kubelet count grows as kubelet writes new entries).
The machine-id bind-mount is what actually fixed the journald tailer's zero-entries problem. forceBeginning was a speculative addition that turned out to be unnecessary and introduces a ~16k-entry backlog dump at startup (replay of 3+ days of kubelet journal), which is noisy for detectors. Drop it — default seek-to-tail is fine for gensim since disruptions fire after agent startup.

Also trim debug instrumentation added during the investigation:
- Remove the first-5-kubelet-message content dump (verified, no longer needed)
- Revert delivery log frequency from %100 back to %1000

Keep:
- ConfigID: "kubelet" — good hygiene; guarantees a unique tailer id (journald:kubelet) if another default-id journald source is ever registered elsewhere in the agent.
- source:kubelet counter in the delivery log — cheap, and gives instant future visibility into whether journal tailing is flowing.
- Startup config log — one-time, useful for debugging.
Address review feedback (PR #49481 comments from CelianR and Codex): the existing `kubelet || docker` build tag was broader than the component's actual support. source_provider.handleSet() requires every container to have a KubernetesPod owner before emitting a LogSource, so any non-Kubernetes environment (standalone Docker, CRI-O, Podman) gets all containers dropped silently even though the component compiles. Narrow the tag to `kubelet` only on component.go, pipeline.go, source_provider.go, and source_provider_test.go, and flip noop.go's inverse to `!kubelet`. This reflects the component's real scope — Kubernetes via kubelet — and makes the tags match what workloadmeta enrichment requires. No behavior change for standard agent builds: the full agent has both `kubelet` and `docker` tags (and still gets the real component), and the Heroku and IoT agents have neither (and still get the noop). The only effect is on custom `-tags=docker`-only builds, which were already non-functional due to the KubernetesPod owner check.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 43b8b53108
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
runningContainer() was missing an Owner field, so test fixtures silently tripped handleSet's KubernetesPod owner guard and returned early before adding a source — 4 tests were failing.
Added an option to disable the logs agent for the EKS gensim task; this will be used for edge-only log AD.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: efc5d9dabe
@codex review plz
Tweaked a few things:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b4c174ab5
I'm gonna merge this PR, a few notes:
## Summary
- New component at `comp/observer/logssource/` that pipes container logs into the observer when `logs_enabled: false`, with no logs shipped to the Datadog backend

## Design
- Ordered shutdown: `cancel → sp.wait() → launchersMgr.Stop() → proc.Stop() → close(outputChan) → <-drainDone` — prevents deadlock between launcher writes and processor drain

## Test plan
- Unit tests (`dda inv test --targets=./comp/observer/logssource/...`) cover `handleSet`/`handleUnset` behavior (running filter, idempotency, re-add after removal)
- `goleak` verifies clean exit after cancel+wait