feat(kafka): SASL/GSSAPI plugin + CLI + Docker E2E (0.15.0)#96

Draft
sionsmith wants to merge 9 commits into main from feat/gssapi-plugin

@sionsmith
Contributor

Summary

Adds SASL/GSSAPI (Kerberos) authentication via the SaslMechanismPlugin
trait from #94. Default builds stay Kerberos-free; opt in with
cargo build --features gssapi -p kafka-backup-cli.

Credits @kthimjo (#95): state machine, credential hints, and
GSS_MECH_KRB5 wiring are adapted from their PR. The plugin-trait
refactor, process-wide KRB5_ENV_LOCK, KRB5CCNAME isolation, upfront
keytab validation, unit tests, and the Docker KDC fixture are ours.

What's in the box

  • SaslMechanism::Gssapi enum variant + SecurityConfig fields
    (sasl_kerberos_service_name, sasl_keytab_path, sasl_krb5_config_path).
  • GssapiPlugin:
    • RFC 4752 Phase 1 multi-round gss_init_sec_context.
    • Phase 1→2 turnaround + Phase 2 layer=0x01 (no sec layer, no size)
      wrap/unwrap.
    • KIP-368 re-auth via fresh-context rebuild.
    • Process-wide KRB5_ENV_LOCK mutex serialises KRB5_CLIENT_KTNAME /
      KRB5_CONFIG / KRB5CCNAME mutation during Cred::acquire, which
      eliminates the multi-client env-var race inherent to libgssapi 0.9.
    • KRB5CCNAME=MEMORY:<ptr> per-plugin ccache isolation when a keytab
      is configured — prevents stale tickets in the OS default ccache
      (common on macOS API:<uuid> caches) from being preferred over a
      fresh TGT from the keytab.
  • CLI: --sasl-mechanism, --sasl-keytab, --sasl-krb5-config,
    --sasl-kerberos-service-name on offset-reset, offset-reset-bulk,
    and offset-rollback. YAML configs auto-wire the GSSAPI plugin when
    sasl_mechanism: GSSAPI is set. Helpful runtime error if the CLI was
    built without --features gssapi.
  • Shared CLI security-args parsing (commands/security_args.rs)
    consolidates three prior copies.
  • Docker KDC fixture at tests/sasl-gssapi-test-infra/ (MIT KDC +
    cp-kafka 7.7.0 configured for SASL_PLAINTEXT://kafka.test.local:9098
    with GSSAPI enabled, realm TEST.LOCAL, keytab auto-gen healthcheck).
  • Three #[ignore] E2E tests (keytab happy-path, missing-keytab clear
    error, KIP-368 reauth fires within broker's 60s window).

Local E2E evidence (macOS aarch64, MIT krb5 1.22.2)

| Gate | Result |
| --- | --- |
| cargo fmt --all -- --check | clean |
| cargo clippy --all-targets -- -D warnings | clean |
| cargo clippy --all-targets --all-features -- -D warnings | clean |
| cargo test --workspace --lib --bins --all-features | 206 pass |
| OAUTH E2E (sasl_oauth_* ignored) | all green |
| GSSAPI E2E: sasl_gssapi_keytab_e2e | connect + metadata OK |
| GSSAPI E2E: sasl_gssapi_missing_keytab_surfaces_clear_error | clear error |
| GSSAPI E2E: sasl_gssapi_reauth_fires_within_broker_window | 6× 15s probes all OK across 90s (crosses broker's 60s reauth window) |
| CLI smoke: offset-rollback snapshot against live fixture | auth handshake OK, GSSAPI Phase 2 complete server_layers=0x01 |

CI skips sasl_* integration tests by default because the OAUTH / GSSAPI
compose fixtures aren't brought up in the workflow (same pattern as the
pre-existing --skip tls). See .github/workflows/test.yml and the
fixture READMEs for manual runs.

Build requirements

The gssapi feature links against MIT krb5 at build time:

  • macOS: brew install krb5 + export PKG_CONFIG_PATH="$(brew --prefix krb5)/lib/pkgconfig:…".
    Apple's bundled Heimdal does not expose the symbols libgssapi 0.9 requires.
  • Debian/Ubuntu: apt-get install libkrb5-dev.
  • Fedora/RHEL: dnf install krb5-devel.

Limitations (V1)

Single-broker Kerberos only. Routing a dedicated plugin instance per
broker for multi-broker clusters is a known follow-up — the
PartitionLeaderRouter path currently clones the configured plugin
Arc for every broker connection. Release binaries and the default
Docker image do not include GSSAPI; build your own image with
--build-arg FEATURES=gssapi once the downstream image ships that arg.

Test plan

  • Local cargo fmt --check / clippy -D warnings (default + all-features)
  • Local cargo test --workspace --lib --bins --all-features green
  • Local OAUTH + GSSAPI E2E green (including 90s reauth probe)
  • Local CLI smoke against live GSSAPI fixture authenticates
  • CI green on this PR (Integration Tests skip SASL fixtures; unit + clippy cover the added code)
  • Mark ready for review

🤖 Generated with Claude Code

sionsmith and others added 9 commits April 21, 2026 20:14

Adds a `gssapi` cargo feature on both `kafka-backup-core` and
`kafka-backup-cli` (passthrough) with `default = []`, plus an optional
`libgssapi = "0.9"` workspace dependency. No logic changes — subsequent
commits build the `GssapiPlugin` impl behind this gate.

Default builds are unchanged and do not pull `libgssapi`. The gssapi
feature requires system krb5 development headers at build time:
- Debian/Ubuntu: `apt-get install libkrb5-dev`
- RHEL/Fedora: `dnf install krb5-devel`
- macOS: `brew install krb5` (then
  `export PKG_CONFIG_PATH="$(brew --prefix krb5)/lib/pkgconfig:$PKG_CONFIG_PATH"`)

Verified:
- `cargo check -p kafka-backup-core` (default, no libgssapi)
- `cargo check -p kafka-backup-core --features gssapi` (pulls libgssapi)
- `cargo check -p kafka-backup-cli` (default)
- `cargo check -p kafka-backup-cli --features gssapi` (passthrough works)

Part of the GSSAPI plugin rework superseding PR #95 (authored by
@kthimjo) on the `SaslMechanismPlugin` extension point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the always-present `SaslMechanism::Gssapi` enum variant and three
optional `SecurityConfig` fields backing it:

- `sasl_kerberos_service_name` — Kafka service principal (defaults to
  `kafka` at the CLI layer)
- `sasl_keytab_path` — keytab file path; OS credential cache is used
  if unset
- `sasl_krb5_config_path` — path to `krb5.conf`; system default if unset

All three are `#[serde(default)]` so existing configs keep parsing.
YAML round-trip tested: `sasl_mechanism: GSSAPI` (SCREAMING-KEBAB-CASE)
decodes to `SaslMechanism::Gssapi` and all three path fields populate.

The variant is always compiled (so the YAML surface is consistent
across binaries), but a working GSSAPI client requires the `gssapi`
cargo feature at the CLI level. Core's `authenticate()` surfaces a
clear error if `SaslMechanism::Gssapi` is set without a plugin — the
CLI installs a `GssapiPlugin` via `populate_sasl_plugin` in a later
commit.
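
A minimal sketch of the "variant always compiled, plugin optional" arrangement described above. The `SaslMechanism::Gssapi` variant and the field names come from this commit; the `Plain` variant, the `String` stand-in for the plugin, the helper name `check_plugin_installed`, and the error wording are assumptions made for illustration, not the actual core API.

```rust
use std::path::PathBuf;

/// Mechanism enum. `Gssapi` is from the commit; `Plain` is a placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SaslMechanism {
    Plain,
    Gssapi,
}

/// Subset of the config. All Kerberos fields are optional so that
/// existing configs keep parsing when the fields are absent.
#[derive(Debug, Default)]
pub struct SecurityConfig {
    pub sasl_mechanism: Option<SaslMechanism>,
    pub sasl_kerberos_service_name: Option<String>,
    pub sasl_keytab_path: Option<PathBuf>,
    pub sasl_krb5_config_path: Option<PathBuf>,
    /// Stand-in for the installed plugin (hypothetical type).
    pub sasl_mechanism_plugin: Option<String>,
}

/// Hypothetical check: GSSAPI selected without a plugin is an error.
pub fn check_plugin_installed(cfg: &SecurityConfig) -> Result<(), String> {
    match (cfg.sasl_mechanism, &cfg.sasl_mechanism_plugin) {
        (Some(SaslMechanism::Gssapi), None) => Err(
            "sasl_mechanism is GSSAPI but no SASL plugin is installed".to_string(),
        ),
        _ => Ok(()),
    }
}
```

With `SecurityConfig::default()` the check passes; setting `sasl_mechanism` to `Some(SaslMechanism::Gssapi)` without a plugin makes it return an error.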

Part of the GSSAPI plugin rework superseding PR #95 by @kthimjo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds `kafka_backup_core::kafka::GssapiPlugin` behind the `gssapi`
cargo feature. The plugin is a state machine around
`libgssapi::context::ClientCtx` that implements RFC 4752 §3.1:

  Phase 1 — gss_init_sec_context rounds (Context → ContextInProgress)
  Phase 1→2 transition — empty turnaround token (AwaitingLayerProposal)
  Phase 2 — unwrap broker proposal, check 0x01 (no-security-layer) bit,
            wrap reply `0x01 0x00 0x00 0x00 | authz_id` (AwaitingFinalAck)
  Done — broker ack closes the handshake
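
The phases above can be read as a small transition table. The following is a sketch only: the state names `ContextInProgress`, `AwaitingLayerProposal`, `AwaitingFinalAck`, and `Initial` appear in this commit message, while the `Event` enum, the `Poisoned` state, and `next_state` are hypothetical names introduced here, and no real GSS calls are made.

```rust
/// Handshake states. `Poisoned` is an assumed terminal error state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Initial,
    ContextInProgress,
    AwaitingLayerProposal,
    AwaitingFinalAck,
    Done,
    Poisoned,
}

/// Hypothetical summary of what a handshake step observed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// gss_init_sec_context wants another round trip.
    ContextContinue,
    /// Context established; the empty turnaround token is sent.
    ContextEstablished,
    /// Broker proposal unwrapped, 0x01 bit present, reply wrapped.
    ProposalAccepted,
    /// Broker acknowledged the Phase 2 reply.
    BrokerAck,
}

/// Pure transition function; any out-of-order event poisons the state.
pub fn next_state(state: State, event: Event) -> State {
    use Event::*;
    use State::*;
    match (state, event) {
        (Initial | ContextInProgress, ContextContinue) => ContextInProgress,
        (Initial | ContextInProgress, ContextEstablished) => AwaitingLayerProposal,
        (AwaitingLayerProposal, ProposalAccepted) => AwaitingFinalAck,
        (AwaitingFinalAck, BrokerAck) => Done,
        _ => Poisoned,
    }
}
```

Keeping the transitions in a pure function like this makes the ordering rules testable without a KDC; the real plugin drives `ClientCtx` at each step.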

Notable design decisions:

- Interior mutability via `Arc<tokio::sync::Mutex<State>>` to bridge
  the trait's `&self` methods with `ClientCtx::step`'s `&mut`.
- Process-wide `KRB5_ENV_LOCK: tokio::sync::Mutex<()>` serialises
  `KRB5_CLIENT_KTNAME` / `KRB5_CONFIG` env-var mutation around
  `Cred::acquire`. libgssapi 0.9.1 does not expose a keytab-path
  argument, so env vars are the only route; without this lock,
  concurrent `KafkaClient`s would race. PR #95's unsynchronised
  `set_var` is the underlying issue this fixes.
- `reauth_payload` resets state to Initial and re-acquires a fresh
  credential + ClientCtx — Kerberos tickets expire and a stale
  context cannot be reused.
- Keytab existence is checked upfront in `new()` so misconfig fails
  fast at construction rather than mid-handshake.
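
As an illustration of the fail-fast keytab check in the last bullet, here is a minimal sketch. `GssapiPluginSketch`, `ConfigError`, and the constructor signature are hypothetical stand-ins; the real `GssapiPlugin::new` may take different arguments and return a different error type.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical construction error.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    KeytabNotFound(PathBuf),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::KeytabNotFound(p) => {
                write!(f, "keytab not found: {}", p.display())
            }
        }
    }
}

/// Stand-in for the plugin; holds only what this sketch needs.
#[derive(Debug)]
pub struct GssapiPluginSketch {
    pub service_name: String,
    pub keytab: Option<PathBuf>,
}

impl GssapiPluginSketch {
    /// Validates the keytab path before any GSS call is attempted.
    /// With no keytab, the OS credential cache is assumed to be used.
    pub fn new(service_name: &str, keytab: Option<&Path>) -> Result<Self, ConfigError> {
        if let Some(path) = keytab {
            if !path.is_file() {
                return Err(ConfigError::KeytabNotFound(path.to_path_buf()));
            }
        }
        Ok(Self {
            service_name: service_name.to_string(),
            keytab: keytab.map(Path::to_path_buf),
        })
    }
}
```

A missing file is reported at construction time, so a misconfigured path never reaches the handshake.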

Day-1 spike result: libgssapi 0.9.1 exposes `Cred::acquire` and
`Cred::acquire_with_password` only; no keytab-aware constructor. The
env-var mutex is the correct mitigation for OSS until upstream gains
a keytab argument.
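
A rough sketch of the env-var serialisation idea, under stated assumptions: it uses `std::sync::Mutex` rather than the `tokio::sync::Mutex` named in the commit so that it has no dependencies, the helper `with_krb5_env` is a hypothetical name, and restoring the previous values afterwards is a choice made for this sketch rather than something the commit states.

```rust
use std::env;
use std::ffi::OsString;
use std::sync::Mutex;

/// Process-wide lock guarding Kerberos env-var mutation.
/// (std Mutex here; the PR describes a tokio Mutex.)
static KRB5_ENV_LOCK: Mutex<()> = Mutex::new(());

/// Sets `vars`, runs `f` (e.g. the credential acquisition), then
/// restores the previous values, all while holding the lock.
/// If `f` panics the old values are not restored; fine for a sketch.
#[allow(unused_unsafe)] // set_var/remove_var are unsafe from edition 2024
pub fn with_krb5_env<T>(vars: &[(&str, &str)], f: impl FnOnce() -> T) -> T {
    let _guard = KRB5_ENV_LOCK.lock().unwrap_or_else(|e| e.into_inner());

    // Remember what was there before so other callers see a clean slate.
    let saved: Vec<(String, Option<OsString>)> = vars
        .iter()
        .map(|&(k, _)| (k.to_string(), env::var_os(k)))
        .collect();

    for &(k, v) in vars {
        unsafe { env::set_var(k, v) };
    }

    let out = f();

    for (k, old) in saved {
        match old {
            Some(v) => unsafe { env::set_var(&k, v) },
            None => unsafe { env::remove_var(&k) },
        }
    }
    out
}
```

The lock only helps callers that go through it; code elsewhere in the process that touches these variables directly is not protected.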

Tests (7 unit tests, feature-gated): Phase 2 proposal parser
(rejects <4 bytes, rejects missing 0x01 bit, accepts 0x01/0x07),
Phase 2 reply wire format, keytab-missing construction error,
continue_payload-before-initial poison, mechanism name. The full
gss_init_sec_context / wrap / unwrap round-trip is exercised by the
Docker E2E added in a later commit.
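
For readers who want the Phase 2 byte layout spelled out, the following sketch mirrors the behaviour the unit tests describe: reject proposals shorter than 4 bytes, reject proposals without the 0x01 bit, and build the reply `0x01 0x00 0x00 0x00 | authz_id`. The function name `parse_phase2_proposal` appears in this PR; its signature here, `ProposalError`, and `build_phase2_reply` are assumptions. Wrapping and unwrapping with GSS is out of scope.

```rust
/// Bit meaning "no security layer" in the first octet (RFC 4752 §3.1).
const LAYER_NONE: u8 = 0x01;

/// Hypothetical error type for a rejected broker proposal.
#[derive(Debug, PartialEq, Eq)]
pub enum ProposalError {
    /// Fewer than 4 octets were received (payload length attached).
    TooShort(usize),
    /// The broker did not offer the no-security-layer option.
    NoSecurityLayerNotOffered(u8),
}

/// Parses the already-unwrapped broker proposal and returns the
/// offered layer bitmask so the caller can log it.
pub fn parse_phase2_proposal(unwrapped: &[u8]) -> Result<u8, ProposalError> {
    if unwrapped.len() < 4 {
        return Err(ProposalError::TooShort(unwrapped.len()));
    }
    let mask = unwrapped[0];
    if (mask & LAYER_NONE) == 0 {
        return Err(ProposalError::NoSecurityLayerNotOffered(mask));
    }
    Ok(mask)
}

/// Builds the plaintext client reply (to be wrapped before sending):
/// layer 0x01, a zero maximum buffer size, then the authorization id.
pub fn build_phase2_reply(authz_id: &str) -> Vec<u8> {
    let mut reply = vec![LAYER_NONE, 0x00, 0x00, 0x00];
    reply.extend_from_slice(authz_id.as_bytes());
    reply
}
```

Returning the mask rather than a bare success value lets a caller distinguish a broker that offered 0x01 from one that offered 0x07.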

Part of the GSSAPI plugin rework superseding PR #95 by @kthimjo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds two new CLI helpers:
  - commands/sasl_plugin.rs: populate_sasl_plugin[_opt] installs a
    GssapiPlugin into SecurityConfig::sasl_mechanism_plugin when
    sasl_mechanism: GSSAPI is set. Gated by #[cfg(feature = "gssapi")];
    the default build surfaces an actionable rebuild error naming the
    feature and the system krb5 dev headers for each major platform.
  - commands/security_args.rs: #[derive(Args)] SecurityCliArgs with
    --security-protocol, --sasl-mechanism, --sasl-keytab,
    --sasl-krb5-config, --sasl-kerberos-service-name (plus env-var
    fallbacks). into_security_config(bootstrap_servers) assembles a
    SecurityConfig and runs populate_sasl_plugin.
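
One way the feature gate could look, as a sketch only: the name `populate_sasl_plugin` and the `#[cfg(feature = "gssapi")]` gate are from this commit, but the signature, the `&'static str` stand-in for the installed plugin, the helper `install_gssapi`, and the error wording are assumptions.

```rust
/// Feature-enabled build: pretend to construct the plugin.
#[cfg(feature = "gssapi")]
fn install_gssapi() -> Result<&'static str, String> {
    Ok("GssapiPlugin")
}

/// Default build: return an actionable rebuild hint instead.
#[cfg(not(feature = "gssapi"))]
fn install_gssapi() -> Result<&'static str, String> {
    Err(concat!(
        "SASL/GSSAPI was requested but this binary was built without it; ",
        "rebuild with `cargo build --features gssapi -p kafka-backup-cli` ",
        "(requires system krb5 development headers)"
    )
    .to_string())
}

/// Installs a plugin only when the configured mechanism is GSSAPI.
/// Returns Ok(None) for every other mechanism (or none at all).
pub fn populate_sasl_plugin(
    sasl_mechanism: Option<&str>,
) -> Result<Option<&'static str>, String> {
    match sasl_mechanism {
        Some("GSSAPI") => install_gssapi().map(Some),
        _ => Ok(None),
    }
}
```

Both builds expose the same function, so call sites need no `cfg` of their own; only the body of the helper differs.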

Config-file entry points (backup, restore, three-phase, snapshot-groups)
call populate_sasl_plugin_opt immediately after serde_yaml::from_str so
YAML sasl_mechanism: GSSAPI is wired automatically.

Offset-reset-family commands (offset-reset execute, offset-reset-bulk,
offset-rollback rollback/verify/snapshot) now flatten SecurityCliArgs
in place of the lone --security-protocol flag and consume
into_security_config. Deletes the three triplicated parse_security_config
helpers.

Adds "env" feature to the workspace clap dep so #[arg(env = "…")]
compiles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds tests/sasl-gssapi-test-infra/ — a three-container compose stack
that boots a self-contained MIT Kerberos realm (TEST.LOCAL), exports
service + client keytabs, and advertises a cp-kafka 7.7.0 broker on
SASL_PLAINTEXT://kafka.test.local:9098 with GSSAPI enabled. The KDC
runs in a self-hosted image (Dockerfile.kdc, ubuntu:22.04 + krb5-kdc)
rather than pulling an abandoned upstream. Keytab init is idempotent
and the compose healthcheck gates broker startup on the kafka keytab
existing on disk.

Rationale for compose-level decisions:
  - hostname: kafka.test.local enforces that clients connect to the
    FQDN that matches the service principal — the documented remedy
    for KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN.
  - Separate INTERNAL listener (PLAINTEXT://:9092) so inter-broker
    traffic doesn't need its own GSSAPI identity.
  - KAFKA_CONNECTIONS_MAX_REAUTH_MS=60000 keeps the reauth E2E test
    under 90s while staying comfortably above the 30s floor clamp in
    sasl/reauth.rs.
  - Only aes256/aes128-cts enctypes enabled; DES is disabled in MIT
    1.19 and would only cause salt-mismatch noise.

E2E tests (sasl_gssapi_tests.rs, #[cfg(feature = "gssapi")], #[ignore]):
  - sasl_gssapi_keytab_e2e: full handshake + post-auth metadata RPC.
  - sasl_gssapi_missing_keytab_surfaces_clear_error: construction
    validates the keytab path before any GSS call.
  - sasl_gssapi_reauth_fires_within_broker_window: holds a connection
    open for 90s with 15s metadata probes; every probe must succeed,
    proving KIP-368 reauth runs inside the broker's 60s window.
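
The shape of the reauth probe loop can be sketched as follows. This is a simplified, synchronous stand-in: `run_probes` is a hypothetical helper, and the real test presumably issues metadata RPCs against the broker rather than calling a closure. With `count = 6` and a 15s interval the loop spans 90s, which crosses the broker's 60s reauth window.

```rust
use std::thread;
use std::time::Duration;

/// Runs `count` probes spaced `interval` apart and stops at the
/// first failure, reporting which probe (1-based) failed.
pub fn run_probes<F>(count: u32, interval: Duration, mut probe: F) -> Result<(), String>
where
    F: FnMut(u32) -> Result<(), String>,
{
    for i in 0..count {
        // Wait first so the final probe lands at count * interval.
        thread::sleep(interval);
        probe(i).map_err(|e| format!("probe {} failed: {}", i + 1, e))?;
    }
    Ok(())
}
```

Requiring every probe to succeed is what ties the result to reauthentication: a connection whose session expired without a successful reauth would fail a later probe.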

Registered in integration_suite/mod.rs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…(0.15.0)

- Bump workspace version 0.14.0 -> 0.15.0 (per CI version-gate rule).
- CHANGELOG: full 0.15.0 entry covering SaslMechanism::Gssapi, GssapiPlugin,
  KRB5_ENV_LOCK env-var serialisation, CLI flag surface, Docker fixture, and
  the E2E test trio. Calls out build requirements + V1 single-broker limit.
- README: new "Optional: SASL/GSSAPI (Kerberos) support" subsection under
  Building from Source with the krb5 install commands and a pointer at
  tests/sasl-gssapi-test-infra/.
- PRD: adds section 10.a describing the in-tree feature-gated plugin,
  crate surface, handshake mapping to the trait, env-var serialisation
  rationale, and V1 operational caveats.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Ports two small improvements from PR #95:

- Build an OidSet containing GSS_MECH_KRB5 and pass it to Cred::acquire's
  desired-mechs parameter instead of None. Locks the mechanism to Kerberos 5
  rather than relying on the libgssapi default; matches the convention in
  librdkafka + the Java Kafka client.
- Pass Some(&GSS_MECH_KRB5) to ClientCtx::new for the same reason.

Plus one observability improvement adapted from PR #95: parse_phase2_proposal
now returns the observed layer mask and the caller emits it at DEBUG alongside
the authz_id when Phase 2 wrap succeeds, so a field report can distinguish
"broker offered 0x01" from "broker offered 0x07".

The thread-safe KRB5_ENV_LOCK + plugin-trait refactor + upfront keytab
validation + Docker E2E fixture + unit tests + CLI flag surface remain as they
were — those stay ours.

Co-authored-by: Krist Thimjo <krist.thimjo@intesasanpaolo.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Four fixes for problems uncovered while running the Docker GSSAPI E2E
tests locally for the first time on this machine:

1. `GssapiPlugin` now sets `KRB5CCNAME=MEMORY:<ptr>` whenever a keytab
   is configured, so the plugin never reads or writes the OS default
   credential cache. Without this, stale tickets from a prior `kinit`
   (common on macOS `API:<uuid>` caches that persist across logins) are
   preferred over a fresh TGT from the keytab. The broker then rejects
   the AP-REQ with "invalid credentials" because the ticket was issued
   by an older KDC instance whose service key has since rotated. The
   env-var write is already under `KRB5_ENV_LOCK`.

2. Fixture KDC publishes on host port 88 (was 48088). The shared
   `krb5.conf` references `kdc.test.local:88`; the previous mapping
   broke host-side clients that followed the file-as-written. Port 88
   is the Kerberos default and is unbound on macOS / most Linux dev
   boxes by default.

3. Fixture broker config swaps JVM-level `-Djava.security.auth.login.config`
   for listener-scoped `KAFKA_LISTENER_NAME_SASL_GSSAPI_SASL_JAAS_CONFIG`.
   The JVM-level form makes cp-kafka's preflight ZK client try SASL
   against an unauthenticated ZooKeeper, which hangs `cub zk-ready`.
   Matches the OAUTH fixture's earlier fix (c4d7e59).

4. `init-kdc.sh` removes stale host-mounted keytabs before `ktadd`.
   `docker compose down -v` wipes the in-container KDC principal DB
   but not the `./keytabs` bind mount. Without this cleanup a second
   `up` leaves the old keytab in place, its keys no longer match the
   re-created principals, and authentication fails.
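
To make fix 1 concrete, here is a sketch of deriving a per-plugin `MEMORY:` ccache name from a pointer. `PluginState` and `ccache_name` are hypothetical names; which pointer the real plugin formats is not stated in this commit, so the use of the `Arc` allocation address is an assumption.

```rust
use std::sync::Arc;

/// Hypothetical per-plugin state.
pub struct PluginState {
    pub keytab_path: Option<String>,
}

/// Returns a ccache name unique to this plugin instance, or None
/// when no keytab is configured (the OS default cache is then used).
pub fn ccache_name(state: &Arc<PluginState>) -> Option<String> {
    // Only isolate the cache when credentials come from a keytab.
    state.keytab_path.as_ref()?;
    // The heap address is unique among live instances in this process.
    Some(format!("MEMORY:{:p}", Arc::as_ptr(state)))
}
```

The resulting string would be written to `KRB5CCNAME` while holding `KRB5_ENV_LOCK`, so each plugin instance acquires its TGT into its own in-memory cache.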

Also drops the unused `kafka_server_jaas.conf` (superseded by inline
JAAS in the compose file) and extends README troubleshooting with the
four failure-mode descriptions encountered during this session.

Verified by:
  - 3/3 `sasl_gssapi_*` E2E tests pass, including the 90s reauth probe.
  - OAUTH E2E stays green (1/1).
  - CLI smoke test (`offset-rollback snapshot` with GSSAPI) completes
    the full handshake against the live broker, trace shows
    `GSSAPI Phase 2 complete server_layers=0x01`.
  - Full CI gate green: fmt, clippy (default + all-features),
    206 unit tests (--all-features).

The new `gssapi` cargo feature (0.15.0) pulls `libgssapi-sys`, whose
build.rs generates bindings against `gssapi.h` at compile time. GitHub's
default Ubuntu runners ship only the runtime `libgssapi_krb5.so` and omit
the development headers, so every job that runs `cargo <...> --all-features`
now fails with `'gssapi.h' file not found`.

Fix: install `libkrb5-dev` before cargo runs in the affected jobs:
- test.yml: check, unit-tests, integration-tests, chaos-tests
- semver-check.yml: Detect Breaking Changes
- release.yml: Pre-release Tests

Release binaries continue to exclude `gssapi` (cargo-dist uses default
features), so `build-local-artifacts`, `publish-crates-io`, and the Docker
publish path don't need krb5 headers.