Bug2020962 sessions#3438
Conversation
Introduce the `out_of_session` flag on `CommonMetricData` (defaulting to `true`) so metrics can declare whether they participate in session-level sampling. Add session configuration fields (`session_mode`, `session_sample_rate`, `session_inactivity_timeout_ms`) to `InternalConfiguration` and the RLB `Configuration` struct, and add `session_sample_rate` to `RemoteSettingsConfig` for server-side overrides. Update all existing test, benchmark, and example call sites to supply the new configuration fields.
Implement the session management engine for Glean:
- New `session` module with `SessionManager`, `SessionMode` (Auto,
Lifecycle, Manual), `SessionState`, `SessionMetadata`, and
`EventSessionContext` types, plus persistence helpers for session_id,
session_seq, event_seq, inactive_since, and session_start_time.
- Core lifecycle integration in `Glean`: session_start/session_end,
handle_client_active/inactive transitions, inactivity timeout
evaluation, dirty-flag crash recovery emitting synthetic
session_end("abnormal"/"abnormal_inactive"), orphaned session cleanup
on mode-mismatch across builds (session_end("abandoned")), and
session state restoration across clean restarts in AUTO mode.
- Boundary events: `glean.session_start` and `glean.session_end` defined
in metrics.yaml and emitted as out-of-session events carrying
session_id, session_seq, session_start_time, and sampled_in.
- Session sampling gate in `MetricType::should_record()` suppresses
in-session metrics when the active session is sampled out, applying
uniformly to all metric types.
- Event-session integration: `EventSessionContext` flows through
`event_database::record()` so in-session events carry per-event
session metadata (session_id, session_seq, event_seq,
session_sample_rate, session_start_time).
- Remote Settings override for session_sample_rate is intentionally
in-memory only; RS refreshes on every startup so persistence would
risk making stale values sticky.
- session_start_time uses millisecond-precision RFC 3339 serialization
for consistency with millisecond event timestamps.
Add session lifecycle methods to the UniFFI interface definition and wire them through all language bindings: - UDL: export SessionMode enum, session_start(), session_end(), handle_client_active/inactive session hooks, and apply_server_knobs_config session_sample_rate support. - Android/Kotlin: forward handle_client_active/inactive calls through the Glean singleton; update test configuration for session fields. - iOS/Swift: forward handle_client_active/inactive calls; update test helpers and assertions for session-aware event payloads. - Python: forward handle_client_active/inactive calls; update _RecordingUploader to accept a list of filter strings (needed because session lifecycle now submits additional pings on inactive); update test configuration. - RLB: expose session_start/session_end and configuration fields on the public Configuration struct.
Comprehensive test coverage for session management:
- tests/session.rs (30 tests): Auto/Lifecycle/Manual mode basics,
session_seq persistence and monotonic increase across restarts,
event_seq increment and reset semantics, event_seq continuity across
AUTO mode restarts, session metadata on in-session events,
out-of-session event bypass, sampling gate enforcement, boundary
event emission, inactivity timeout evaluation, session resumption
vs. new session on restart, session_start_time persistence,
sampled-out session resumption, sessions_seen diagnostic counter,
and mode-mismatch orphaned session cleanup.
- lib_unit_tests.rs (4 tests): dirty-flag crash recovery emitting
session_end("abnormal") for active crashes and
session_end("abnormal_inactive") for mid-shutdown crashes, no-op
when no session is persisted, and session_seq continuity after
crash recovery.
7471f21 to
00f8341
Compare
badboy
left a comment
There was a problem hiding this comment.
I gave this a first somewhat-thorough review and left lots of comments.
High-level I think this is a good approach, some clarifications in the details needed maybe.
I have not yet gone through all the new test; I'd like to do that before I'm out tomorrow.
What's currently missing though is docs, both dev docs about all the new state we're storing (and the state changes! oh is it time for a new diagram?), and user-facing docs.
Fine if they land in followups, but should be there before we get this out to applications.
| @@ -83,6 +83,9 @@ pub fn metric_dispatcher_benchmark(c: &mut Criterion) { | |||
| ping_schedule: Default::default(), | |||
There was a problem hiding this comment.
Adding a note to the very first line of the diff for the first commit, because one still can't leave a comment for the whole commit (or the message).
Thanks for splitting it up into individual commits. That's certainly helpful, though unfortunately this commit is not working on its own, introducing use of the SessionMode before it's defined.
I think I can cope with it in review though, just comments might be a bit out of order then.
| ping_lifetime_max_time: 0, | ||
| session_mode: glean_core::SessionMode::Auto, | ||
| session_sample_rate: 1.0, | ||
| session_inactivity_timeout_ms: 1_800_000, |
There was a problem hiding this comment.
gnah, I really wish we could make this a Duration. But eh, it's the UDL-exposed type, we stick with integers for now.
| /// Note: the naming is an inversion of the spec's `in_session` field. | ||
| /// `out_of_session: false` ↔ `in_session: true`. | ||
| #[serde(default = "default_out_of_session")] | ||
| pub out_of_session: bool, |
There was a problem hiding this comment.
Why the inversion?
The spec says in_session and that's also what ends up in the metrics.yaml in the end.
It seems confusing to me to use the inverted one here then.
Event metrics that participate in session tracking must explicitly set this to
false
Setting something to false to participate/opt-in seems also the wrong way around in my mind.
| // Non-event metrics are out-of-session by default per the spec: | ||
| // "By default, all non-event metrics will be in_session: false." | ||
| // Event metrics must explicitly set `out_of_session: false`. | ||
| out_of_session: true, |
There was a problem hiding this comment.
Using in_session instead would also make Default derivable again because the default for bool is false
| /// Server knobs configuration received from remote settings. | ||
| pub server_knobs_config: ObjectMetric, | ||
|
|
||
| /// The total number of sessions started, regardless of sampling outcome. |
There was a problem hiding this comment.
started per what? (looks like this is ping lifetime on the health ping, so ... sessions started between two full starts of the application [I think that's the only time we sent the health ping? though the health ping's lifecycle is not as strict, we could at any point change it, making this counter even more ambiguous]
| /// Returns `None` if no session ID is stored or if it was cleared. | ||
| pub(crate) fn read_session_id(glean: &Glean) -> Option<String> { | ||
| let id = make_session_id_metric().get_value(glean, INTERNAL_STORAGE)?; | ||
| if id.is_empty() { |
There was a problem hiding this comment.
Under which circumstances would it be the empty string?
|
|
||
| /// Clears the persisted session start timestamp. | ||
| pub(crate) fn clear_session_start_time(glean: &Glean) { | ||
| make_session_start_time_metric().set_sync(glean, ""); |
| # It's an empty ping. | ||
| assert "metrics" not in payload | ||
| # No duration is recorded on an active ping. | ||
| assert "timespan" not in payload.get("metrics", {}) |
| "abnormal", | ||
| extra.get("reason").expect("reason missing from synthetic session_end"), |
There was a problem hiding this comment.
You should be able to simplify this into a single assertion like:
| "abnormal", | |
| extra.get("reason").expect("reason missing from synthetic session_end"), | |
| Some("abnormal"), | |
| extra.get("reason"), |
Saves us one long line and expect, gives the same good error message if it's not right.
There was a problem hiding this comment.
Ok, I tried and that won't work as-is because expected `Option<&str>`, found `Option<&String>`
Maybe we keep that expect
| "synthetic session_end must carry the crashed session's session_id" | ||
| ); | ||
|
|
||
| // After recovery, session state must be fully cleared. |
There was a problem hiding this comment.
Heh, I tend to also add these comments, but to be honest, the assert message has the same info. We could remove the comment (up to you though)
I attempted to structure the commits here to make this easier to review, so I recommend reviewing it that way rather than as a whole.