Leader must reject reports with duplicate extensions#3999
Conversation
Note there's an error response inconsistency between Helper and Leader -- Leader returns ReportRejectionReason::DuplicateExtension while Helper returns ReportError::InvalidMessage (line 235 in aggregation_job_init.rs). Also note that this adds the telemetry to TaskUploadCounter, but we already have -- completely unused -- DuplicateExtension in TaskAggregationCounter. Maybe this is putting the rejection in the wrong place for the Leader? Fixes #3997
| use reqwest::Client; | ||
| use std::{ | ||
| borrow::Cow, | ||
| collections::HashSet, |
There was a problem hiding this comment.
I was tempted to do this one as a HashSet since it's already imported here, and switch the Helper's implementation over to HashSet.
| ); | ||
| metrics | ||
| .upload_decode_failure_counter | ||
| .add(1, &[KeyValue::new("type", "duplicate_extension")]); |
There was a problem hiding this comment.
We could add a upload_decode_failure_counter.add(0, &[Keyvalue::new("type", "duplicate_extension")]); in Aggregator::new() to eagerly populate this time series at startup.
tgeoghegan
left a comment
There was a problem hiding this comment.
Note there's an error response inconsistency between Helper and Leader -- Leader returns ReportRejectionReason::DuplicateExtension while Helper returns ReportError::InvalidMessage (line 235 in aggregation_job_init.rs).
ReportRejectionReason::DuplicateExtension doesn't seem right to me. The pertinent text from draft-ietf-ppm-dap-15:
If the same extension type appears more than once among the public extensions and the private extensions in the Leader's input share, then the Leader MUST ignore the report and MAY abort with error invalidMessage.
(See discussion of that MAY below) It seems fine for us to track this rejection reason internally to Janus (in particular so we can increment a counter) but what error does the client get? I'm not sure if the test you added lets us examine the HTTP response code and problem document a client would get. FWIW I think it'd be fine to add a test proving that ReportRejectionReason::DuplicateExtension gets rendered into the appropriate HTTP response. We don't necessarily need to rewrite the test to exercise the HTTP layer.
Also note that this adds the telemetry to TaskUploadCounter, but we already have -- completely unused -- DuplicateExtension in TaskAggregationCounter.
Maybe this is putting the rejection in the wrong place for the Leader?
The MAYs in the various upload path checks are intended to allow leaders to defer checks from the upload path (which is high-volume and keeps a client waiting), but Janus opts to do these checks as soon as possible. Or rather, we chose to do it in the more obvious place and figured we'd move it to the aggregation job path if it ever proved to be a performance problem. It never has -- what has caused us perf issues during upload is writing reports to the database (IO), not validating incoming messages (compute).
Anyway, we need that other country so that the helper can record such occurrences, since it never gets to see reports until it handles aggregation jobs. Admittedly the scenario of a helper getting duplicate extensions is quite unlikely -- you'd need a client to have correctly constructed the leader report share but mangled the helper's -- but such a bug being improbably makes tools for detecting it that much more valuable, I think.
| report_too_early BIGINT NOT NULL DEFAULT 0, -- reports whose timestamp is too far in the future. | ||
| task_not_started BIGINT NOT NULL DEFAULT 0, -- reports sent to the task before it started. | ||
| task_ended BIGINT NOT NULL DEFAULT 0, -- reports sent to the task after it ended. | ||
| duplicate_extension BIGINT NOT NULL DEFAULT 0, -- reports with duplicate extensions |
There was a problem hiding this comment.
divviup-api will need matching changes: divviup/divviup-api#1884
As it's a |
This is the counterpart to divviup/janus#3999 Fixes #1884
Note there's an error response inconsistency between Helper and Leader -- Leader returns ReportRejectionReason::DuplicateExtension while Helper returns ReportError::InvalidMessage (line 235 in aggregation_job_init.rs).
Also note that this adds the telemetry to TaskUploadCounter, but we already have -- completely unused -- DuplicateExtension in TaskAggregationCounter.
Maybe this is putting the rejection in the wrong place for the Leader?
Fixes #3997