Speed up & deflake tests by disabling bundling#1689
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: evankanderson The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/assign @vagababov since I know this was driving him crazy -- any additional thoughts on tests? $ go test ./metrics -count 40 -timeout 900s 2>&1 > /tmp/test-log
$ cat /tmp/test-log
ok knative.dev/pkg/metrics 208.145s(I didn't push too much past |
|
Fixing th e races now, forgot to run with |
c40b6ae to
263b86c
Compare
|
Note that at head, I've noticed that Prow test runs seem to be much slower than my local machine, so it's possible the timeouts are due to the slower processing time on the Prow cluster. |
|
/retest In hopes of getting another Prow run on this. |
| records = append(records, extracted) | ||
| if strings.HasPrefix(ts.Metric.Type, "knative.dev/") { | ||
| if diff := cmp.Diff(ts.Resource.Type, metricskey.ResourceTypeKnativeRevision); diff != "" { | ||
| t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff) |
There was a problem hiding this comment.
| t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff) | |
| t.Errorf("Incorrect resource type for %q: (-want +got):\n%s", ts.Metric.Type, diff) |
| sdFake.srv.GracefulStop() | ||
| break loop | ||
| } | ||
| case <-time.After(5 * time.Second): |
There was a problem hiding this comment.
Should those timeouts be consistent? it's 4s above.
There was a problem hiding this comment.
There used to be an extra delay on the Stackdriver outputs, but they should be similar now.
|
|
||
| // A variable for testing to reduce the size (number of metrics) buffered before | ||
| // Stackdriver will send a bundled metric report. Only applies if non-zero. | ||
| TestOverrideBundleCount = 0 |
There was a problem hiding this comment.
why is this a public variable?
There was a problem hiding this comment.
It seemed like e2e tests in Serving or Eventing might want to be able to disable the batching for the same reasons that the metrics tests would want to do so.
I'd like to add tests for important exports to the Serving and Eventing repos once these tests are no longer flaky.
There was a problem hiding this comment.
Sure then let's specify that, since just looking at the comment I see no reason why it should be public.
Also let's do this in a linter happy format TestOverrideBundleCount is a variable...
|
Updated! |
|
The following is the coverage report on the affected files.
|
| @@ -225,9 +226,10 @@ func sortMetrics() cmp.Option { | |||
|
|
|||
| // Begin table tests for exporters | |||
There was a problem hiding this comment.
| // Begin table tests for exporters |
| break loop | ||
| } | ||
| case <-time.After(4 * time.Second): | ||
| t.Error("Timeout reading input") |
There was a problem hiding this comment.
I wonder if fatal is more appropriate here?
| select { | ||
| case record := <-ocFake.published: | ||
| if record == nil { | ||
| continue loop |
There was a problem hiding this comment.
I guess it's not important here?
| continue loop | |
| continue |
There was a problem hiding this comment.
Actually, you need the label, otherwise the continue applies to the select, rather than to the loop.
There was a problem hiding this comment.
No, if it's a lambda you return from the func rather than continue. But since we're keeping the label, it doesn't matter.
| labels := map[string]string{} | ||
| if record.Resource != nil { | ||
| labels = record.Resource.Labels | ||
| loop: |
There was a problem hiding this comment.
Stylistically I'd prefer a lambda with return rather than goto label, but up to you. :)
There was a problem hiding this comment.
That would look like:
for {
breakErr := error.New("intentional break")
err := func() error {
select {
case record := <-ocFake.published:
if record == nil {
return nil
}
// ...
if len(records) >= len(expected) {
return breakErr
}
case <-time.After(5 * time.Second):
return errors.New("timeout")
}
}()
if err != breakErr {
t.Fatal("Unable to fetch events", err)
}
if err == breakErr {
break
}
}
There was a problem hiding this comment.
I think with the label is better, stylistically.
| Value: ts.Points[0].Value.GetInt64Value(), | ||
| } | ||
| // Override 'cluster-name' label to reset to a fixed value | ||
| if extracted.Labels["cluster_name"] != "" { |
There was a problem hiding this comment.
Should it not be set if empty?
There was a problem hiding this comment.
This is using Golang defaults, so mapVal["doesNotExist"] will return the default value from the map map[string]string, which is "".
There was a problem hiding this comment.
Strictly theoretically there may be the value in the map with "" value. So if this happens due to a bug it won't be caught. But 🤷 .
|
/lgtm |
Fixes #1672
It turns out that both Stackdriver and OpenCensus have rpc Bundlers that hold reports until at least N have been accumulated or for 1-2s. I had to send an upstream PR to OpenCensus to expose the bundler options in order to speed up reporting, which accounts for the
go.modupdate.It also turns out that gRPC will take (on my machine, going through ??? WSL2 -> Windows DNS -> internet) about 1s to attempt to resolve
external-svc, so changing that to a name that will resolve sped up theconfig_testby a second or so. It also turns out that gRPC takes an additional 1-2 seconds to set up "fail-fast" channels on Windows but not on a Linux kernel -- caveat testor if you're attempting to debug timing on Windows.I think this has been de-flaked from a timeout perspective; on my machine (i7 mobile processor, plugged in, running WSL2), the entire suite takes about 5s, compared with 10s before these changes.