Autoscaler metrics and performance investigation guide #789

Merged
mdemirhan merged 7 commits into knative:master from mdemirhan:perfdash
May 1, 2018

Conversation

@mdemirhan
Contributor

Fixes #578 and #493

Proposed Changes

  • Added a document titled "Investigating Performance Issues" - this document will guide users through debugging application performance issues and will show how they can use the observability features offered by Elafros to identify such issues.
  • Added metrics and a dashboard for the autoscaler component and documented how they are used in the guide above.

@google-prow-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mdemirhan
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: vaikas-google

Assign the PR to them by writing /assign @vaikas-google in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 30, 2018
@mdemirhan changed the title from "Perfdash" to "Autoscaler metrics and performance investigation guide" on Apr 30, 2018

elaRevision = os.Getenv("ELA_REVISION")
if elaDeployment == "" {
if elaRevision == "" {

Contributor

Thanks!
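
For readers following this thread, the diff lines under review read the revision name from the `ELA_REVISION` environment variable and check whether it is empty. A minimal sketch of that pattern follows. The function name `revisionFromEnv`, the injected `getenv` parameter, and the choice to treat an empty value as a fatal error are assumptions for illustration, not the code in this PR.

```go
package main

import (
	"fmt"
	"os"
)

// revisionFromEnv reads the revision name through the supplied lookup
// function (os.Getenv in production). It returns an error when the
// value is unset or empty. Hypothetical helper, not part of the PR.
func revisionFromEnv(getenv func(string) string) (string, error) {
	rev := getenv("ELA_REVISION")
	if rev == "" {
		return "", fmt.Errorf("environment variable ELA_REVISION is not set")
	}
	return rev, nil
}

func main() {
	rev, err := revisionFromEnv(os.Getenv)
	if err != nil {
		// Assumption: a missing revision name is treated as fatal.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("revision:", rev)
}
```

Passing the lookup function in as a parameter keeps the check testable without changing the process environment.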


This dashboard gives visibility into:
* Request volume per revision
* Request volume per HTTP response code

Contributor

What about "Request volume per revision and HTTP response code"? I would expect all the dashboards to be scoped to a single revision. Or at least be able to.

Contributor Author

Yes, that is correct. I will reword that part to clarify.

* Request and response sizes per revision

This dashboard can show traffic volume or latency discrepancies between different revisions.
If, for example, a revision's latency is higher than others',

Contributor

Nit: "...is higher than other revisions, then focus..."

Contributor Author

Fixed.
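
The comparison this thread discusses (spotting a revision whose latency is higher than that of the other revisions) can also be done outside the dashboard. Here is a rough sketch, assuming per-revision latencies are already available as plain numbers. The function name `slowRevisions`, the revision names, the latency values, and the "more than `factor` times the median" rule are all hypothetical and not taken from the guide.

```go
package main

import (
	"fmt"
	"sort"
)

// slowRevisions returns the names of revisions whose latency exceeds
// factor times the median latency across all revisions, sorted by name.
// Hypothetical helper for illustration only.
func slowRevisions(latencyMs map[string]float64, factor float64) []string {
	if len(latencyMs) == 0 {
		return nil
	}
	vals := make([]float64, 0, len(latencyMs))
	for _, v := range latencyMs {
		vals = append(vals, v)
	}
	sort.Float64s(vals)

	// Median of the sorted latencies.
	mid := len(vals) / 2
	median := vals[mid]
	if len(vals)%2 == 0 {
		median = (vals[mid-1] + vals[mid]) / 2
	}

	var slow []string
	for name, v := range latencyMs {
		if v > factor*median {
			slow = append(slow, name)
		}
	}
	sort.Strings(slow)
	return slow
}

func main() {
	// Made-up per-revision latencies in milliseconds.
	latency := map[string]float64{"rev-a": 120, "rev-b": 130, "rev-c": 410}
	fmt.Println(slowRevisions(latency, 2.0)) // prints [rev-c]
}
```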

## Auto scaler metrics
If request metrics or traces do not show any obvious hot spots, or if they show
that most of the time is spent in your own code, auto scaler metrics should be
looked next. To open auto scaler dashboard, open Grafana UI and select

Contributor

Nit: "autoscaler" (no space)

Contributor Author

Will fix all. Thanks!

Grafana URL is taking the most time and investigation should focus on why
that URL is taking that long.

## Auto scaler metrics

Contributor

You should mention the panic metric I think.

Contributor Author

Will do.

@josephburnett
Contributor

/lgtm

@google-prow-robot added the lgtm label (indicates that a PR is ready to be merged) on May 1, 2018
@google-prow-robot removed the lgtm label (indicates that a PR is ready to be merged) on May 1, 2018
Contributor

@josephburnett left a comment

/lgtm

@google-prow-robot added the lgtm label (indicates that a PR is ready to be merged) on May 1, 2018
@mdemirhan
Contributor Author

/retest

@mdemirhan
Contributor Author

/assign @tcnghia @mattmoor

Need an LGTM from the code owners.

@mdemirhan merged commit d264bb7 into knative:master on May 1, 2018
@mdemirhan deleted the perfdash branch on May 1, 2018 22:01
