Support prometheus metrics#73
Conversation
|
Hi @7ing. Thanks for your PR. I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
munnerz
left a comment
There was a problem hiding this comment.
This looks great, thanks Jing 🙌 the integration and unit tests here make it far easier to review confidently!
My main questions/concerns are around the construction logic in the metrics subpackage, which I think we need to decouple from net.Listener (and allow more flexibility for projects that already have their own prometheus.Registry they'd like to re-use).
|
@munnerz Thank you for your valuable inputs. Sorry took so long to make the change. But I guess this version addressed most of your concerns. |
|
@7ing We have made some major dependency upgrades in this module. Are you able to rebase your PR, preparing for another round of review? Sorry for the inconvenience and for the delays in review. 😒 |
|
/retest |
|
@erikgb done with rebase |
erikgb
left a comment
There was a problem hiding this comment.
This is great stuff! Thanks @7ing! I did my first pass on this PR now, and I think I would prefer using a Prometheus collector to avoid the add/remove metrics for metrics based on API resources. Please take a look and let me know what you think! It's not a blocker, but a pattern we are adopting in cert-manager projects nowadays.
|
Updated the certificate requests' metrics implementation with collector pattern, to align with cert-manager.io project. Basically the collector will periodically scan the certificaterequest sharedinformer (for expiry) and host metadata files (for renewal time), then record the metrics. |
There was a problem hiding this comment.
Pull Request Overview
This PR adds Prometheus metrics support to the CSI library to expose operational metrics about certificate management. The implementation includes metrics for certificate request expiration timestamps, ready status, renewal times, driver issue call counts, and managed volume/certificate counts.
Key changes:
- Added comprehensive metrics collection system with Prometheus integration
- Implemented certificate request collector for tracking certificate lifecycle metrics
- Added metrics tracking to the manager for monitoring issue calls and errors
- Created extensive test coverage for metrics functionality
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/util/testutil.go | Exported function to support metrics testing |
| test/integration/metrics_test.go | Added comprehensive integration test for metrics server functionality |
| test/driver/driver_testing.go | Added metrics support to test driver infrastructure |
| metrics/metrics.go | Core metrics implementation with Prometheus registry and HTTP handler |
| metrics/certificaterequest_test.go | Unit tests for certificate request metrics collection |
| metrics/certificaterequest_collector.go | Prometheus collector for certificate request lifecycle metrics |
| manager/manager.go | Integrated metrics tracking into certificate issuance operations |
| go.mod | Added prometheus client dependency |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
This looks really great, @7ing! Since the PR is large, I will need some time to review it thoroughly. I've already asked @hjoshi123 for some help in review. I would also appreciate it if @munnerz could take another look. Most maintainers from CyberArk are busy with other stuff right now, so our capacity to review PRs is rather low. |
Signed-off-by: Jing Liu <jingliu@apple.com>
Co-authored-by: Erik Godding Boye <egboye@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jing <7ing@users.noreply.github.com>
erikgb
left a comment
There was a problem hiding this comment.
/lgtm
/cc @wallrj
The code looks good to me. @wallrj has volunteered to test this in csi-driver through cert-manager/csi-driver#502.
wallrj-cyberark
left a comment
There was a problem hiding this comment.
I tested it and left a few questions.
Please answer or address those and unhold if you're happy with the observed behaviour.
/hold until you've addressed my questions.
/approve
| managedVolumeCountTotal = prometheus.NewDesc("certmanager_csi_managed_volume_count_total", "The total number of managed volumes by the csi driver.", []string{"node"}, nil) | ||
| managedCertRequestCountTotal = prometheus.NewDesc("certmanager_csi_managed_certificate_request_count_total", "The total number of managed certificate requests by the csi driver.", []string{"node"}, nil) |
There was a problem hiding this comment.
I deployed the driver, built from cert-manager/csi-driver#502 and observed these new metrics, before creating any Pods to use the driver:
$ kubectl get --raw "/api/v1/namespaces/cert-manager/pods/${POD_NAME}:9402/proxy/metrics" | grep certmanager
# HELP certmanager_csi_managed_certificate_request_count_total The total number of managed certificate requests by the csi driver.
# TYPE certmanager_csi_managed_certificate_request_count_total counter
certmanager_csi_managed_certificate_request_count_total{node="54bb9576d6"} 0
# HELP certmanager_csi_managed_volume_count_total The total number of managed volumes by the csi driver.
# TYPE certmanager_csi_managed_volume_count_total counter
certmanager_csi_managed_volume_count_total{node="54bb9576d6"} 0
| Help: "The number of errors encountered during the driver issue() calls.", | ||
| }, | ||
| []string{"node", "volume"}, | ||
| ) |
There was a problem hiding this comment.
I created the test resources, from the docs, BEFORE creating the associated issuer, from the docs and observed the error metric incremented:
# HELP certmanager_csi_certificate_request_expiration_timestamp_seconds The timestamp after which the certificate request expires, expressed in Unix Epoch Time.
# TYPE certmanager_csi_certificate_request_expiration_timestamp_seconds gauge
certmanager_csi_certificate_request_expiration_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",names
pace="sandbox"} 0
# HELP certmanager_csi_certificate_request_ready_status The ready status of the certificate request.
# TYPE certmanager_csi_certificate_request_ready_status gauge
certmanager_csi_certificate_request_ready_status{condition="False",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",nam
espace="sandbox"} 1
certmanager_csi_certificate_request_ready_status{condition="True",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",name
space="sandbox"} 0
certmanager_csi_certificate_request_ready_status{condition="Unknown",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",n
amespace="sandbox"} 0
# HELP certmanager_csi_certificate_request_renewal_timestamp_seconds The timestamp after which the certificate request should be renewed, expressed in Unix Epoch Time.
# TYPE certmanager_csi_certificate_request_renewal_timestamp_seconds gauge
certmanager_csi_certificate_request_renewal_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespac
e="sandbox"} 0
# HELP certmanager_csi_driver_issue_call_count_total The number of issue() calls made by the driver.
# TYPE certmanager_csi_driver_issue_call_count_total counter
certmanager_csi_driver_issue_call_count_total{node="54bb9576d6",volume="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"} 1
# HELP certmanager_csi_managed_certificate_request_count_total The total number of managed certificate requests by the csi driver.
# TYPE certmanager_csi_managed_certificate_request_count_total counter
certmanager_csi_managed_certificate_request_count_total{node="54bb9576d6"} 1
# HELP certmanager_csi_managed_volume_count_total The total number of managed volumes by the csi driver.
# TYPE certmanager_csi_managed_volume_count_total counter
certmanager_csi_managed_volume_count_total{node="54bb9576d6"} 0
The certmanager_csi_managed_volume_count_total said 0 volumes, but the logs reported that a volume had been registered. Is that expected?
I1209 17:48:59.325407 1 nodeserver.go:87] "Registered new volume with storage backend" logger="driver" pod_name="my-csi-app"
I1209 17:48:59.325685 1 manager.go:396] "Processing issuance" logger="manager" volume_id="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"
I1209 17:48:59.372729 1 manager.go:461] "Created new CertificateRequest resource" logger="manager" volume_id="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"
E1209 17:49:59.312797 1 server.go:109] "failed processing request" err="waiting for request: request \"d7b6e2ae-9b49-44d0-97af-d29aab494ba7\" is pending: Referenced \"Issuer\" not found: issuer.cert-manager.io \"ca-issuer\" not found" logger="driver" rpc_method="/csi.v1.Node/NodePublishVolume" request="{\"readonly\":true,\"target_path\":\"/var/lib/kubelet/pods/4cba3203-ccc2-409d-bd55-035a08e0e98b/volumes/kubernetes.io~csi/tls/mount\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_WRITER\"},\"mount\":{}},\"volume_context\":{\"csi.cert-manager.io/dns-names\":\"${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local\",\"csi.cert-manager.io/issuer-kind\":\"Issuer\",\"csi.cert-manager.io/issuer-name\":\"ca-issuer\",\"csi.storage.k8s.io/ephemeral\":\"true\",\"csi.storage.k8s.io/pod.name\":\"my-csi-app\",\"csi.storage.k8s.io/pod.namespace\":\"sandbox\",\"csi.storage.k8s.io/pod.uid\":\"4cba3203-ccc2-409d-bd55-035a08e0e98b\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\"},\"volume_id\":\"csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2\"}"
I1209 17:49:59.874101 1 nodeserver.go:87] "Registered new volume with storage backend" logger="driver" pod_name="my-csi-app"
I1209 17:49:59.875575 1 manager.go:396] "Processing issuance" logger="manager" volume_id="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"
E1209 17:50:59.881999 1 server.go:109] "failed processing request" err="waiting for request: request \"d7b6e2ae-9b49-44d0-97af-d29aab494ba7\" is pending: Referenced \"Issuer\" not found: issuer.cert-manager.io \"ca-issuer\" not found" logger="driver" rpc_method="/csi.v1.Node/NodePublishVolume" request="{\"readonly\":true,\"target_path\":\"/var/lib/kubelet/pods/4cba3203-ccc2-409d-bd55-035a08e0e98b/volumes/kubernetes.io~csi/tls/mount\",\"volume_capability\":{\"access_mode\":{\"mode\":\"SINGLE_NODE_WRITER\"},\"mount\":{}},\"volume_context\":{\"csi.cert-manager.io/dns-names\":\"${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local\",\"csi.cert-manager.io/issuer-kind\":\"Issuer\",\"csi.cert-manager.io/issuer-name\":\"ca-issuer\",\"csi.storage.k8s.io/ephemeral\":\"true\",\"csi.storage.k8s.io/pod.name\":\"my-csi-app\",\"csi.storage.k8s.io/pod.namespace\":\"sandbox\",\"csi.storage.k8s.io/pod.uid\":\"4cba3203-ccc2-409d-bd55-035a08e0e98b\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\"},\"volume_id\":\"csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2\"}"
I1209 17:51:01.006739 1 nodeserver.go:87] "Registered new volume with storage backend" logger="driver" pod_name="my-csi-app"
I1209 17:51:01.007089 1 manager.go:396] "Processing issuance" logger="manager" volume_id="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"
There was a problem hiding this comment.
I then created the ca-issuer, from the docs, and observed that the metrics were updated:
certmanager_csi_certificate_request_expiration_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespace="sandbox"} 1.773078929e+09
certmanager_csi_certificate_request_ready_status{condition="False",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespace="sandbox"} 0
certmanager_csi_certificate_request_ready_status{condition="True",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespace="sandbox"} 1
certmanager_csi_certificate_request_ready_status{condition="Unknown",issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespace="sandbox"} 0
certmanager_csi_certificate_request_renewal_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="ca-issuer",name="d7b6e2ae-9b49-44d0-97af-d29aab494ba7",namespace="sandbox"} 1.770486929e+09
certmanager_csi_driver_issue_call_count_total{node="54bb9576d6",volume="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"} 7
certmanager_csi_driver_issue_error_count_total{node="54bb9576d6",volume="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"} 6
certmanager_csi_managed_certificate_request_count_total{node="54bb9576d6"} 1
certmanager_csi_managed_volume_count_total{node="54bb9576d6"} 1
The logs showed success:
I1209 17:55:31.268580 1 nodeserver.go:87] "Registered new volume with storage backend" logger="driver" pod_name="my-csi-app"
I1209 17:55:31.269595 1 manager.go:396] "Processing issuance" logger="manager" volume_id="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"
I1209 17:55:31.274292 1 nodeserver.go:104] "Volume registered for management" logger="driver" pod_name="my-csi-app"
I1209 17:55:31.274354 1 nodeserver.go:117] "Ensuring data directory for volume is mounted into pod..." logger="driver" pod_name="my-csi-app"
I1209 17:55:31.278790 1 nodeserver.go:136] "Bind mounting data directory to the pod's mount namespace" logger="driver" pod_name="my-csi-app"
I1209 17:55:31.299293 1 nodeserver.go:142] "Volume successfully provisioned and mounted" logger="driver" pod_name="my-csi-app"
There was a problem hiding this comment.
I deleted the Pod and observed that the certmanager_csi_managed_certificate_request_count_total returned to 0, is that expected? The help text "The total number of managed certificate requests by the csi driver" lead me to expect the number to represent the total number of CertificateRequests created since start up.
certmanager_csi_driver_issue_call_count_total{node="54bb9576d6",volume="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"} 7
certmanager_csi_driver_issue_error_count_total{node="54bb9576d6",volume="csi-e835e60ce4cd95dede2a23142e79d87af0014eb4d150380d051a96e3aeb062e2"} 6
certmanager_csi_managed_certificate_request_count_total{node="54bb9576d6"} 0
certmanager_csi_managed_volume_count_total{node="54bb9576d6"} 0
There was a problem hiding this comment.
Thank you @wallrj-cyberark for your review.
Per my understanding, these are all expected behavior.
managed_volume_count is for existing successfully attached volume with issued cert.
- If the cert was not issued, it won't count
- If the pod was deleted, it will be removed from the count
- certmanager_csi_managed_volume_count_total → certmanager_csi_managed_volume_count - certmanager_csi_managed_certificate_request_count_total → certmanager_csi_managed_certificate_request_count Signed-off-by: Jing Liu <jingliu@apple.com>
|
The code looks good to me now. Thank you @7ing for making the changes!! |
|
@hjoshi123: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wallrj, wallrj-cyberark The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Based on cert-manager/csi-lib#73 Signed-off-by: Jing Liu <jingliu@apple.com> Signed-off-by: Richard Wall <richard.wall@cyberark.com>
Based on cert-manager/csi-lib#73 Signed-off-by: Jing Liu <jingliu@apple.com> Signed-off-by: Richard Wall <richard.wall@cyberark.com>
|
Released in https://github.com/cert-manager/csi-lib/releases/tag/v0.10.0. Please test and report back. |
This PR enables csi-lib metrics feature in cert-manager/csi-lib#73 Signed-off-by: Jing Liu <jingliu@apple.com>
This PR enables csi-lib metrics feature in cert-manager/csi-lib#73 Signed-off-by: Jing Liu <jingliu@apple.com>
certmanager_csi_certificate_request_expiration_timestamp_seconds
certmanager_csi_certificate_request_ready_status
certmanager_csi_certificate_request_renewal_timestamp_seconds
certmanager_csi_issue_requests_total
certmanager_csi_issue_errors_total
certmanager_csi_managed_certificate_request_count
certmanager_csi_managed_volume_count
fixes: #60
CyberArk tracker: VC-46169