diff --git a/docs/img/valkey-and-cluster-ratelimit.svg b/docs/img/valkey-and-cluster-ratelimit.svg
new file mode 100644
index 0000000000..937bdaf415
--- /dev/null
+++ b/docs/img/valkey-and-cluster-ratelimit.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/kubernetes/ingress-controller.md b/docs/kubernetes/ingress-controller.md
index 4938c2a7ef..81401203f1 100644
--- a/docs/kubernetes/ingress-controller.md
+++ b/docs/kubernetes/ingress-controller.md
@@ -642,6 +642,7 @@ line option `-enable-swarm` and `-enable-ratelimits`.
 The rest depends on the implementation, that can be:
 
 - [Redis](https://redis.io)
+- [Valkey](https://valkey.io)
 - alpha version: [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)
 
 ### Redis based
@@ -654,7 +655,7 @@ resolve redis hostnames as shown in the example, if skipper does not
 have `dnsPolicy: ClusterFirstWithHostNet` in its Pod spec, see also
 [DNS policy in the official Kubernetes documentation](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy).
 
-This setup is considered experimental and should be carefully tested
+This setup is considered stable, but should be carefully tested
 before running it in production.
 
 Example redis statefulset with headless service:
@@ -722,6 +723,88 @@ spec:
   type: ClusterIP
 ```
 
+### Valkey based
+
+Additionally you have to add `-swarm-valkey-urls` to skipper
+`args:`. For example: `-swarm-valkey-urls=skipper-valkey-0.skipper-ingress-valkey.kube-system.svc.cluster.local:6379,skipper-valkey-1.skipper-ingress-valkey.kube-system.svc.cluster.local:6379`.
+
+Skipper running with `hostNetwork` in Kubernetes will not be able to
+resolve Valkey hostnames as shown in the example unless it has
+`dnsPolicy: ClusterFirstWithHostNet` in its Pod spec, see also
+[DNS policy in the official Kubernetes documentation](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy).
+
+This setup is considered stable, but should be carefully tested
+before running it in production.
+
+Example valkey statefulset with headless service:
+
+```yaml
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  labels:
+    application: skipper-ingress
+    component: valkey
+    version: 9-alpine3.22-20260330
+  name: skipper-valkey
+  namespace: kube-system
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      component: valkey
+  serviceName: skipper-ingress-valkey
+  template:
+    metadata:
+      labels:
+        application: skipper-ingress
+        component: valkey
+        version: 9-alpine3.22-20260330
+    spec:
+      containers:
+      - image: container-registry.zalando.net/library/valkey-9-alpine:9-alpine3.22-20260330
+        name: skipper-valkey
+        ports:
+        - containerPort: 6379
+          protocol: TCP
+        readinessProbe:
+          exec:
+            command:
+            - valkey-cli
+            - ping
+          failureThreshold: 3
+          initialDelaySeconds: 10
+          periodSeconds: 60
+          successThreshold: 1
+          timeoutSeconds: 1
+        resources:
+          limits:
+            cpu: 100m
+            memory: 100Mi
+      dnsPolicy: ClusterFirst
+      restartPolicy: Always
+      schedulerName: default-scheduler
+---
+apiVersion: v1
+kind: Service
+metadata:
+  labels:
+    application: skipper-ingress
+    component: valkey
+  name: skipper-ingress-valkey
+  namespace: kube-system
+spec:
+  clusterIP: None
+  ports:
+  - port: 6379
+    protocol: TCP
+    targetPort: 6379
+  selector:
+    application: skipper-ingress
+    component: valkey
+  type: ClusterIP
+```
+
 ### SWIM based
 
diff --git a/docs/operation/operation.md b/docs/operation/operation.md
index b1038ee909..412c116d0b 100644
--- a/docs/operation/operation.md
+++ b/docs/operation/operation.md
@@ -714,6 +714,32 @@ by the default, and exposed among the timers via the following keys:
 
 See more details about rate limiting at [Rate limiting](../reference/filters.md#clusterclientratelimit).
 
+### Valkey - Rate limiting metrics
+
+System metrics exposed by the Valkey client:
+
+Prometheus query to get the number of Valkey shards known to the skipper ringclient:
+```
+skipper_custom_gauges{key =~ "^swarm[.]valkey[.]shards"}
+```
+
+Timer metrics for the latencies and errors of the communication with the auxiliary Valkey instances are enabled
+by default, and exposed among the timers via the following keys:
+
+```
+sum(rate(skipper_filter_request_duration_seconds_count{filter=~"cluster.*"}[1m]))
+```
+
+- skipper.swarm.valkey.query.allow.success: successful allow requests to the rate limiter, ungrouped
+- skipper.swarm.valkey.query.allow.failure: failed allow requests to the rate limiter, ungrouped, where the Valkey
+  communication failed
+- skipper.swarm.valkey.query.retryafter.success.: successful allow requests to the rate limiter, grouped
+  by the rate limiter group name when used
+- skipper.swarm.valkey.query.retryafter.failure.: failed allow requests to the rate limiter,
+  where the Valkey communication failed, grouped by the rate limiter group name when used
+
+See more details about rate limiting at [Rate limiting](../reference/filters.md#clusterclientratelimit).
+
 ### Open Policy Agent metrics
 
 If Open Policy Agent filters are enabled, the following counters show up in the `/metrics` endpoint. The bundle-name is the first parameter of the filter so that for example increased error codes can be attributed to a specific source bundle / system.
diff --git a/docs/reference/filters.md b/docs/reference/filters.md
index 3a98689658..924435622d 100644
--- a/docs/reference/filters.md
+++ b/docs/reference/filters.md
@@ -2605,8 +2605,8 @@ with `429 Too Many Requests` when limit is reached.
 
 ### clusterLeakyBucketRatelimit
 
-Implements leaky bucket rate limit algorithm that uses Redis as a storage.
-Requires command line flags `-enable-ratelimits`, `-enable-swarm` and `-swarm-redis-urls` to be set.
+Implements the leaky bucket rate limit algorithm, using Redis or Valkey as storage.
+Requires command line flags `-enable-ratelimits`, `-enable-swarm` and either `-swarm-redis-urls` or `-swarm-valkey-urls` to be set.
 
 The leaky bucket is an algorithm based on an analogy of how a bucket with a constant leak will overflow if either
 the average rate at which water is poured in exceeds the rate at which the bucket leaks or if more water than
@@ -2706,7 +2706,7 @@ Path("/expensive") -> clusterLeakyBucketRatelimit("user-${request.cookie.Authori
 ### ratelimitFailClosed
 
 This filter changes the failure mode for all rate limit filters of the route.
-By default rate limit filters fail open on infrastructure errors (e.g. when redis is down) and allow requests.
+By default rate limit filters fail open on infrastructure errors (e.g. when Redis or Valkey is down) and allow requests.
 When this filter is present on the route, rate limit filters will fail closed in case of infrastructure errors and deny requests.
 
 Examples:
@@ -2715,7 +2715,7 @@ fail_open: * -> clusterRatelimit("g",10, "1s")
 fail_closed: * -> ratelimitFailClosed() -> clusterRatelimit("g", 10, "1s")
 ```
 
-In case `clusterRatelimit` could not reach the swarm (e.g. redis):
+In case `clusterRatelimit` could not reach the swarm (e.g. Redis or Valkey):
 
 * Route `fail_open` will allow the request
 * Route `fail_closed` will deny the request
diff --git a/docs/tutorials/operations.md b/docs/tutorials/operations.md
index 198fef86ae..1522982636 100644
--- a/docs/tutorials/operations.md
+++ b/docs/tutorials/operations.md
@@ -96,16 +96,27 @@ based on X-Forwarded-For headers, you can also ignore this.
 Ratelimits can be calculated for the whole cluster instead of having
 only the instance based ratelimits. The common term we use in skipper
 documentation is [cluster ratelimit](ratelimit.md#cluster-ratelimit).
-There are two option, but we highly recommend the use of Redis based
-cluster ratelimits.
To support redis based cluster ratelimits you have to
-use `-enable-swarm` and add a list of URLs to redis
-`-swarm-redis-urls=skipper-ingress-redis-0.skipper-ingress-redis.kube-system.svc.cluster.local:6379,skipper-ingress-redis-1.skipper-ingress-redis.kube-system.svc.cluster.local:6379`. We
-run [redis as
-statefulset](https://github.com/zalando-incubator/kubernetes-on-aws/blob/beta/cluster/manifests/skipper/skipper-redis.yaml)
-with a [headless
-service](https://github.com/zalando-incubator/kubernetes-on-aws/blob/beta/cluster/manifests/skipper/skipper-redis-service.yaml)
-to have predictable names. We chose to not use a persistent volume,
-because storing the data in memory is good enough for this use case.
+There are three options, but we highly recommend the use of Valkey based
+cluster ratelimits. To support Valkey based cluster ratelimits you have to
+use `-enable-swarm` and either pass a static list of Valkey URLs via
+`-swarm-valkey-urls=skipper-ingress-valkey-0.skipper-ingress-valkey.kube-system.svc.cluster.local:6379,skipper-ingress-valkey-1.skipper-ingress-valkey.kube-system.svc.cluster.local:6379` or use autoscaling.
+
+We run [valkey as statefulset](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/skipper-valkey.yaml)
+with a [headless service](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/skipper-valkey-service.yaml)
+and [horizontal pod autoscaler](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/hpa-valkey.yaml).
+
+To use autoscaling with routesrv you can use `-swarm-valkey-remote=http://skipper-ingress-routesrv.kube-system.svc.cluster.local/swarm/valkey/shards`, depending on how you expose routesrv into your cluster.
+For simpler setups that do not run routesrv, you can use these arguments to have the Valkey instances updated automatically from Kubernetes:
+
+```
+-kubernetes-valkey-service-namespace=kube-system
+-kubernetes-valkey-service-name=skipper-ingress-valkey
+```
+
+To run valkey, we chose to not use a persistent volume, because
+storing the data in memory is good enough for the rate limiting use
+case.
+
 
 #### East West
 
diff --git a/docs/tutorials/ratelimit.md b/docs/tutorials/ratelimit.md
index 92300dbb42..a2cc669e60 100644
--- a/docs/tutorials/ratelimit.md
+++ b/docs/tutorials/ratelimit.md
@@ -84,9 +84,10 @@ have a powerful tool like the provided `clientRatelimit`.
 
 A cluster ratelimit computes all requests for all skipper peers. This
 requires, that you run skipper with `-enable-swarm` and select one of
-the two implementations:
+the three implementations:
 
 - [Redis](https://redis.io)
+- [Valkey](https://valkey.io/)
 - [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)
 
 Make sure all requirements, that are dependent on the implementation
@@ -152,6 +153,69 @@ following Redis commands:
 
 ![Picture showing Skipper with Redis based swarm and ratelimit](../img/redis-and-cluster-ratelimit.svg)
 
+### Valkey based Cluster Ratelimits
+
+This solution is independent of the dataclient being used.
+You have to run one or more [Valkey](https://valkey.io/) instances.
+See also [Running with Valkey based Cluster Ratelimits](../kubernetes/ingress-controller.md#valkey-based).
+
+There are 3 different configurations to assign Valkey instances as a Skipper Valkey swarm.
+
+#### Static
+
+Specify `-swarm-valkey-urls`; multiple instances are separated by commas,
+for example: `-swarm-valkey-urls=valkey1:6379,valkey2:6379`.
+Use this if you don't need to scale your Valkey instances.
+
+#### Kubernetes Service Selector
+
+Specify `-kubernetes-valkey-service-namespace=`, `-kubernetes-valkey-service-name=`
+and optionally `-kubernetes-valkey-service-port=`.
+
+Skipper will update Valkey addresses every 10 seconds from the specified service endpoints.
+This allows you to dynamically scale Valkey instances.
+Note that when `-kubernetes` is set Skipper also fetches `Ingresses` and `RouteGroups` for routing,
+see [ingress-controller deployment docs](../kubernetes/ingress-controller.md).
+
+#### HTTP Endpoint
+
+Specify `-swarm-valkey-remote=http://127.0.0.1/valkey/endpoints`.
+
+Skipper will update Valkey addresses every 10 seconds from this remote URL
+that should return data in the following JSON format:
+```json
+{
+  "endpoints": [
+    {"address": "10.2.0.1:6379"}, {"address": "10.2.0.2:6379"},
+    {"address": "10.2.0.3:6379"}, {"address": "10.2.0.4:6379"},
+    {"address": "10.2.0.5:6379"}
+  ]
+}
+```
+
+If you have [routesrv proxy](https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#routesrv) enabled,
+you need to configure Skipper with the flag `-swarm-valkey-remote=http://..svc.cluster.local/swarm/valkey/shards`.
+`Routesrv` will be responsible for collecting Valkey endpoints and Skipper will poll them from it.
+
+#### Implementation
+
+The implementation uses the [Valkey-Go
+library](https://github.com/valkey-io/valkey-go) and a client-side
+hash ring implementation on the skipper side, which is faster than the
+go-redis implementation, to access a shard via client hashing and
+spread the load across multiple Valkey instances. This way we are
+able to scale out the shared rate limit storage.
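
To illustrate the idea of client-side hashing described above, here is a minimal sketch of a generic consistent-hash ring in Go. It is not skipper's actual ring implementation: the `ring` type, the `newRing` and `shardFor` helpers, the FNV-1a hash, the number of virtual nodes and the example keys are all assumptions chosen for illustration. Only the shard addresses are taken from the static example above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a hypothetical, minimal consistent-hash ring.
// It only illustrates client-side shard selection.
type ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> shard address
}

// hashOf maps a string to a position on the ring (FNV-1a is an assumption).
func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

// newRing places vnodes virtual nodes per shard on the ring,
// which evens out the key distribution between shards.
func newRing(shards []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, s := range shards {
		for i := 0; i < vnodes; i++ {
			p := hashOf(fmt.Sprintf("%s#%d", s, i))
			if _, taken := r.owner[p]; taken {
				continue // keep the first owner on a rare hash collision
			}
			r.owner[p] = s
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// shardFor returns the shard owning the first ring position at or
// after the key's hash, wrapping around at the end of the ring.
func (r *ring) shardFor(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"valkey1:6379", "valkey2:6379"}, 64)
	// The key names are made up; every skipper instance that builds the
	// same ring picks the same shard for the same key.
	for _, key := range []string{"ratelimit.groupA", "ratelimit.groupB"} {
		fmt.Printf("%s -> %s\n", key, r.shardFor(key))
	}
}
```

Because every client computes the shard locally from the key, no coordination between skipper instances is needed, and adding a shard moves only the keys that fall next to its ring positions.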
+
+The ratelimit algorithm is a sliding window and makes use of the
+following Valkey commands:
+
+- [ZREMRANGEBYSCORE](https://valkey.io/commands/zremrangebyscore),
+- [ZCARD](https://valkey.io/commands/zcard),
+- [ZADD](https://valkey.io/commands/zadd) and
+- [ZRANGEBYSCORE](https://valkey.io/commands/zrangebyscore)
+
+![Picture showing Skipper with Valkey based swarm and ratelimit](../img/valkey-and-cluster-ratelimit.svg)
+
 ### SWIM based Cluster Ratelimits
 
 [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)