Merged
1 change: 1 addition & 0 deletions docs/img/valkey-and-cluster-ratelimit.svg
85 changes: 84 additions & 1 deletion docs/kubernetes/ingress-controller.md
@@ -642,6 +642,7 @@ line option `-enable-swarm` and `-enable-ratelimits`.
The rest depends on the implementation, that can be:

- [Redis](https://redis.io)
- [Valkey](https://valkey.io)
- alpha version: [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)

### Redis based
@@ -654,7 +655,7 @@ resolve redis hostnames as shown in the example, if skipper does not
have `dnsPolicy: ClusterFirstWithHostNet` in its Pod spec, see also
[DNS policy in the official Kubernetes documentation](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy).

This setup is considered stable, but should be carefully tested
before running it in production.

Example redis statefulset with headless service:
@@ -722,6 +723,88 @@ spec:
type: ClusterIP
```

### Valkey based

Additionally you have to add `-swarm-valkey-urls` to skipper
`args:`. For example: `-swarm-valkey-urls=skipper-valkey-0.skipper-valkey.kube-system.svc.cluster.local:6379,skipper-valkey-1.skipper-valkey.kube-system.svc.cluster.local:6379`.

Skipper running with `hostNetwork` in Kubernetes will not be able to
resolve the Valkey hostnames shown in the example unless it has
`dnsPolicy: ClusterFirstWithHostNet` in its Pod spec; see also
[DNS policy in the official Kubernetes documentation](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy).

This setup is considered stable, but should be carefully tested
before running it in production.

Example valkey statefulset with headless service:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    application: skipper-ingress
    component: valkey
    version: 9-alpine3.22-20260330
  name: skipper-valkey
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      statefulset: skipper-ingress-valkey
  serviceName: skipper-ingress-valkey
  template:
    metadata:
      labels:
        application: skipper-ingress
        component: valkey
        statefulset: skipper-ingress-valkey
        version: 9-alpine3.22-20260330
    spec:
      containers:
      - image: container-registry.zalando.net/library/valkey-9-alpine:9-alpine3.22-20260330
        name: skipper-valkey
        ports:
        - containerPort: 6379
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - valkey-cli
            - ping
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
---
apiVersion: v1
kind: Service
metadata:
  labels:
    application: skipper-ingress
    component: valkey
  name: skipper-ingress-valkey
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    application: skipper-ingress
    component: valkey
  type: ClusterIP
```



### SWIM based
26 changes: 26 additions & 0 deletions docs/operation/operation.md
@@ -714,6 +714,32 @@ by default, and exposed among the timers via the following keys:

See more details about rate limiting at [Rate limiting](../reference/filters.md#clusterclientratelimit).

### Valkey - Rate limiting metrics

System metrics exposed by the Valkey client:

Prometheus query to get the number of Valkey shards known to the skipper ring client:
```
skipper_custom_gauges{key =~ "^swarm[.]valkey[.]shards"}
```

Timer metrics for the latencies and errors of the communication with the auxiliary Valkey instances are enabled
by default, and exposed among the timers via the following keys:

- skipper.swarm.valkey.query.allow.success: successful allow requests to the rate limiter, ungrouped
- skipper.swarm.valkey.query.allow.failure: failed allow requests to the rate limiter, ungrouped, where the Valkey
  communication failed
- skipper.swarm.valkey.query.retryafter.success.<group>: successful retry-after requests to the rate limiter, grouped
  by the rate limiter group name when used
- skipper.swarm.valkey.query.retryafter.failure.<group>: failed retry-after requests to the rate limiter, where the
  Valkey communication failed, grouped by the rate limiter group name when used

For example, a Prometheus query for the per-second rate of cluster rate limit filter executions:

```
sum(rate(skipper_filter_request_duration_seconds_count{filter=~"cluster.*"}[1m]))
```
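
Assuming your setup exposes these keys as Prometheus counters, a query in the style of the examples above could alert on failed Valkey queries; the metric name and label below are hypothetical and depend on your metrics flavor and prefix:

```
sum(rate(skipper_custom_total{key=~"^swarm[.]valkey[.]query[.].*[.]failure"}[1m]))
```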

See more details about rate limiting at [Rate limiting](../reference/filters.md#clusterclientratelimit).

### Open Policy Agent metrics

If Open Policy Agent filters are enabled, the following counters show up in the `/metrics` endpoint. The bundle-name is the first parameter of the filter so that for example increased error codes can be attributed to a specific source bundle / system.
8 changes: 4 additions & 4 deletions docs/reference/filters.md
@@ -2605,8 +2605,8 @@ with `429 Too Many Requests` when limit is reached.

### clusterLeakyBucketRatelimit

Implements the leaky bucket rate limit algorithm using Redis or Valkey as storage.
Requires the command line flags `-enable-ratelimits`, `-enable-swarm` and either `-swarm-redis-urls` or `-swarm-valkey-urls` to be set.

The leaky bucket is an algorithm based on an analogy of how a bucket with a constant leak will overflow if either
the average rate at which water is poured in exceeds the rate at which the bucket leaks or if more water than
@@ -2706,7 +2706,7 @@ Path("/expensive") -> clusterLeakyBucketRatelimit("user-${request.cookie.Authori
### ratelimitFailClosed

This filter changes the failure mode for all rate limit filters of the route.
By default rate limit filters fail open on infrastructure errors (e.g. when Redis or Valkey is down) and allow requests.
When this filter is present on the route, rate limit filters will fail closed in case of infrastructure errors and deny requests.

Examples:
```
fail_open: * -> clusterRatelimit("g",10, "1s")
fail_closed: * -> ratelimitFailClosed() -> clusterRatelimit("g", 10, "1s")
```

In case `clusterRatelimit` could not reach the swarm (e.g. Redis or Valkey):

* Route `fail_open` will allow the request
* Route `fail_closed` will deny the request
31 changes: 21 additions & 10 deletions docs/tutorials/operations.md
@@ -96,16 +96,27 @@ based on X-Forwarded-For headers, you can also ignore this.
Ratelimits can be calculated for the whole cluster instead of having
only the instance based ratelimits. The common term we use in skipper
documentation is [cluster ratelimit](ratelimit.md#cluster-ratelimit).
There are several options, but we highly recommend the use of Valkey based
cluster ratelimits. To support Valkey based cluster ratelimits you have to
use `-enable-swarm` and either a static list of URLs to Valkey, for example
`-swarm-valkey-urls=skipper-ingress-valkey-0.skipper-ingress-valkey.kube-system.svc.cluster.local:6379,skipper-ingress-valkey-1.skipper-ingress-valkey.kube-system.svc.cluster.local:6379`, or autoscaling.

We run [valkey as statefulset](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/skipper-valkey.yaml)
with a [headless service](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/skipper-valkey-service.yaml)
and [horizontal pod autoscaler](https://github.com/zalando-incubator/kubernetes-on-aws/blob/stable/cluster/manifests/skipper/hpa-valkey.yaml).

To use autoscaling with routesrv you can use `-swarm-valkey-remote=http://skipper-ingress-routesrv.kube-system.svc.cluster.local/swarm/valkey/shards`, depending on how you expose routesrv in your cluster.
For a simpler setup that does not run routesrv, you can use these arguments to have the list of Valkey instances automatically updated from Kubernetes:

```
-kubernetes-valkey-service-namespace=kube-system
-kubernetes-valkey-service-name=skipper-ingress-valkey
```

For Valkey we chose not to use a persistent volume, because storing
the data in memory is good enough for the rate limiting use case.


#### East West

66 changes: 65 additions & 1 deletion docs/tutorials/ratelimit.md
@@ -84,9 +84,10 @@ have a powerful tool like the provided `clientRatelimit`.

A cluster ratelimit computes all requests for all skipper peers. This
requires, that you run skipper with `-enable-swarm` and select one of
the three implementations:

- [Redis](https://redis.io)
- [Valkey](https://valkey.io/)
- [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)

Make sure all requirements, that are dependent on the implementation
@@ -152,6 +153,69 @@ following Redis commands:

![Picture showing Skipper with Redis based swarm and ratelimit](../img/redis-and-cluster-ratelimit.svg)

### Valkey based Cluster Ratelimits

This solution is independent of the dataclient being used.
You have to run one or more [Valkey](https://valkey.io/) instances.
See also [Running with Valkey based Cluster Ratelimits](../kubernetes/ingress-controller.md#valkey-based).

There are three different ways to configure the Valkey instances that form a Skipper Valkey swarm.

#### Static

Specify `-swarm-valkey-urls`, multiple instances can be separated by comma,
for example: `-swarm-valkey-urls=valkey1:6379,valkey2:6379`.
Use this if you don't need to scale your Valkey instances.

#### Kubernetes Service Selector

Specify `-kubernetes-valkey-service-namespace=<namespace>`, `-kubernetes-valkey-service-name=<name>`
and optionally `-kubernetes-valkey-service-port=<port number>`.

Skipper will update Valkey addresses every 10 seconds from specified service endpoints.
This allows you to dynamically scale Valkey instances.
Note that when `-kubernetes` is set Skipper also fetches `Ingresses` and `RouteGroups` for routing,
see [ingress-controller deployment docs](../kubernetes/ingress-controller.md).
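
Put together, and assuming the headless service from the [ingress-controller docs](../kubernetes/ingress-controller.md#valkey-based), the flags could look like this (the port value is illustrative, since the port flag is optional):

```
-kubernetes
-kubernetes-valkey-service-namespace=kube-system
-kubernetes-valkey-service-name=skipper-ingress-valkey
-kubernetes-valkey-service-port=6379
```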

#### HTTP Endpoint

Specify `-swarm-valkey-remote=http://127.0.0.1/valkey/endpoints`.

Skipper will update Valkey addresses every 10 seconds from this remote URL
that should return data in the following JSON format:
```json
{
  "endpoints": [
    {"address": "10.2.0.1:6379"}, {"address": "10.2.0.2:6379"},
    {"address": "10.2.0.3:6379"}, {"address": "10.2.0.4:6379"},
    {"address": "10.2.0.5:6379"}
  ]
}
```

If you have [routesrv proxy](https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#routesrv) enabled,
you need to configure Skipper with the flag `-swarm-valkey-remote=http://<routesrv-service-name>.<routesrv-namespace>.svc.cluster.local/swarm/valkey/shards`.
`Routesrv` will be responsible for collecting Valkey endpoints and Skipper will poll them from it.

#### Implementation

The implementation uses the [valkey-go
library](https://github.com/valkey-io/valkey-go) together with a
client side hash ring on the skipper side, which is faster than the
go-redis implementation. A shard is selected via client hashing, which
spreads the load across multiple Valkey instances. This way we are
able to scale out the shared rate limit storage.

The ratelimit algorithm is a sliding window and makes use of the
following Valkey commands:

- [ZREMRANGEBYSCORE](https://valkey.io/commands/zremrangebyscore),
- [ZCARD](https://valkey.io/commands/zcard),
- [ZADD](https://valkey.io/commands/zadd) and
- [ZRANGEBYSCORE](https://valkey.io/commands/zrangebyscore)

![Picture showing Skipper with Valkey based swarm and ratelimit](../img/valkey-and-cluster-ratelimit.svg)

### SWIM based Cluster Ratelimits

[SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)