Parallel processing with Jetstream #342

Hey team, apologies for the poor title on this issue, but I wanted to start a conversation about some performance bottlenecks we've run into and how we could work around them.

### Background
We are using gnmic as the data source for a network telemetry API that answers queries about the state of our internal network. Currently, a simple message (the Arista EOS version, for example) is sent to the Jetstream subject `telemetry.inventory`. Specific state messages (BGP status, interface status, etc.) are sent to `telemetry.gnmic.<device_name>.<subscription>`. When a message is received on `telemetry.inventory`, we spawn a worker goroutine that subscribes to `telemetry.gnmic.<device_name>.>` for that device.

The reason we use one worker goroutine per device is that we need to preserve message ordering. We don't want two messages that gnmic received in order, in quick succession, to be received out of order by our application. For example, a flapping interface sends multiple up/down events, and what matters most is that we store the *last* status for querying.

This solution allows us to maintain ordering per-device, while also processing the total queue in parallel.
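
As a rough sketch of the pattern described above (not our actual code), the dispatcher below keeps one goroutine and one channel per device. The NATS subscription is left out so the example stays self-contained. `Update`, `Dispatcher`, and the device names are hypothetical stand-ins for illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

// Update is a hypothetical stand-in for a decoded telemetry message.
type Update struct {
	Device string
	Status string
}

// Dispatcher routes each update to a dedicated goroutine for its device, so
// updates for one device are applied in arrival order while different
// devices are handled in parallel.
type Dispatcher struct {
	mu      sync.Mutex
	wg      sync.WaitGroup
	workers map[string]chan Update
	last    map[string]string // last status stored per device
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{
		workers: make(map[string]chan Update),
		last:    make(map[string]string),
	}
}

// Dispatch hands u to its device's worker, starting the worker on first use.
// It must not be called after Close.
func (d *Dispatcher) Dispatch(u Update) {
	d.mu.Lock()
	ch, ok := d.workers[u.Device]
	if !ok {
		ch = make(chan Update, 16)
		d.workers[u.Device] = ch
		d.wg.Add(1)
		go d.run(ch)
	}
	d.mu.Unlock()
	ch <- u // a channel is FIFO, so per-device order is preserved
}

// run applies updates for a single device in the order they were dispatched.
func (d *Dispatcher) run(ch <-chan Update) {
	defer d.wg.Done()
	for u := range ch {
		d.mu.Lock()
		d.last[u.Device] = u.Status
		d.mu.Unlock()
	}
}

// Close stops all workers and waits for them to drain their channels.
func (d *Dispatcher) Close() {
	d.mu.Lock()
	for _, ch := range d.workers {
		close(ch)
	}
	d.mu.Unlock()
	d.wg.Wait()
}

// Last returns the most recently stored status for a device.
func (d *Dispatcher) Last(device string) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.last[device]
}

func main() {
	d := NewDispatcher()
	// A flapping interface on leaf1, interleaved with an update from leaf2.
	for _, u := range []Update{
		{Device: "leaf1", Status: "up"},
		{Device: "leaf2", Status: "down"},
		{Device: "leaf1", Status: "down"},
		{Device: "leaf1", Status: "up"},
	} {
		d.Dispatch(u)
	}
	d.Close()
	fmt.Println("leaf1:", d.Last("leaf1"), "leaf2:", d.Last("leaf2"))
}
```

The ordering guarantee here only holds if updates for a given device reach `Dispatch` from a single producer, which is the same assumption we make about the upstream subject.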

### Current Issue
While working through some bottlenecks, we found what we believe is causing issues within gnmic itself. From our investigation, each Jetstream output has a single unbuffered `msgChan`, so a write blocks until the message has been processed and successfully written to Jetstream.

* Single unbuffered `msgChan` - https://github.com/openconfig/gnmic/blob/main/pkg/outputs/nats_outputs/jetstream/jetstream_output.go#L139
* Blocking Behavior - https://github.com/openconfig/gnmic/blob/main/pkg/outputs/nats_outputs/jetstream/jetstream_output.go#L241

We had been seeing some missing messages, which we initially attributed to NATS itself dropping them. After investigating, we believe that writes in gnmic were timing out while blocked on the channel, and those messages were being lost. After increasing our `write-timeout` to 30s, all messages are delivered with no issues.

### Question
Do you have any recommendations for how we could utilize gnmic to process messages in parallel while also maintaining ordering?

One idea we had was to define one output per target or per subscription. Each output would have the same configuration and would exist solely to increase the number of `msgChan` channels created. This would maintain our ordering requirements within the context of a target or subscription.

We looked at increasing `num-workers`, since we currently use only 1, but we are concerned that while gnmic would still process the messages in order, the network writes to Jetstream could arrive out of order.
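
To make the ordering property we are after concrete, here is a minimal sketch of keyed sharding. It is purely an illustration, not an existing gnmic feature or option. Each message is routed to one of `n` workers by hashing its target name, so all messages for a target pass through the same channel in order, while different targets can be handled in parallel. `Msg`, `shard`, and `process` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Msg is a hypothetical message tagged with its target and a sequence number.
type Msg struct {
	Target string
	Seq    int
}

// shard maps a target to a worker index; the same target always maps to the
// same worker. n must be greater than zero.
func shard(target string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(target))
	return int(h.Sum32() % uint32(n))
}

// process fans msgs out to n workers keyed by target and returns, for each
// target, the sequence numbers in the order a worker handled them.
func process(msgs []Msg, n int) map[string][]int {
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	seen := make(map[string][]int)
	chans := make([]chan Msg, n)

	for i := range chans {
		chans[i] = make(chan Msg, 8)
		wg.Add(1)
		go func(ch <-chan Msg) {
			defer wg.Done()
			for m := range ch {
				mu.Lock()
				seen[m.Target] = append(seen[m.Target], m.Seq)
				mu.Unlock()
			}
		}(chans[i])
	}

	// Single producer: messages for one target enter one channel in order.
	for _, m := range msgs {
		chans[shard(m.Target, n)] <- m
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	return seen
}

func main() {
	var msgs []Msg
	for seq := 0; seq < 5; seq++ {
		for _, t := range []string{"leaf1", "leaf2", "spine1"} {
			msgs = append(msgs, Msg{Target: t, Seq: seq})
		}
	}
	for target, seqs := range process(msgs, 4) {
		fmt.Println(target, seqs)
	}
}
```

By contrast, if several workers pull from one shared channel, two messages for the same target can be picked up by different workers and written in either order, which is the concern with simply raising `num-workers`.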

Have you seen other users build a similar pattern before? Or do you have any other ideas we could use?
