Rework Elsa.Util.get_partition_count to optionally retry on failure #11

nathanmonteleone · 2025-05-23T19:27:19Z

https://simplifi.atlassian.net/browse/INT-11108

The basic problem is that topic creation in Kafka is asynchronous, so unit tests that create a topic tend to fail if we try to produce to it immediately afterwards. It turns out that part of the reason this is so fatal is that brod caches topic non-existence once a query has failed.

This PR adds intentional retries when we pull the partition count from topic metadata. It also makes these queries with non-client-based brod calls (i.e. you give it endpoints instead of a brod_client) to get around that caching behavior.
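For readers skimming the thread, the retry idea can be sketched roughly as follows. This is an illustration only, not the code in this PR: `RetrySketch`, `with_retries/3`, and `delay_ms` are hypothetical names, and the lookup is passed in as a function so that no particular brod call is assumed.

```elixir
defmodule RetrySketch do
  @moduledoc "Hypothetical helper for illustration; not the implementation in this PR."

  # Calls `fun` up to `tries` times in total, sleeping `delay_ms` between attempts.
  # `fun` is expected to return {:ok, value} or {:error, reason}.
  def with_retries(fun, tries, delay_ms) when is_function(fun, 0) and tries >= 1 do
    case fun.() do
      {:ok, _value} = ok ->
        ok

      # Last attempt: give up and return the underlying error.
      {:error, _reason} = error when tries == 1 ->
        error

      {:error, _reason} ->
        Process.sleep(delay_ms)
        with_retries(fun, tries - 1, delay_ms)
    end
  end
end

# Simulate a topic whose metadata only becomes visible on the third lookup.
{:ok, counter} = Agent.start_link(fn -> 0 end)

lookup = fn ->
  attempt = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  if attempt < 3, do: {:error, :unknown_topic_or_partition}, else: {:ok, 4}
end

IO.inspect(RetrySketch.with_retries(lookup, 5, 10))
# prints {:ok, 4}
```

Returning the last `{:error, reason}` unchanged keeps the original failure visible to the caller once the attempts run out.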

Use get_partition_count retries in the producer and consumer initializers to get around race conditions with topic creation.

LiruMouse · 2025-05-23T20:02:56Z

lib/elsa/producer/initializer.ex

@@ -1,4 +1,6 @@
 defmodule Elsa.Producer.Initializer do
+  alias Elsa.RetryConfig


This alias is only used once?

LiruMouse · 2025-05-23T20:03:48Z

lib/elsa/producer.ex

  @type message :: {iodata(), iodata()} | binary() | %{key: iodata(), value: iodata()}

  alias Elsa.ElsaRegistry
+  alias Elsa.RetryConfig


This alias is only used once

joshuawscott · 2025-05-23T21:15:47Z

lib/elsa/producer/initializer.ex

+    # Use the non-connection based partition_count.
+    # This circumvents a behavior in brod that caches topics as non-existent,
+    # which would break our ability to retry.
+    {:ok, endpoints} = Elsa.Util.get_endpoints(brod_client)


I think along with what Liru said, the Elsa.Util is used twice here, and isn't aliased (unlike the other modules where we do alias it)

Yeah I agree on both points. Makes me wonder if there's a credo flag to actually check this, that I'm missing...

Turns out there is, and it's violated all over the place. I'll fix these by hand for now and make a separate PR for the rest.

joshuawscott · 2025-05-23T21:17:23Z

lib/elsa/retry_config.ex

+  @spec no_retry() :: t()
+  def no_retry do
+    %__MODULE__{
+      tries: 1,


I like the config key 'tries'.
it's funny to realize how often I've made a retry module and had 'retries' as an argument to something, and then debated what it meant, when 'tries' was right there the whole time 😂
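The naming point can be made concrete with a small sketch. Only the `tries` key and `no_retry/0` appear in the diff above; the module name `TriesSketch` and the helper `retries_after_first/1` are hypothetical and exist only to show how the two counts relate.

```elixir
defmodule TriesSketch do
  # Hypothetical struct for illustration; only `tries` is taken from the diff.
  defstruct tries: 1

  # Exactly one attempt, so nothing is retried.
  def no_retry, do: %__MODULE__{tries: 1}

  # `tries` counts every attempt, including the first one,
  # so the number of re-attempts is always one less.
  def retries_after_first(%__MODULE__{tries: tries}) when tries >= 1, do: tries - 1
end

IO.inspect(TriesSketch.retries_after_first(TriesSketch.no_retry()))
# prints 0
```

With `tries`, the value is the total number of calls, which removes the usual ambiguity about whether the first attempt is counted.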

joshuawscott · 2025-05-23T21:21:05Z

lib/elsa/producer/initializer.ex

+    # This circumvents a behavior in brod that caches topics as non-existent,
+    # which would break our ability to retry.
+    {:ok, endpoints} = Elsa.Util.get_endpoints(brod_client)
+    {:ok, partitions} = Elsa.Util.partition_count(endpoints, topic, retry_config)


We are essentially just calling the ! version of each of these functions because we are pattern matching with {:ok, ...}. We should either use the ! version, or we should do a with or some other construct to handle errors. As-is, if we have a failure, it's going to complain about a pattern match, and that makes it difficult to identify the actual underlying error.

Agreed, let's use !

(If even the retry fails, that means you're trying to create a producer or consumer for a topic that won't return metadata on the partition count -- probably because it doesn't exist. So we're dead in the water at that point, there's not much additional error handling we could do.)
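A minimal sketch of the difference being discussed, assuming a tagged-tuple lookup and a `!` variant built on top of it. `BangSketch` and both functions are hypothetical stand-ins, and a plain map takes the place of real topic metadata.

```elixir
defmodule BangSketch do
  # Hypothetical lookup that returns a tagged tuple; the map stands in for topic metadata.
  def partition_count(topics, topic) do
    case Map.fetch(topics, topic) do
      {:ok, count} -> {:ok, count}
      :error -> {:error, {:topic_not_found, topic}}
    end
  end

  # The ! variant returns the bare value and raises with the underlying reason.
  # Matching `{:ok, count} = partition_count(...)` instead would raise a
  # MatchError, which reports the failed match rather than a descriptive message.
  def partition_count!(topics, topic) do
    case partition_count(topics, topic) do
      {:ok, count} -> count
      {:error, reason} -> raise "could not get partition count: #{inspect(reason)}"
    end
  end
end

IO.inspect(BangSketch.partition_count!(%{"events" => 3}, "events"))
# prints 3
```

By the usual Elixir convention, the `!` suffix signals that the function raises on failure, so callers that cannot recover anyway can let it crash with a readable message.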


Rework Elsa.Util.get_partition_count to optionally retry on failure.\…

bd84b62

…nUse get_partition_count retries in the producer and consumer initializers to get around race conditions with topic creation.

nathanmonteleone requested review from LiruMouse, ctcline-simplifi and joshuawscott as code owners May 23, 2025 19:27

nathanmonteleone changed the title ~~Rework Elsa.Util.get_partition_count to optionally retry on failure.\…~~ Rework Elsa.Util.get_partition_count to optionally retry on failure May 23, 2025

nathanmonteleone added 2 commits May 23, 2025 14:59

Clarify error log message

ae12c55

Formatting

2da1d54

LiruMouse reviewed May 23, 2025


joshuawscott reviewed May 23, 2025


PR feedback - tried to normalize our use of aliases to only situation…

8bd744f

…s where we use them more than once. Also replaced pattern match error checks with the ! version of get_partition_count.

joshuawscott approved these changes May 27, 2025


ctcline-simplifi approved these changes May 27, 2025


LiruMouse approved these changes May 28, 2025


nathanmonteleone mentioned this pull request May 28, 2025

Add credo config and fix inconsistent alias usage #12

Merged

nathanmonteleone merged commit b9acf5b into main May 28, 2025
3 checks passed

nathanmonteleone deleted the partition_count_retry branch May 28, 2025 16:25
