@nathanmonteleone nathanmonteleone commented Jun 17, 2025

This should now completely fix the following issue:

#15
(Internal issue tracking): https://simplifi.atlassian.net/browse/INT-11499
https://simplifi.atlassian.net/browse/INT-11494

Two changes in this one:

First, added a callback that fires when a batch of assignments is complete. This should make it easier to write stable integration tests: for example, you can subscribe to a topic using latest for the offset, wait for the assignments-complete callback, and then publish.
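
A minimal sketch of that test pattern, using plain processes only. The module, `start_fake_group/1`, and the message shape are all hypothetical stand-ins, not Elsa's API: the "consumer group" here is just a spawned process that invokes the callback after a short delay.

```elixir
defmodule AssignmentsCompleteSketch do
  # Hypothetical stand-in for a consumer group: it pretends to finish a batch
  # of assignments, then invokes the zero-arity callback it was given.
  def start_fake_group(on_assignments_complete)
      when is_function(on_assignments_complete, 0) do
    spawn(fn ->
      Process.sleep(50)
      on_assignments_complete.()
    end)
  end

  # Block the calling (test) process until the callback has fired.
  def wait_for_assignments(timeout \\ 1_000) do
    test_pid = self()
    ref = make_ref()

    # The callback only forwards a message to the test process.
    start_fake_group(fn -> send(test_pid, {:assignments_complete, ref}) end)

    receive do
      {:assignments_complete, ^ref} -> :ok
    after
      timeout -> {:error, :timeout}
    end
  end
end
```

In a real test, publishing would happen only after `wait_for_assignments/1` returns `:ok`, so a consumer subscribed at the latest offset cannot miss the message.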

Second, reworked WorkerSupervisor's supervision tree:

  • Fixed where we set restart: :transient -- it was simply in the wrong place before. The intent all along was for the underlying DynamicSupervisor to be transient, not the module-based Supervisor.
  • Fixed a race condition by no longer putting the underlying DynamicSupervisor in ElsaRegistry. The problem was that ElsaRegistry does not synchronously remove the keys of dead processes, so we were intermittently getting an {:error, :already_started} from Supervisor.restart_child. This could also have been fixed with an explicit call to ElsaRegistry.unregister_name/2, but there's no benefit to having this particular transient process in the registry anyway. It just muddies the picture.
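
The two fixes above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `SupervisionSketch` and `recycle/1` are hypothetical names, and only standard `Supervisor` and `DynamicSupervisor` calls are used.

```elixir
defmodule SupervisionSketch do
  # Hypothetical stand-in for WorkerSupervisor; not the PR's actual code.
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    children = [
      %{
        # Explicit id, so terminate_child/restart_child can address the child.
        id: :worker_dynamic_supervisor,
        # No :name option: the child is not registered anywhere, so no stale
        # registry key can collide with a restart.
        start: {DynamicSupervisor, :start_link, [[strategy: :one_for_one]]},
        # :transient applies to this child only, not to SupervisionSketch itself.
        restart: :transient,
        type: :supervisor
      }
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  # Stop the DynamicSupervisor (and any workers under it), then start a fresh one.
  def recycle(sup) do
    _ = Supervisor.terminate_child(sup, :worker_dynamic_supervisor)
    Supervisor.restart_child(sup, :worker_dynamic_supervisor)
  end
end
```

Because the child is reachable only through its parent's child list, the parent is the single source of truth about whether it is running.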

…ror in the supervision tree of WorkerSupervisor.

config :logger,
- handle_sasl_reports: false,
+ handle_sasl_reports: true,

I want to leave this in for now -- if the integration tests fail in CI, I'd like to be able to get a dump of process starts/deaths.

processes by the consumer group manager.
"""
- use Supervisor, restart: :transient
+ use Supervisor

restart: :transient here affects how the module-based supervisor (WorkerSupervisor) is treated by its parent. That's actually not what I intended at all -- if WorkerSupervisor dies I want it to restart automatically.

The intent was for the child DynamicSupervisor not to restart automatically. That needs to be specified below when we create it.
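
To make the distinction concrete, here is a small sketch with two hypothetical modules (not from this PR). Options passed to `use Supervisor` end up in the module's own generated `child_spec/1`, which is what a parent supervisor reads.

```elixir
defmodule TransientBySelf do
  # The option lands in TransientBySelf.child_spec/1, i.e. it tells the PARENT
  # how to treat this supervisor. It says nothing about this module's children.
  use Supervisor, restart: :transient

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg)

  @impl true
  def init(_arg), do: Supervisor.init([], strategy: :one_for_one)
end

defmodule PermanentBySelf do
  # No :restart override, so a parent falls back to the default (:permanent).
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg)

  @impl true
  def init(_arg), do: Supervisor.init([], strategy: :one_for_one)
end

# The generated spec carries the override only for the first module.
transient_spec = TransientBySelf.child_spec([])
permanent_spec = PermanentBySelf.child_spec([])

IO.inspect(transient_spec.restart)
IO.inspect(Map.get(permanent_spec, :restart, :permanent))
```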

# Make sure the DynamicSupervisor itself is truly cleaned up from the Supervisor's perspective,
# so that it will restart reliably
_ = Supervisor.terminate_child(module_supervisor, :worker_dynamic_supervisor)
if dynamic_worker_supervisor != :undefined do

If for some reason the dynamic_worker_supervisor isn't alive at this point (e.g. because we got that transient "coordinator not available" error and it prevented creation or killed off the supervision tree), we should skip the rest so that we don't crash the top-level supervisor process.
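
A rough sketch of such a guard, assuming the child is looked up through `Supervisor.which_children/1`, which reports `:undefined` in place of a pid for a child that is not running. `GuardSketch`, `child_pid/2`, and `stop_workers/1` are hypothetical names.

```elixir
defmodule GuardSketch do
  # Returns the child's pid, or :undefined when the child is missing,
  # stopped, or currently restarting.
  def child_pid(sup, id) do
    sup
    |> Supervisor.which_children()
    |> Enum.find_value(:undefined, fn
      {^id, pid, _type, _modules} when is_pid(pid) -> pid
      _other -> nil
    end)
  end

  # Terminate all workers, but skip the work entirely when there is no live
  # DynamicSupervisor, instead of crashing the caller.
  def stop_workers(sup) do
    case child_pid(sup, :worker_dynamic_supervisor) do
      :undefined ->
        :skipped

      dynamic_sup ->
        for {_id, worker, _type, _modules} <- DynamicSupervisor.which_children(dynamic_sup),
            is_pid(worker) do
          DynamicSupervisor.terminate_child(dynamic_sup, worker)
        end

        :ok
    end
  end
end
```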

@nathanmonteleone nathanmonteleone marked this pull request as draft June 17, 2025 19:27
@nathanmonteleone
Switching this to a draft while I figure out how to start the DynamicSupervisor correctly

{DynamicSupervisor, :start_link,
[[name: {:via, ElsaRegistry, {registry(connection), :worker_dynamic_supervisor}}]]},
restart: :transient
}

I had simply made a mistake here -- the tuple form of a child spec does not actually set the ID from the supervisor's perspective. You have to use the map form for that.

I believe the upshot of this mistake was that our calls to Supervisor.terminate_child below were just failing silently.
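
The failure mode can be reproduced in isolation. This sketch assumes a `{module, arg}` tuple spec for illustration (the exact tuple in the old code is only partly visible above), and `ChildIdSketch` is a hypothetical helper. With the tuple, the child's ID comes from `DynamicSupervisor.child_spec/1`, so looking it up as `:worker_dynamic_supervisor` finds nothing, and `_ =` discards the error.

```elixir
defmodule ChildIdSketch do
  # Start a throwaway supervisor with one child, then try to terminate that
  # child by the id the calling code expects it to have.
  def terminate_by_id(child_spec) do
    {:ok, sup} = Supervisor.start_link([child_spec], strategy: :one_for_one)
    result = Supervisor.terminate_child(sup, :worker_dynamic_supervisor)
    :ok = Supervisor.stop(sup)
    result
  end
end

# Tuple form: the id is whatever DynamicSupervisor.child_spec/1 picks.
tuple_spec = {DynamicSupervisor, strategy: :one_for_one}

# Map form: the id is stated explicitly.
map_spec = %{
  id: :worker_dynamic_supervisor,
  start: {DynamicSupervisor, :start_link, [[strategy: :one_for_one]]},
  restart: :transient,
  type: :supervisor
}

IO.inspect(ChildIdSketch.terminate_by_id(tuple_spec))
IO.inspect(ChildIdSketch.terminate_by_id(map_spec))
```

`Supervisor.child_spec/2` can also override the ID of a tuple spec, e.g. `Supervisor.child_spec({DynamicSupervisor, strategy: :one_for_one}, id: :worker_dynamic_supervisor, restart: :transient)`.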

…icSupervisor. It was causing a race condition, and I think having multiple points of ownership just muddies things anyway.
@nathanmonteleone nathanmonteleone marked this pull request as ready for review June 18, 2025 23:09
@ctcline-simplifi

🎉

@nathanmonteleone nathanmonteleone changed the title Assignments complete callback. Fix error in WorkerSupervisor supervision tree. Assignments complete callback. Fix error in WorkerSupervisor supervision tree. Improve lifecycle callback documentation. Jun 19, 2025
@nathanmonteleone nathanmonteleone merged commit 2e9bd10 into main Jun 19, 2025
3 checks passed
@nathanmonteleone nathanmonteleone deleted the assignments_complete_callback branch June 19, 2025 17:00
@nathanmonteleone nathanmonteleone changed the title Assignments complete callback. Fix error in WorkerSupervisor supervision tree. Improve lifecycle callback documentation. Assignments complete callback. Fix race condition in WorkerSupervisoor. Fix error in WorkerSupervisor supervision tree. Improve lifecycle callback documentation. Jun 19, 2025
@nathanmonteleone nathanmonteleone changed the title Assignments complete callback. Fix race condition in WorkerSupervisoor. Fix error in WorkerSupervisor supervision tree. Improve lifecycle callback documentation. Assignments complete callback. Fix race condition in WorkerSupervisor. Fix error in WorkerSupervisor supervision tree. Improve lifecycle callback documentation. Jun 19, 2025