Hi 👋
I'm noticing an unexpected accumulation of locks by our worker processes. It seems that under some circumstances, advisory locks are not being released by the locker. Here's a query that hopefully illustrates the problem:
with
  pid_locks as (
    select pl.pid, count(*)
    from pg_locks pl
    inner join que_lockers ql on pl.pid = ql.pid
    group by pl.pid
  ),
  pid_locked_jobs as (
    select pl.pid, count(*)
    from pg_locks pl
    inner join que_jobs qj on pl.objid = qj.id
    group by pl.pid
  )
select ql.pid, worker_count, queues, listening,
       pid_locks.count as active_locks,
       pid_locked_jobs.count as active_jobs
from que_lockers ql
left join pid_locks on ql.pid = pid_locks.pid
left join pid_locked_jobs on pid_locked_jobs.pid = ql.pid;

(Our job IDs are still in the 32-bit range, so I didn't need to use classid in the pid_locked_jobs join.)
pid | worker_count | queues | listening | active_locks | active_jobs
-------+--------------+-----------+-----------+--------------+-------------
15649 | 12 | {default} | t | 4255 | 3
19095 | 12 | {default} | t | 673 | 4
16646 | 12 | {default} | t | 1148 | 6
31188 | 12 | {default} | t | 7909 | 10
(4 rows)
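A note on the 32-bit caveat, for anyone adapting the query: as far as I understand the PostgreSQL docs, a bigint advisory-lock key shows up in pg_locks with its high 32 bits in classid and its low 32 bits in objid. The Ruby sketch below only illustrates that arithmetic. The helper names are hypothetical and not part of Que, and it assumes non-negative job IDs.

```ruby
# Hypothetical helpers (not part of Que) showing how a 64-bit advisory-lock
# key maps onto the classid/objid columns of pg_locks.
# Assumption: keys are non-negative, as bigserial job IDs are.
MASK_32 = 0xFFFF_FFFF

# Split a 64-bit key into [classid, objid] (high half, low half).
def split_advisory_key(key)
  [(key >> 32) & MASK_32, key & MASK_32]
end

# Rebuild the 64-bit key from the two 32-bit halves.
def join_advisory_key(classid, objid)
  (classid << 32) | objid
end

small = split_advisory_key(123_456)        # fits in 32 bits, so classid is 0
large = split_advisory_key(5_000_000_000)  # exceeds 32 bits, so classid is non-zero
```

While IDs fit in 32 bits, classid stays 0 and joining on objid alone is enough. Beyond that, the join would presumably need to combine both columns, for example with a condition along the lines of (pl.classid::bigint << 32) | pl.objid::bigint = qj.id, restricted to rows where locktype = 'advisory'.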
As you can see, at the moment we have 4 Que processes running (on Heroku, running on 4 different dynos), each with 12 workers. Our throughput is on average 400 jobs/minute, peaking occasionally at 800 jobs/minute. Normally our queue is almost empty.
However, all locker processes hold hundreds or thousands of unreleased locks. I've seen the total number of locks exceed 50k, and I suspect this led to the "out of shared memory" errors we experienced earlier this week. I can see the problem locally as well, but I haven't yet found a simple way to reproduce it.
Has anybody seen similar behavior?
Any ideas on what to explore/test/log would be appreciated!
Stack details:
- Que 1.0.0.beta3 (using the ActiveJob adapter)
- Rails 5.2.4.3
- PostgreSQL 12
- We have a custom job middleware for reporting job metrics, but I can see the problem happening even when the job middleware is disabled