
Locker doesn't release locks for processed jobs under some circumstances #290

@sidonath

Hi 👋

I'm noticing an unexpected accumulation of locks by our worker processes. It seems that under some circumstances, advisory locks are not being released by the locker. Here's a query that hopefully illustrates the problem:

with
  pid_locks as
    (select pl.pid, count(*) from pg_locks pl inner join que_lockers ql on pl.pid = ql.pid group by pl.pid),
  pid_locked_jobs as
    (select pl.pid, count(*) from pg_locks pl inner join que_jobs qj on pl.objid = qj.id group by pl.pid)
select ql.pid, worker_count, queues, listening, pid_locks.count as active_locks, pid_locked_jobs.count as active_jobs
from que_lockers ql
left join pid_locks on ql.pid = pid_locks.pid
left join pid_locked_jobs on pid_locked_jobs.pid = ql.pid;

(Our job IDs are still in 32-bit range, so I didn't need to use classid in the pid_locked_jobs join.)
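For setups where job IDs exceed the 32-bit range, a variant of the `pid_locked_jobs` subquery could rebuild the 64-bit key from both columns. This is only a sketch: it assumes the locks are taken with a single bigint key (which PostgreSQL exposes in `pg_locks` with the high half in `classid`, the low half in `objid`, and `objsubid = 1`) and that the key is the job ID, as in the query above. The `locked_jobs` alias is made up for illustration.

```sql
-- Sketch: count locked jobs per backend, matching on the full 64-bit key.
-- Assumes single-bigint advisory lock keys equal to que_jobs.id.
select pl.pid, count(*) as locked_jobs
from pg_locks pl
inner join que_jobs qj
  -- high 32 bits live in classid, low 32 bits in objid
  on qj.id = ((pl.classid::bigint << 32) | pl.objid::bigint)
where pl.locktype = 'advisory'  -- ignore relation, transaction, etc. locks
  and pl.objsubid = 1           -- 1 marks a single-bigint advisory key
group by pl.pid;
```

Filtering on `locktype = 'advisory'` also keeps non-advisory locks held by the same backend out of the count.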

  pid  | worker_count |  queues   | listening | active_locks | active_jobs
-------+--------------+-----------+-----------+--------------+-------------
 15649 |           12 | {default} | t         |         4255 |           3
 19095 |           12 | {default} | t         |          673 |           4
 16646 |           12 | {default} | t         |         1148 |           6
 31188 |           12 | {default} | t         |         7909 |          10
(4 rows)

As you can see, at the moment we have 4 Que processes running (on Heroku, running on 4 different dynos), each with 12 workers. Our throughput is on average 400 jobs/minute, peaking occasionally at 800 jobs/minute. Normally our queue is almost empty.

However, all locker processes have hundreds or thousands of unreleased locks. I've seen the total number of locks exceed 50k, and I suspect it led to the "out of shared memory" errors we experienced earlier this week. I can see this problem locally as well, but I haven't yet found a simple way to reproduce it.

Has anybody seen similar behavior?

Any ideas on what to explore/test/log would be appreciated!

Stack details:

  • Que 1.0.0.beta3 (using the ActiveJob adapter)
  • Rails 5.2.4.3
  • PostgreSQL 12
  • We have a custom job middleware for reporting job metrics, but I can see the problem happening even when the job middleware is disabled
