Skip to content

Bug: Index Workers Not Consuming RMQ Tasks After Reconnecting #257

@artntek

Description

@artntek

Summary:

After a long (~1hr) disconnect, index workers appear to successfully reconnect to RMQ, but no longer consume enqueued tasks

Not sure how to reproduce this behavior, but documenting it for posterity. It was seen after we had an ADC outage on 7/2/25 (PDC) for about an hour, which involved pods being moved to different nodes, and some network disruptions. Only 2 of the 20 ADC indexer pods were restarted, though.

Here's roughly what happened:

  1. There was a network and/or rabbitmq disruption, so indexer pods couldn't connect to rmq for about an hour (see example log below). This ended around 11pm PDT.
  2. The indexer pods seemed to reconnect to rmq after normal service had been resumed
  3. Just after 9am PDT on 7/3, metacat started submitting tasks to the queue, and the queue size grew steadily to about 40 around 1pm, and stayed there without being consumed. RMQ dashboard showed all 20 connections as running, and all 20 channels as idle.
  4. I then started a reindexing script, that submitted 1 task per second to the queue around 1:43pm, and then all 4 of the "stuck" tasks were immediately consumed, along with these new tasks.

See RMQ dashboard screenshot below:

Image

Here is a sample log extract from one of the indexers:

 8595 dataone-indexer 20250703-04:19:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8596 dataone-indexer 20250703-04:19:19: [ERROR]: DataONE-indexer cannot recreate the RabbitMQ connections/channels since Exception trying to re-initialize connection and channel: met
 8597 dataone-indexer 20250703-04:19:23: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
 8598 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless: No address associated with hostname
 8599   at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
 8600   at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(Unknown Source) ~[?:?]
 8601   at java.net.InetAddress.getAddressesFromNameService(Unknown Source) ~[?:?]
 8602   at java.net.InetAddress$NameServiceAddresses.get(Unknown Source) ~[?:?]
 8603   at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
 8604   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8605   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8606   at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8607   at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8608   at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8609   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8610   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8611   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
 8612   at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
 8613   at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
 8614   at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
 8615   at java.lang.Thread.run(Unknown Source) [?:?]
 8616 dataone-indexer 20250703-04:19:23: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
 8617 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless
 8618   at java.net.InetAddress$CachedAddresses.get(Unknown Source) ~[?:?]
 8619   at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
 8620   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8621   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8622   at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8623   at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8624   at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8625   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8626   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8627   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
 8628   at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
 8629   at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
 8630   at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
 8631   at java.lang.Thread.run(Unknown Source) [?:?]
 8632 dataone-indexer 20250703-04:19:25: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
 8633 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless
 8634   at java.net.InetAddress$CachedAddresses.get(Unknown Source) ~[?:?]
 8635   at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
 8636   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8637   at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
 8638   at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8639   at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8640   at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8641   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8642   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
 8643   at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
 8644   at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
 8645   at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
 8646   at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
 8647   at java.lang.Thread.run(Unknown Source) [?:?]
 8648 dataone-indexer 20250703-04:19:29: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8649 dataone-indexer 20250703-04:19:29: [ERROR]: DataONE-indexer cannot recreate the RabbitMQ connections/channels since Exception trying to re-initialize connection and channel: Con
 8650 dataone-indexer 20250703-04:19:39: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8651 dataone-indexer 20250703-17:08:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8652 dataone-indexer 20250703-19:10:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8653 dataone-indexer 20250703-19:45:39: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
 8654 dataone-indexer 20250703-20:40:29: [DEBUG]: The RabbitMQ connection and channel are healthy. [org.dataone.cn.indexer.IndexWorker:lambda$startReadinessProbe$1:526]
 8655 dataone-indexer 20250703-20:40:39: [DEBUG]: The RabbitMQ connection and channel are healthy. [org.dataone.cn.indexer.IndexWorker:lambda$startReadinessProbe$1:526]

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions