-
Notifications
You must be signed in to change notification settings - Fork 2
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Summary:
After a long (~1hr) disconnect, index workers appear to successfully reconnect to RMQ, but no longer consume enqueued tasks
Not sure how to reproduce this behavior, but documenting it for posterity. It was seen after we had an ADC outage on 7/2/25 (PDC) for about an hour, which involved pods being moved to different nodes, and some network disruptions. Only 2 of the 20 ADC indexer pods were restarted, though.
Here's roughly what happened:
- There was a network and/or rabbitmq disruption, so indexer pods couldn't connect to rmq for about an hour (see example log below). This ended around 11pm PDT.
- The indexer pods seemed to reconnect to rmq after normal service had been resumed
- Just after 9am PDT on 7/3, metacat started submitting tasks to the queue, and the queue size grew steadily to about 40 around 1pm, and stayed there without being consumed. RMQ dashboard showed all 20 connections as
running, and all 20 channels asidle. - I then started a reindexing script, that submitted 1 task per second to the queue around 1:43pm, and then all 4 of the "stuck" tasks were immediately consumed, along with these new tasks.
See RMQ dashboard screenshot below:
Here is a sample log extract from one of the indexers:
8595 dataone-indexer 20250703-04:19:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8596 dataone-indexer 20250703-04:19:19: [ERROR]: DataONE-indexer cannot recreate the RabbitMQ connections/channels since Exception trying to re-initialize connection and channel: met
8597 dataone-indexer 20250703-04:19:23: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
8598 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless: No address associated with hostname
8599 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
8600 at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(Unknown Source) ~[?:?]
8601 at java.net.InetAddress.getAddressesFromNameService(Unknown Source) ~[?:?]
8602 at java.net.InetAddress$NameServiceAddresses.get(Unknown Source) ~[?:?]
8603 at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
8604 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8605 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8606 at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8607 at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8608 at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8609 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8610 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8611 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
8612 at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
8613 at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
8614 at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
8615 at java.lang.Thread.run(Unknown Source) [?:?]
8616 dataone-indexer 20250703-04:19:23: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
8617 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless
8618 at java.net.InetAddress$CachedAddresses.get(Unknown Source) ~[?:?]
8619 at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
8620 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8621 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8622 at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8623 at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8624 at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8625 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8626 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8627 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
8628 at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
8629 at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
8630 at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
8631 at java.lang.Thread.run(Unknown Source) [?:?]
8632 dataone-indexer 20250703-04:19:25: [ERROR]: Caught an exception during connection recovery! [com.rabbitmq.client.impl.ForgivingExceptionHandler:log:119]
8633 java.net.UnknownHostException: metacatarctic-rabbitmq-3-13-7-headless
8634 at java.net.InetAddress$CachedAddresses.get(Unknown Source) ~[?:?]
8635 at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
8636 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8637 at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
8638 at com.rabbitmq.client.DnsRecordIpAddressResolver.resolveIpAddresses(DnsRecordIpAddressResolver.java:83) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8639 at com.rabbitmq.client.DnsRecordIpAddressResolver.getAddresses(DnsRecordIpAddressResolver.java:73) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8640 at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8641 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:628) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8642 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:589) ~[dataone-index-worker-3.1.4-shaded.jar:?]
8643 at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:524) ~[dataone-index-worker-3.1.4-shaded.jar:
8644 at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:839) [dataone-index-worker-3.1.4-shaded.jar:?]
8645 at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:816) [dataone-index-worker-3.1.4-shaded.jar:?]
8646 at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:700) [dataone-index-worker-3.1.4-shaded.jar:?]
8647 at java.lang.Thread.run(Unknown Source) [?:?]
8648 dataone-indexer 20250703-04:19:29: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8649 dataone-indexer 20250703-04:19:29: [ERROR]: DataONE-indexer cannot recreate the RabbitMQ connections/channels since Exception trying to re-initialize connection and channel: Con
8650 dataone-indexer 20250703-04:19:39: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8651 dataone-indexer 20250703-17:08:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8652 dataone-indexer 20250703-19:10:19: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8653 dataone-indexer 20250703-19:45:39: [ERROR]: The RabbitMQ connection or channel were closed. DataONE-indexer has a mechanism to restore them. However, if this error message shows
8654 dataone-indexer 20250703-20:40:29: [DEBUG]: The RabbitMQ connection and channel are healthy. [org.dataone.cn.indexer.IndexWorker:lambda$startReadinessProbe$1:526]
8655 dataone-indexer 20250703-20:40:39: [DEBUG]: The RabbitMQ connection and channel are healthy. [org.dataone.cn.indexer.IndexWorker:lambda$startReadinessProbe$1:526]Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working
