This is great work on developing a multi-agent version of SAC. However, I'm confused about using an RNN in such a MASAC. More specifically, if we employ a GRUCell in the actor, how can we sample a new action during training? The hidden states recorded during execution may no longer match the policy being trained, especially under the off-policy paradigm.
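To make the concern concrete, here is a toy NumPy sketch (hypothetical, not from this repo) of the workaround commonly used with recurrent off-policy agents: store whole observation sequences in the buffer and, at training time, re-unroll the recurrent cell with the *current* parameters from an initial hidden state, rather than reusing the stale hidden states produced during execution. All names (`gru_cell`, `W`, `episode`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, W):
    # Minimal GRU-style cell (toy stand-in for a framework GRUCell).
    z = sigmoid(W["z"] @ np.concatenate([x, h]))       # update gate
    r = sigmoid(W["r"] @ np.concatenate([x, h]))       # reset gate
    n = np.tanh(W["n"] @ np.concatenate([x, r * h]))   # candidate state
    return (1.0 - z) * h + z * n

obs_dim, hid_dim, T = 3, 4, 5
W = {k: rng.standard_normal((hid_dim, obs_dim + hid_dim))
     for k in ("z", "r", "n")}

# Execution time: roll out and store the observation sequence itself
# (not the execution-time hidden states, which would go stale).
episode = [rng.standard_normal(obs_dim) for _ in range(T)]

# Training time: re-unroll the current cell from a zero initial state
# over the stored sequence, so every hidden state is consistent with
# today's parameters before new actions are sampled from the policy head.
h = np.zeros(hid_dim)
hiddens = []
for obs in episode:
    h = gru_cell(obs, h, W)
    hiddens.append(h)
```

Under this scheme the replayed hidden states always reflect the up-to-date actor, at the cost of unrolling the RNN over each sampled sequence; burn-in variants (replaying a prefix only to warm up `h`) trade some of that cost for approximation.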