Hey,
unlike in the paper, implementation has this part with subtracting the action probabilities from Q:
|
if self.min_q_version == 3: |
My guess that the effect would be to have less focus of the loss on a single high-Q action, should policy focus on such. But then we already have temperature parameter. Not sure author will answer, so anybody who knows, I'd appreciate your insights :)