I’ve looked at various articles and I’m still very confused. I understand tabular double Q-learning: you keep two action-value estimates trained on two different sets of samples. But when it comes to the neural-network version, I’m confused.
The normal DQN algorithm uses the target network for both action selection and evaluation when computing the update target.
I was told that the networks are initialized with small weights to prevent overestimation. If that’s the case, why exactly does Double DQN use the online network to select actions? I mean, the target network could handle that, since it would never overestimate values.
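To make my confusion concrete, here’s a small sketch of how I understand the two update targets (the Q-values are made up, just to contrast selection vs. evaluation):

```python
import random

random.seed(0)

# Hypothetical action values at the next state s' for 4 actions,
# standing in for the outputs of the two networks
q_online = [random.gauss(0, 1) for _ in range(4)]  # online net Q(s', .)
q_target = [random.gauss(0, 1) for _ in range(4)]  # target net Q(s', .)

reward, gamma = 1.0, 0.99

# Vanilla DQN: the target network both SELECTS and EVALUATES
dqn_target = reward + gamma * max(q_target)

# Double DQN: the online network SELECTS the action,
# the target network EVALUATES it
a_star = max(range(4), key=lambda a: q_online[a])
ddqn_target = reward + gamma * q_target[a_star]
```

If I understand correctly, the Double DQN target can never exceed the vanilla DQN target, since evaluating the online network's chosen action under the target network is at most the target network's own max. Is that the whole point, or am I missing something?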
Can someone shed more light on this?
Thank you in advance!