I started learning about Q-tables from the blog post Introduction to reinforcement learning and OpenAI Gym, by Justin Francis, which contains the following line:
After so many episodes, the algorithm will converge and determine the optimal action for every state using the Q table, ensuring the highest possible reward. We now consider the environment problem solved.
The Q-table was updated with the Q-learning update rule:
Q[state,action] += alpha * (reward + np.max(Q[state2]) - Q[state,action])
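To make sure I understand the update, here is a minimal self-contained sketch of that same update line on a toy two-state, two-action MDP of my own invention (not the blog's Gym environment; the dynamics in `step` are an assumption for illustration). Note the update as written has no discount factor gamma, i.e. it behaves like gamma = 1:

```python
import numpy as np

np.random.seed(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha = 0.1  # learning rate

def step(state, action):
    # Hypothetical dynamics: reward 1 only for action 1 in state 0,
    # and the agent stays in the same state for simplicity.
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    return state, reward  # (state2, reward)

for _ in range(1000):
    state = np.random.randint(n_states)
    action = np.random.randint(n_actions)  # pure random exploration
    state2, reward = step(state, action)
    # The exact update line from the post (no discount factor):
    Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])
```

After enough iterations, `Q[0, 1]` ends up larger than `Q[0, 0]`, so the greedy action in state 0 is the rewarding one, which is what I understand "converge and determine the optimal action" to mean.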
I ran 100000 episodes, and toward the end the total rewards looked like this:
Episode 99250 Total Reward: 9
Episode 99300 Total Reward: 7
Episode 99350 Total Reward: 6
Episode 99400 Total Reward: 14
Episode 99450 Total Reward: 10
Episode 99500 Total Reward: 10
Episode 99550 Total Reward: 9
Episode 99600 Total Reward: 14
Episode 99650 Total Reward: 5
Episode 99700 Total Reward: 7
Episode 99750 Total Reward: 3
Episode 99800 Total Reward: 5
I don’t know what the highest possible reward is, and these numbers do not look like they have converged. The graph does show a converging trend, but it was plotted on a much larger scale.
When the environment is reset() but the “learned” Q-table is available, what sequence of actions should be taken? How do we determine that sequence, and what reward does it earn?
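My current guess is that after training you act greedily, i.e. at each step pick the action with the highest Q-value for the current state. Here is a sketch of what I mean, assuming a classic Gym-style `env` whose `step` returns `(state, reward, done, info)` (the environment itself is hypothetical here):

```python
import numpy as np

def run_greedy_episode(env, Q, max_steps=100):
    """Roll out one episode using only the learned Q-table (no exploration)."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = int(np.argmax(Q[state]))  # greedy action from the Q-table
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Is this the right way to read off the learned policy, and is the `total_reward` it returns the number I should compare against the per-episode rewards above?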