I am implementing OpenAI Gym's CartPole problem using Deep Q-Learning (DQN). I followed tutorials (video and otherwise) and learned all about it. I implemented the code myself and thought it should work, but the agent is not learning. I would really appreciate it if someone could pinpoint where I am going wrong.

Note that I have a target network in addition to the main network, and I update the target network with the weights of the main network every few episodes.

Here is my code:

```python
import random
from collections import deque

import numpy as np
import gym
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

EPISODES = 1000
BATCH_SIZE = 32
DISCOUNT = 0.95
UPDATE_TARGET_EVERY = 5

env = gym.make('CartPole-v0')
STATE_SIZE = env.observation_space.shape[0]
ACTION_SIZE = env.action_space.n

class DQNAgents:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.replay_memory = deque(maxlen = 2000)
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.model = self.create_model()
        self.target_model = self.create_model()
        self.target_model.set_weights(self.model.get_weights())
        self.target_update_counter = 0

    def create_model(self):
        model = Sequential()
        model.add(Dense(10, input_dim = self.state_size, activation = 'relu'))
        model.add(Dense(10, activation = 'relu'))
        model.add(Dense(self.action_size, activation = 'linear'))
        model.compile(loss = 'mse', optimizer = Adam(lr = 0.001))
        return model

    def update_replay_memory(self, current_state, action, reward, next_state, done):
        self.replay_memory.append((current_state, action, reward, next_state, done))

    def get_qs(self, state):
        return self.model.predict(np.array(state).reshape(-1, *state.shape))[0]

    def train(self, terminal_state):
        # Sample from replay memory
        minibatch = random.sample(self.replay_memory, BATCH_SIZE)

        # Picks the current states from the randomly selected minibatch
        current_states = np.array([t[0] for t in minibatch])
        current_qs_list = self.model.predict(current_states)
        next_states = np.array([t[3] for t in minibatch])
        future_qs_list = self.target_model.predict(next_states)

        X, y = [], []
        for index, (current_state, action, reward, next_state, done) in enumerate(minibatch):
            if not done:
                new_q = reward + DISCOUNT * np.max(future_qs_list[index])
            else:
                new_q = reward
            current_qs = current_qs_list[index]
            current_qs[action] = new_q
            X.append(current_state)
            y.append(current_qs)

        # Fit the model, i.e. reducing the loss using gradient descent
        self.model.fit(np.array(X), np.array(y), batch_size = BATCH_SIZE,
                       verbose = 0, shuffle = False)

        # Update target network counter every episode
        if terminal_state:
            self.target_update_counter += 1

        # If the counter is big enough, update target network with weights of main network
        if self.target_update_counter > UPDATE_TARGET_EVERY:
            self.target_model.set_weights(self.model.get_weights())
            self.target_update_counter = 0

''' We start here '''
agent = DQNAgents(STATE_SIZE, ACTION_SIZE)
for e in range(EPISODES):
    done = False
    current_state = env.reset()
    time = 0
    total_reward = 0
    while not done:
        if np.random.random() < agent.epsilon:
            action = random.randrange(ACTION_SIZE)
        else:
            action = np.argmax(agent.get_qs(current_state))
        next_state, reward, done, _ = env.step(action)
        agent.update_replay_memory(current_state, action, reward, next_state, done)
        if len(agent.replay_memory) < BATCH_SIZE:
            pass
        else:
            agent.train(done)
        total_reward += reward
        current_state = next_state
        time += 1
    print(f'episode : {e}, steps {time}, epsilon : {agent.epsilon}')
    if agent.epsilon > agent.epsilon_min:
        agent.epsilon *= agent.epsilon_decay
```

Results for the first 50 episodes (the step counts should be increasing and should reach a maximum of 199, but they just fluctuate between roughly 9 and 60 — values like 9, 13, 21, 34, 42, 53, 60 — with no upward trend):

```
episode : 0, steps ..., epsilon : 1
episode : 1, steps ..., epsilon : 0.995
episode : 2, steps ..., epsilon : 0.990025
episode : 3, steps ..., epsilon : 0.985074875
episode : 4, steps ..., epsilon : 0.9801495006250001
...
episode : 49, steps ..., epsilon : 0.7822236754458713
episode : 50, steps ..., epsilon : 0.778312557068642
```
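For reference, the target used in `train` for a non-terminal transition is the one-step Bellman target, `reward + DISCOUNT * max_a Q_target(next_state, a)`. Here is a tiny worked example of just that computation, with made-up Q-values (the array `future_qs` is illustrative, not taken from the question's run):

```python
import numpy as np

DISCOUNT = 0.95  # same discount factor as in the question

reward = 1.0                      # CartPole gives +1 per step survived
future_qs = np.array([0.4, 1.2])  # made-up target-network Q-values for next_state
done = False

# Terminal transitions use the raw reward; otherwise bootstrap from the target network
new_q = reward if done else reward + DISCOUNT * np.max(future_qs)
print(new_q)  # 1.0 + 0.95 * 1.2 ≈ 2.14
```

This value then overwrites the Q-value of the action actually taken, and the network is fit toward the patched Q-vector.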
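Incidentally, the epsilon column of the log can be sanity-checked against the decay schedule: epsilon is multiplied by `epsilon_decay = 0.995` once per episode after printing, so the value printed at 0-indexed episode e is 0.995^e. A minimal sketch (the helper name `epsilon_at` is hypothetical, not from the question's code):

```python
EPSILON_DECAY = 0.995  # epsilon_decay from the question

def epsilon_at(episode):
    """Epsilon printed at a given 0-indexed episode, decayed once per prior episode."""
    return 1.0 * EPSILON_DECAY ** episode

print(epsilon_at(0))   # 1.0
print(epsilon_at(1))   # 0.995
print(epsilon_at(50))  # ~0.7783, matching the last epsilon in the log
```

This only confirms the exploration schedule is behaving as coded; it says nothing about whether the Q-network itself is learning.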