Deep Q Learning in PyTorch

Hi,
I am trying to implement the Deep Q Learning algorithm with a target network and a policy network in PyTorch, but it seems that my model is not able to learn anything.
I have looked at the tutorial for this on the official PyTorch page and I can't seem to find any mistake in my code.
I am attaching part of my code here, along with the results, for reference.
If anyone has experience with deep Q Learning, some pointers would be very helpful.
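
For reference, the update I am trying to implement is the standard one-step DQN target,

y = reward + gamma * max_a' QTarget(newObservation, a')

where the policy network Qupdate is trained to move Q(observation, action) toward y, and QTarget is a copy of Qupdate that is refreshed every self.replace steps.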

    def chooseAction(self, observation):
        # epsilon-greedy: exploit the policy network with probability 1 - epsilon,
        # otherwise sample a random action
        if np.random.random() > self.epsilon:
            self.Qupdate.eval()
            with torch.no_grad():
                observation = torch.Tensor(observation)
                action = torch.argmax(self.Qupdate(observation)).item()
            self.Qupdate.train()
        else:
            action = env.action_space.sample()
        return action

    def decrementEpsilon(self):
        # linearly decay epsilon down to epsilonMin
        if self.epsilon > self.epsilonMin:
            self.epsilon -= self.epsilonDec
        else:
            self.epsilon = self.epsilonMin
        return

    def updateQ(self, observation, newObservation, action, reward):
        self.Qupdate.train()
        self.QTarget.eval()
        self.Qupdate.optimizer.zero_grad()
        observation = torch.Tensor(observation)
        newObservation = torch.Tensor(newObservation)
        action = torch.tensor(action)
        reward = torch.tensor(reward)
        # Q(s, a) from the policy network for the action that was taken
        qOld = self.Qupdate(observation)[action]
        # copy the policy network weights into the target network every self.replace steps
        if self.step % self.replace == 0:
            self.QTarget.load_state_dict(self.Qupdate.state_dict())
        # max_a' Q_target(s', a'), detached so no gradient flows through the target
        # with torch.no_grad():
        qNew = torch.max(self.QTarget(newObservation)).detach()
        # one-step TD target and loss on the policy network
        y = reward + self.gamma * qNew
        loss = self.Qupdate.criterion(qOld, y)
        loss.backward()
        self.Qupdate.optimizer.step()
        self.decrementEpsilon()
        self.step += 1
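
The network definition is not shown above; Qupdate and QTarget are modules with the optimizer and criterion attached, which is how updateQ uses them. A simplified sketch of what such a network might look like (the class name, layer sizes, optimizer and loss here are assumptions for illustration, not my exact code):

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class LinearDQN(nn.Module):
    def __init__(self, inputDims, nActions, lr):
        super().__init__()
        # small MLP: observation -> hidden layer -> one Q value per action
        self.fc1 = nn.Linear(*inputDims, 128)
        self.fc2 = nn.Linear(128, nActions)
        # optimizer and loss live on the module, matching self.Qupdate.optimizer
        # and self.Qupdate.criterion in updateQ
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.criterion = nn.MSELoss()

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return self.fc2(x)

With a network like this, Qupdate(observation) returns one Q value per action, so the torch.argmax in chooseAction and torch.max in updateQ index over actions.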

My main loop is as follows:

env = gym.make('CartPole-v1')
scores = []
winPct = []
nGames = 10000
# Agent2 args: gamma, starting epsilon, epsilon decrement, minimum epsilon,
# observation shape, number of actions, environment
agent = Agent2(0.99, 1, 1e-5, 0.01, env.observation_space.shape, env.action_space.n, env)
for i in range(nGames):
    done = False
    score = 0
    observation = env.reset()
    while not done:
        action = agent.chooseAction(observation)
        newObservation, reward, done, info = env.step(action)
        score += reward
        # learn from every single transition (no replay buffer)
        agent.updateQ(observation, newObservation, action, reward)
        observation = newObservation
    scores.append(score)
    if i % 100 == 0:
        winPct.append(np.mean(scores[-100:]))
        print(f'episode {i},winPct {np.mean(scores[-100:]):.2f},score {score},epsilon {agent.epsilon:.2f}')

The results are as follows:

episode 0,winPct 15.00,score 15.0,epsilon 1.00
episode 100,winPct 22.72,score 32.0,epsilon 0.98
episode 200,winPct 22.57,score 74.0,epsilon 0.95
episode 300,winPct 20.67,score 17.0,epsilon 0.93
episode 400,winPct 22.96,score 16.0,epsilon 0.91
episode 500,winPct 23.18,score 17.0,epsilon 0.89
episode 600,winPct 19.84,score 20.0,epsilon 0.87
episode 700,winPct 22.15,score 12.0,epsilon 0.85
episode 800,winPct 19.47,score 24.0,epsilon 0.83
episode 900,winPct 20.63,score 24.0,epsilon 0.81
episode 1000,winPct 20.29,score 38.0,epsilon 0.79
episode 1100,winPct 21.63,score 19.0,epsilon 0.76
episode 1200,winPct 20.20,score 14.0,epsilon 0.74
episode 1300,winPct 17.31,score 13.0,epsilon 0.73
episode 1400,winPct 18.39,score 16.0,epsilon 0.71
episode 1500,winPct 17.79,score 19.0,epsilon 0.69
episode 1600,winPct 17.33,score 21.0,epsilon 0.67

When I avoid the target network ('QTarget'), use only the policy network ('Qupdate'), and do not call .detach() in 'updateQ', I get good results, but as soon as I start using .detach() my performance drops substantially.
The official PyTorch tutorial on Deep Q Learning also uses .detach(), so I don't know what is going on with my code.
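
To make the comparison concrete, the two ways I have computed the bootstrap value are roughly:

# works well for me: bootstrap from the policy network itself, no detach
qNew = torch.max(self.Qupdate(newObservation))

# performance drops: bootstrap from the target network, detached
qNew = torch.max(self.QTarget(newObservation)).detach()

# in both cases
y = reward + self.gamma * qNew
loss = self.Qupdate.criterion(qOld, y)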

Thanks.