Masked DQN randomly stuck with no error

I am quite new to RL and have been following the DQN tutorial,
trying to adapt it to the Bao board game:

  • the environment accepts any opponent, but for now I am using a random one
  • the board is 4x12, with the bottom two rows being the player’s side

I have been training it, but sometimes training gets stuck in the learning loop. It doesn’t get progressively slower; it just stops at a random point.

When I kill the process, I can see that it was executing the following function:

    def allowed_actions_mask(state):
        my_status = state[:, 2:, :]  # bottom two lines are my part of the board
        action_mask = torch.zeros(my_status.shape, dtype=bool, device=my_status.device)
        normal_game = (torch.amax(my_status, dim=(1,2)) > 1)
        action_mask[normal_game, :] = (my_status[normal_game, :] > 1)
        action_mask[~normal_game, :] = (my_status[~normal_game, :] > 0)
        return action_mask

It is used to mask the output of the Q-network as follows:

    def forward(self, state):
        flattened_state = torch.flatten(state, start_dim=1)
        encoded_state = torch.nn.functional.one_hot(flattened_state, num_classes=self.max_stones).float()
        prediction =, 2, self.width)
        allowed = self.allowed_actions_mask(state)
        prediction[~allowed] = float('-inf')
        del allowed, flattened_state, encoded_state
        return prediction

  • Any idea why learning would get stuck randomly, with no error, at a learning step?
  • How should I approach debugging it?

I checked GPU utilization and RAM, and both are well below maximum usage…
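
On the debugging question, one low-effort option is a watchdog built on Python's standard `faulthandler` module: arm a timer before each step and, if the step takes too long, have Python write the stack of every thread to a file while the run keeps going. The sketch below is only an illustration. `run_with_watchdog`, `fake_step` and `demo` are hypothetical names, and `fake_step` stands in for one episode or one call to `agent.learn()`. Keep in mind that CUDA kernels are launched asynchronously, so the line reported when the process is killed may simply be the first place that has to wait for the GPU (boolean-mask indexing is one such place) rather than the actual culprit. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` for a debugging run makes the reported location more trustworthy.

```python
import faulthandler
import os
import tempfile
import time


def run_with_watchdog(step_fn, num_steps, timeout_s, log_path):
    """Call step_fn(i) for each i. If a single call takes longer than
    timeout_s seconds, dump the Python stack of all threads to log_path.
    The run itself is not interrupted."""
    with open(log_path, "w") as log_file:
        for i in range(num_steps):
            # Arm the watchdog for this step only.
            faulthandler.dump_traceback_later(timeout_s, file=log_file)
            try:
                step_fn(i)
            finally:
                # Disarm it, so a fast step leaves nothing in the log.
                faulthandler.cancel_dump_traceback_later()


def demo():
    """Simulate one slow step and return what the watchdog logged."""

    def fake_step(i):
        # Hypothetical stand-in for one episode / one agent.learn() call.
        if i == 1:
            time.sleep(0.5)  # pretend this step hangs

    log_path = os.path.join(tempfile.mkdtemp(), "stuck.log")
    run_with_watchdog(fake_step, num_steps=3, timeout_s=0.1, log_path=log_path)
    with open(log_path) as f:
        return f.read()
```

In a real run the timeout would be minutes rather than a fraction of a second, and the log then shows which Python frame each thread was in when the step stalled.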


Hey @sesquipedale
thanks for your question.
What’s the error message? Can you give more details about your training loop?
What do you mean by “it gets stuck at a training loop”?
At which line does it stop?
What version of pytorch are you using? What CUDA? Can you run this

# For security purposes, please check the contents of before running it.

and give us the output?

Hi vmoens, thanks for answering.

This is the training loop I am running (using tqdm for tracking progress).
Sometimes “the loop is stuck”, meaning the counter i doesn’t increase for several hours and no error message is shown. At that point I usually kill the process, and PyCharm tells me that it was killed inside the function “allowed_actions_mask” I posted above. I am not even sure that this says anything about the allowed_actions_mask function itself.
Any kind of indication on how to debug this issue would be greatly appreciated!

    for i in tqdm(range(num_episodes)):
        state = env.reset()
        while True:
            action = agent.act(state)
            next_state, reward, done, info = env.step(action)
            agent.cache(state, next_state, action, reward, done)
            agent.learn()  # assumed: the learning step is not shown in the posted loop
            state = copy.deepcopy(next_state)
            if done:
                if info['won']:
                    game_result[i] = 1
                elif info['lost']:
                    game_result[i] = 0
                break  # end of episode; without this the inner loop never exits

The “learn” function is fairly standard (I mostly took it from the tutorial), as follows:

    def learn(self):
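        if self.curr_step % self.sync_every == 0:  # sync target_NN with online_NN
            # assumed attribute names, taken from the comment above
            self.target_NN.load_state_dict(self.online_NN.state_dict())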
        if self.curr_step < self.burn_in:  # skip if within burn_in period
            return None, None
        if self.curr_step % self.learn_every != 0:  # skip if not learning interval
            return None, None
        # sample batch from memory
        state, next_state, action, reward, done = self.recall()
        # Get TD Estimate
        td_est = self.td_estimate(state, action)
        # Get TD Target
        td_tgt = self.td_target(reward, next_state, done)
        # Backpropagate loss for the online_NN:
        # - compute loss
        huber_loss = torch.nn.SmoothL1Loss()
        loss = huber_loss(td_est, td_tgt)
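        # - reset all gradients to zero (by default PyTorch accumulates gradients on every call to backward())
        self.optimizer.zero_grad()  # assumes the optimizer is stored as self.optimizer
        # - compute all gradients
        loss.backward()
        # - update parameters
        self.optimizer.step()
        del loss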

Running gives:

Collecting environment information...
PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Home
GCC version: ( GCC-6.3.0-1) 6.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.15 (main, Nov 24 2022, 14:39:17) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.99
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1660 Ti
Nvidia driver version: 522.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2021.4.0           haa95532_640  
[conda] mkl-service               2.4.0            py39h2bbff1b_0  
[conda] mkl_fft                   1.3.1            py39h277e83a_0  
[conda] mkl_random                1.2.2            py39hf11a4ad_0  
[conda] numpy                     1.23.1           py39h7a0a035_0  
[conda] numpy-base                1.23.1           py39hca35cd5_0  
[conda] pytorch                   1.13.0          py3.9_cuda11.7_cudnn8_0    pytorch
[conda] pytorch-cuda              11.7                 h67b0de4_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.13.0                   pypi_0    pypi
[conda] torchvision               0.14.0                   pypi_0    pypi

Is it possible in this game to get into a situation where neither player can win and a “stalemate” should be declared?

I know that in chess, for example, a game can be declared a draw if fifty moves pass without a capture or a pawn move (the fifty-move rule).

Perhaps you could make such a rule, if the game can get in such a state.
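
Such a rule can be approximated with a cap on the number of steps per episode, which also works as a cheap test of whether an episode ever fails to terminate. The sketch below assumes an environment with the same `reset()` / `step()` interface as the training loop above. `play_episode`, `max_steps` and `NeverEndingEnv` are hypothetical names introduced for illustration.

```python
def play_episode(env, policy, max_steps=1000):
    """Play one episode and return (steps_played, truncated).

    truncated is True when the step cap was hit before the environment
    reported done, i.e. the game did not end by itself."""
    state = env.reset()
    for step in range(max_steps):
        action = policy(state)
        state, reward, done, info = env.step(action)
        if done:
            return step + 1, False
    return max_steps, True


class NeverEndingEnv:
    """Toy environment that never reports done (worst case for the loop)."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, False, {}
```

Logging how often `truncated` comes back `True` shows directly whether the hang is an episode that never ends or something in the learning step.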

[quote="J_Johnson, post:4, topic:167857"]
Perhaps you could make such a rule, if the game can get in such a state.
[/quote]

It’s not possible to have a stalemate…

I am wondering if it could be a problem with the (correct) statement

    prediction[~allowed] = float('-inf')

since other tutorials I saw multiply by a mask instead of overwriting the values… Could there be numerical stability problems with float('-inf')? Any suggestion is welcome…
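
To make the comparison concrete, here is a small pure-Python sketch of the two masking styles and of how `-inf` behaves in a TD target. Python floats and PyTorch float tensors both follow IEEE 754 here, so the arithmetic carries over. All function names are hypothetical, and the target formula `reward + (1 - done) * gamma * max_next_q` is an assumed, commonly used form, not necessarily the one in the `td_target` above.

```python
import math

NEG_INF = float("-inf")


def mask_with_neg_inf(q_values, allowed):
    """Overwrite the Q-values of illegal actions with -inf."""
    return [q if ok else NEG_INF for q, ok in zip(q_values, allowed)]


def mask_by_multiplying(q_values, allowed):
    """Multiply Q-values by a 0/1 mask (illegal actions become 0)."""
    return [q * float(ok) for q, ok in zip(q_values, allowed)]


def argmax(values):
    """Index of the first largest value."""
    return max(range(len(values)), key=values.__getitem__)


def td_target(reward, done, max_next_q, gamma=0.9):
    """Assumed common form: the bootstrap term is multiplied by (1 - done)."""
    return reward + (1 - done) * gamma * max_next_q


def td_target_guarded(reward, done, max_next_q, gamma=0.9):
    """Same target, but the bootstrap term is skipped, not zeroed, when done."""
    return reward if done else reward + gamma * max_next_q


q = [-0.5, -1.0, -0.2]
legal = [True, False, False]

best_inf = argmax(mask_with_neg_inf(q, legal))    # picks the legal action 0
best_mul = argmax(mask_by_multiplying(q, legal))  # picks an illegal action:
                                                  # 0 beats the negative legal value

# A state with no legal action: every entry is -inf, and so is the max.
no_moves_max = max(mask_with_neg_inf(q, [False, False, False]))

nan_target = td_target(1.0, True, no_moves_max)   # 0 * -inf = nan
inf_target = td_target(1.0, False, no_moves_max)  # -inf
safe_target = td_target_guarded(1.0, True, no_moves_max)
```

So `-inf` is fine for choosing an action as long as at least one action is legal, and multiplying by a mask can select illegal actions when the legal Q-values are negative. The risk with `-inf` is a state whose mask is entirely False: its maximum is `-inf`, which turns the target into `-inf` or `nan` and then spreads through the loss into the weights.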