# Calculating multiple outputs at the same time in the Reinforcement Learning tutorial

I am currently implementing a DQN architecture for a Ludo game, and since I am a newbie to PyTorch I have been following the tutorial that can be found here.

There is one particular part of the tutorial that I am not able to understand completely, namely the following section of code:

```python
if len(memory) < BATCH_SIZE:
    return
transitions = memory.sample(BATCH_SIZE)
# Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
# detailed explanation). This converts batch-array of Transitions
# to Transition of batch-arrays.
batch = Transition(*zip(*transitions))

# Compute a mask of non-final states and concatenate the batch elements
# (a final state would've been the one after which simulation ended)
non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                        batch.next_state)),
                              device=device, dtype=torch.bool)
non_final_next_states = torch.cat([s for s in batch.next_state
                                   if s is not None])
state_batch = torch.cat(batch.state)
action_batch = torch.cat(batch.action)
reward_batch = torch.cat(batch.reward)

# Compute Q(s_t, a) - the model computes Q(s_t), then we select the
# columns of actions taken. These are the actions which would've been taken
# for each batch state according to policy_net
state_action_values = policy_net(state_batch).gather(1, action_batch)

# Compute V(s_{t+1}) for all next states.
# Expected values of actions for non_final_next_states are computed based
# on the "older" target_net; selecting their best reward with max(1)[0].
# This is merged based on the mask, such that we'll have either the expected
# state value or 0 in case the state was final.
next_state_values = torch.zeros(BATCH_SIZE, device=device)
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
# Compute the expected Q values
expected_state_action_values = (next_state_values * GAMMA) + reward_batch

# Compute Huber loss
loss = F.smooth_l1_loss(state_action_values,
                        expected_state_action_values.unsqueeze(1))

# Optimize the model
```
Is it calculating the output for all the state samples at the same time? If I'm not wrong, where it says `state_batch = torch.cat(batch.state)` it is concatenating all the states from the samples that make up the batch, and then it feeds the whole tensor to the neural network. Applying `gather(1, action_batch)` then makes each output tensor keep only the value of the action that was actually executed (although I don't understand exactly how it does this, because in DQN shouldn't we only take into account the gradient for the action that was actually taken, and make the gradient for the values representing the other actions equal to zero?).
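
To check my understanding, here is a minimal toy sketch of what I think `policy_net(state_batch).gather(1, action_batch)` is doing (the layer sizes, state features, and action indices below are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-in for policy_net: 5 state features in, 4 action values out
policy_net = nn.Linear(5, 4)

# A batch of 3 concatenated states, shape (3, 5): one forward pass for all of them
state_batch = torch.randn(3, 5)
q_values = policy_net(state_batch)   # shape (3, 4): Q(s, a) for every state at once

# The actions that were actually executed, one index per state, shape (3, 1)
action_batch = torch.tensor([[2], [0], [3]])

# For each row (state), gather picks the column given by that row's action index
state_action_values = q_values.gather(1, action_batch)  # shape (3, 1)
```

If that is right, then since only the gathered entries feed into the loss, backpropagation would automatically produce zero gradient for the outputs of the non-executed actions, with no explicit masking needed.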
After that, it seems like it does the same for the `next_states`, using the "older" `target_net` and taking the maximum over actions (with final states masked to zero), and then multiplies the whole vector by `GAMMA` and adds the reward.
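
As a sanity check on that last step, here is a small sketch with made-up numbers of how I picture the target computation, including the masking of final states to zero:

```python
import torch

BATCH_SIZE = 4
GAMMA = 0.99

# Suppose the third transition in the batch ended the episode (final state)
non_final_mask = torch.tensor([True, True, False, True])

# Hypothetical max_a Q_target(s_{t+1}, a) for the three non-final next states
max_target_q = torch.tensor([1.5, 0.8, 2.0])

reward_batch = torch.tensor([0.0, 1.0, -1.0, 0.5])

# Final states contribute zero future value; the rest get the target net's maximum
next_state_values = torch.zeros(BATCH_SIZE)
next_state_values[non_final_mask] = max_target_q

expected_state_action_values = (next_state_values * GAMMA) + reward_batch
print(expected_state_action_values)  # tensor([ 1.4850,  1.7920, -1.0000,  2.4800])
```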