Calculating various outputs at the same time in Reinforcement Learning tutorial

I am currently implementing a DQN-learning architecture for a Ludo game, and I’ve been following the tutorial that can be found here as I am a newbie to PyTorch.

There is one particular part of the tutorial that I don't completely understand. It is the following section of code:

    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
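    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        # Clip each gradient element to [-1, 1]
        param.grad.data.clamp_(-1, 1)
    optimizer.step()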

Is it calculating the output for all the sampled states at the same time? If I'm not wrong, the line starting with state_batch = concatenates the states of all the samples in the batch, and the whole tensor is then fed to the neural network. By applying gather(1, action_batch), I guess it picks, from each row of the output, the value of the action that was actually executed (although I don't understand exactly how it does this, because in DQN, shouldn't we only take into account the gradient for the action that was actually taken, and set the gradient for the values that represent other actions to zero?).
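To make the gather question concrete, here is a minimal sketch of how I currently read it. The tensor values are toy numbers I made up, and q_values is only a stand-in for the output of policy_net(state_batch), not the tutorial's actual network:

```python
import torch

# Toy stand-in for policy_net(state_batch): 3 states, 2 actions each.
q_values = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]], requires_grad=True)

# Actions taken in each transition, shape (3, 1) like action_batch.
action_batch = torch.tensor([[1], [0], [1]])

# gather(1, idx) picks q_values[i, idx[i, 0]] for every row i.
state_action_values = q_values.gather(1, action_batch)

# Backpropagate a dummy scalar to see which entries receive gradient.
state_action_values.sum().backward()
grad = q_values.grad
```

If I run this, state_action_values holds one Q-value per row, and grad is non-zero only at the positions that gather selected, which would suggest the other actions get zero gradient automatically.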

After that, it seems like it does the same but for the next_states, and multiplies the whole vector by gamma and adds the reward.
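A small sketch of how I understand this step, under stated assumptions: the states, rewards, and GAMMA below are toy values, and target_net is a hypothetical fixed linear map I wrote for illustration, not the tutorial's network:

```python
import torch

GAMMA = 0.5  # toy discount factor, chosen so the arithmetic is easy to follow

# Toy batch of 3 transitions; the second one ended the episode (None).
next_states = [torch.tensor([[1.0, 0.0]]), None, torch.tensor([[0.0, 1.0]])]
reward_batch = torch.tensor([1.0, 10.0, 1.0])

non_final_mask = torch.tensor([s is not None for s in next_states],
                              dtype=torch.bool)
non_final_next_states = torch.cat([s for s in next_states if s is not None])

# Hypothetical stand-in for target_net: maps a state to 2 action values.
weights = torch.tensor([[2.0, 4.0],
                        [6.0, 8.0]])

def target_net(x):
    return x @ weights

# Final states keep a value of 0; the rest get the max action value.
next_state_values = torch.zeros(len(next_states))
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]

expected_state_action_values = next_state_values * GAMMA + reward_batch
```

So, as far as I can tell, the whole batch is handled in one pass, and the mask only decides which rows receive a bootstrapped value instead of 0.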

Another thing that I don't fully get is how it calculates the loss. When the loss function receives a batch of several values instead of a single one, does it calculate an independent loss for each, an average loss, …?
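To test this myself, I tried the following minimal sketch. The pred and target tensors are made-up values shaped like state_action_values and expected_state_action_values.unsqueeze(1); it relies on the reduction argument of F.smooth_l1_loss, whose default I believe is "mean":

```python
import torch
import torch.nn.functional as F

# Toy predictions and targets, shape (3, 1): one value per transition.
pred = torch.tensor([[0.0], [0.0], [0.0]])
target = torch.tensor([[0.5], [2.0], [0.0]])

# reduction="none" keeps one loss value per element.
per_sample = F.smooth_l1_loss(pred, target, reduction="none")

# Called as in the tutorial, with no reduction argument.
mean_loss = F.smooth_l1_loss(pred, target)
```

Here per_sample keeps the batch shape, while mean_loss is a single scalar that matches per_sample.mean(), so the call in the tutorial appears to average the per-sample losses over the batch.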