Input and hidden tensors are not on the same device

Hello.

I’m getting the error in the title while testing an InfoGAN network. I know it’s been asked many times, but I’m stuck.

I checked my training script, and everything that’s supposed to be on the GPU seems to be there.

Based on the stacktrace, I think the problem is in my encoder, although I can’t spot it.

Since I think the encoder is the problem and the codebase is big, I’m only sharing the encoder part, but if you spot additional issues based on the stacktrace, I’d gladly share more code.

Thanks in advance.

import torch
import torch.nn as nn


class Encoder(nn.Module):
    """
    Encoder model that is used in both the Generator and
    Discriminator layers.
    """
    def __init__(
        self,
        dimension_embedding = 64,
        dimension_hidden = 64,
        dimension_mlp = 1024,
        count_layer = 1,
        rate_dropout = 0.0
    ):
        super(Encoder, self).__init__()

        self.dimension_mlp = 1024
        # self.dimension_mlp = dimension_mlp
        self.dimension_hidden = dimension_hidden
        self.dimension_embedding = dimension_embedding
        self.count_layer = count_layer

        self.encoder = nn.LSTM(
            dimension_embedding,
            dimension_hidden,
            count_layer,
            dropout = rate_dropout
        )

        self.spatial_embedding = nn.Linear(2, dimension_embedding)

    def _initialise_hidden(self, batch):
        """
        Generates the initial LSTM state as a tuple of zero tensors.

        Output:
            - a tuple (h_0, c_0), each of shape (self.count_layer, batch, self.dimension_hidden)
        """
        return (
            torch.zeros(
                self.count_layer,
                batch,
                self.dimension_hidden,
                device = EnvironmentTrain().device
            ),
            torch.zeros(
                self.count_layer,
                batch,
                self.dimension_hidden,
                device = EnvironmentTrain().device
            )
        )

    def forward(self, trajectory_observation):
        """
        Forward pass of the encoder.

        Inputs:
            - trajectory_observation: Tensor of shape (length_observed, batch, 2)

        Output:
            - final_hidden: Tensor of shape (self.count_layer, batch, self.dimension_hidden)
        """
        batch = trajectory_observation.size(1)
        trajectory_observation_embedding = self.spatial_embedding(trajectory_observation.reshape(-1, 2))
        trajectory_observation_embedding = trajectory_observation_embedding.view(
            -1,
            batch,
            self.dimension_embedding
        )
        state_tuple = self._initialise_hidden(batch)
        _, state = self.encoder(trajectory_observation_embedding, state_tuple)
        final_hidden = state[0]

        return final_hidden

Based on the code snippet I would guess EnvironmentTrain().device is not returning the correct device and seems to initialize the hidden states on the CPU.
Could you add debug print statements to _initialise_hidden and check which device is used?
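
Something along these lines would do. It's just a sketch that reuses your EnvironmentTrain singleton and only adds prints:

    def _initialise_hidden(self, batch):
        device = EnvironmentTrain().device  # your singleton, unchanged
        # Debug: print the device the hidden states are created on
        print("hidden init device:", device)
        h_0 = torch.zeros(self.count_layer, batch, self.dimension_hidden, device=device)
        c_0 = torch.zeros(self.count_layer, batch, self.dimension_hidden, device=device)
        print("h_0 on", h_0.device, "| c_0 on", c_0.device)
        return (h_0, c_0)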

I put a breakpoint into _initialise_hidden() and checked the device within pdb. It is reported as:

device(type='cuda', index=0)

In that case, add debug statements to the actual forward method, as the error is raised in model(arguments) and then apparently in trajectory_prediction_fake_relative = generator(...), which isn’t defined here.
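
Since the generator isn’t posted, the same idea applied to the Encoder.forward above would look roughly like this (a sketch with prints added, the rest unchanged):

    def forward(self, trajectory_observation):
        # Debug: compare the devices of the input, the LSTM parameters and the hidden state
        print("input device     :", trajectory_observation.device)
        print("LSTM param device:", next(self.encoder.parameters()).device)

        batch = trajectory_observation.size(1)
        trajectory_observation_embedding = self.spatial_embedding(
            trajectory_observation.reshape(-1, 2)
        ).view(-1, batch, self.dimension_embedding)

        state_tuple = self._initialise_hidden(batch)
        print("hidden device    :", state_tuple[0].device)

        _, state = self.encoder(trajectory_observation_embedding, state_tuple)
        return state[0]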

Alright, I’ve solved my issue. It was a silly mistake on my part: a wrong instantiation sequence.

More detail for the curious:

Some background: I use a singleton object to give access to things that are needed throughout the codebase, so that I don’t have to keep passing them as arguments and dirty the code. One example is the device variable: it’s needed in a lot of places, but there are multiple levels of function calls in between. It’s not good programming practice, I know.
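
The actual class has more in it, but the device handling boils down to something like this simplified sketch:

import torch

class EnvironmentTrain:
    # Simplified sketch of the singleton: every call returns the same instance,
    # so the device is configured once and can be read from anywhere.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.device = torch.device(
                "cuda:0" if torch.cuda.is_available() else "cpu"
            )
        return cls._instance

So EnvironmentTrain().device returns the same device object wherever it is called.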

My earlier requirements had me write two different singleton objects, one for training and one for testing. Obviously the testing run wasn’t (and couldn’t be) making use of the training singleton by the time it reached the generator, because the generator was hard-coded to use the training environment.

I unified the two environments, since that is now possible, and everything works fine. Thank you @ptrblck for your time.
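
For future readers: independent of how the environments are organised, a more defensive variant (not what I did, just a sketch) is to build the hidden state on the device of the incoming tensor, so it can never disagree with the input:

    def _initialise_hidden(self, batch, device):
        # Create the hidden state directly on the device of the incoming data,
        # instead of relying on a global environment object.
        return (
            torch.zeros(self.count_layer, batch, self.dimension_hidden, device=device),
            torch.zeros(self.count_layer, batch, self.dimension_hidden, device=device),
        )

    # ... and in forward():
    # state_tuple = self._initialise_hidden(batch, trajectory_observation.device)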

All the best.