GPU forward pass is ~20 times slower than CPU

I have a ModuleDict model like the one shown below:

NeuralNetwork(
  (linears): ModuleDict(
    (C): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
    (H): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
    (N): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
    (O): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
    (P): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
    (S): Sequential(
      (0): Linear(in_features=384, out_features=192, bias=True)
      (1): ReLU()
      (2): Linear(in_features=192, out_features=192, bias=True)
      (3): ReLU()
      (4): Linear(in_features=192, out_features=96, bias=True)
      (5): ReLU()
      (6): Linear(in_features=96, out_features=48, bias=True)
      (7): ReLU()
      (8): Linear(in_features=48, out_features=1, bias=True)
    )
  )
)

The forward that I use to train it looks like this:

    def forward(self, X, device=None):
        """Forward propagation

        This is forward propagation, and it returns the energy of each image.

        Parameters
        ----------
        X : dict
            Dictionary of inputs in the feature space.
        device : torch.device, optional
            Device to move a tensor to when it is not already there.

        Returns
        -------
        outputs : tensor
            A tensor with one energy per image.
        """

        outputs = []

        for hash in X:
            image = X[hash]
            atomic_energies = []

            for symbol, x in image:
                if isinstance(symbol, bytes):
                    symbol = symbol.decode("utf-8")
                try:
                    x = self.linears[symbol](x)
                except RuntimeError:
                    # The features were not on the same device as the model,
                    # so move them there and retry.
                    x = self.linears[symbol](x.to(device))

                intercept_name = "intercept_" + symbol
                slope_name = "slope_" + symbol
                slope = getattr(self, slope_name)
                intercept = getattr(self, intercept_name)

                x = (slope * x) + intercept
                atomic_energies.append(x)

            atomic_energies = torch.cat(atomic_energies)
            image_energy = torch.sum(atomic_energies)
            outputs.append(image_energy)
        outputs = torch.stack(outputs)
        return outputs
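
For reference, each value in X is an image given as an iterable of (symbol, features) pairs, which the loop above unpacks. A toy example of that structure (the hash keys here are made up, and the features are shown as flat 384-value vectors to match the first Linear layer):

    import torch

    X = {
        "hash_0": [                  # first image: three atoms
            ("C", torch.rand(384)),
            ("H", torch.rand(384)),
            ("H", torch.rand(384)),
        ],
        "hash_1": [                  # second image: two atoms
            ("O", torch.rand(384)),
            ("H", torch.rand(384)),
        ],
    }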

Running that forward() on a single CPU takes 1.74 seconds, while on a single GPU it takes 38.33 seconds. Why this difference? I have some hypotheses:

  1. The structure of X should be changed to improve efficiency and avoid the for loop.
  2. Moving tensors to the CUDA device should be avoided inside the forward() function and instead done only once, before calling forward().

I would appreciate any advice. Thanks.

Your two hypotheses are very good candidates. Keep in mind that GPUs are really good at doing a single large op but very bad at doing many small ones, so removing the for-loop will definitely help.
Moving tensors between CPU and GPU is quite expensive as well and should be avoided, yes.
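
To make the first point concrete, here is a rough sketch of what "one large op per symbol" could look like for your model: collect all atoms that share a symbol, run the corresponding Sequential once on the stacked batch, and scatter the results back into per-image sums. The name forward_batched and the grouping logic are only illustrative, and it assumes each atom's feature vector x is a flat tensor of 384 values:

    from collections import defaultdict

    import torch

    def forward_batched(self, X, device=None):
        """Sketch: run each per-symbol network once on a stacked batch
        of all atoms with that symbol, instead of once per atom."""
        per_symbol = defaultdict(list)    # symbol -> list of feature vectors
        image_of_row = defaultdict(list)  # symbol -> image index of each row

        for i, hash in enumerate(X):
            for symbol, x in X[hash]:
                if isinstance(symbol, bytes):
                    symbol = symbol.decode("utf-8")
                per_symbol[symbol].append(x)
                image_of_row[symbol].append(i)

        outputs = torch.zeros(len(X), device=device)

        for symbol, rows in per_symbol.items():
            batch = torch.stack(rows).to(device)       # (n_atoms, 384), one transfer
            energies = self.linears[symbol](batch)     # one large op per symbol
            slope = getattr(self, "slope_" + symbol)
            intercept = getattr(self, "intercept_" + symbol)
            energies = (slope * energies) + intercept  # (n_atoms, 1)
            # Add each atomic energy into the total of the image it came from.
            idx = torch.tensor(image_of_row[symbol], device=device)
            outputs.index_add_(0, idx, energies.squeeze(-1))

        return outputs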


Thanks for your reply. This will be my first GPU experiment 🙂 I will try to keep this thread updated with my findings so that other people getting started can find something helpful.


First step forward: it turns out that moving all the data to the GPU with .to(device) up front, instead of inside the forward pass, reduces the GPU time from 38.33 s to 2.74 s. Still, the CPU is slightly faster. The next step is vectorization to avoid the for loop, and I will report back.
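
For anyone else following along, the change was roughly this (model and X stand in for my actual objects; the dict comprehension just mirrors the structure that forward() expects):

    import torch

    device = torch.device("cuda")
    model = model.to(device)

    # Move every feature tensor to the GPU once, outside forward(),
    # so there are no per-atom .to(device) calls during training.
    X = {
        hash: [(symbol, x.to(device)) for symbol, x in image]
        for hash, image in X.items()
    }

    outputs = model(X, device=device)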