PyTorch vs Tensorflow Speed Difference (Transpose Conv)

I am trying to implement a simple autoencoder in PyTorch and (for comparison) in TensorFlow. After noticing some performance issues in PyTorch, I removed all the training code, but the PyTorch version still takes ~40% more runtime than the TensorFlow one. Here’s the basic setup:

    from time import time

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    # dataset, device and ConvolutionalAE are defined in the full script
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8,
                        pin_memory=True)
    iterator = iter(loader)
    model = ConvolutionalAE(384).to(device)

    # warm-up: a few forward passes before timing
    for i in range(5):
        x = next(iterator).to(device)
        model(x)

    # timed loop: prefetch the next batch while the current one is processed
    times = []
    x_prefetched = next(iterator).to(device)
    for _ in range(100):
        start = time()
        x = x_prefetched
        x_prefetched = next(iterator).to(device, non_blocking=True)
        model(x)
        times.append(time() - start)

    print(np.mean(times), '+-', np.std(times))

Some things I’ve tried:

  • The “prefetching” shown above didn’t really make a difference.
  • Loading the data into CPU memory beforehand doesn’t make a difference either, so I am fairly sure it’s GPU bound (I see ~95% GPU load).
  • When I increase the input size to 128, I get a CUDA out-of-memory error, which doesn’t happen with the TensorFlow version.

The full code can be found here for the PyTorch version and here for the TF version.

Edit: When I only run the convolutional part (without the transpose convolutions), both versions perform very similarly. So is there anything specific about the transpose convolutions in PyTorch?
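To isolate this, here is a minimal sketch that times a single convolution against a single transpose convolution on random data. The channel counts, kernel size, and input resolution are made up and don’t necessarily match the ConvolutionalAE model; torch.cuda.synchronize() is used so the clock only stops once the GPU kernels have actually finished.

    import torch
    import torch.nn as nn
    from time import time

    device = torch.device('cuda')

    # hypothetical shapes -- not taken from the real model
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
    deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1).to(device)
    x = torch.randn(64, 64, 96, 96, device=device)

    def bench(layer, x, n=100):
        # warm-up so cuDNN can pick its algorithms before timing
        for _ in range(5):
            layer(x)
        torch.cuda.synchronize()
        start = time()
        for _ in range(n):
            layer(x)
        torch.cuda.synchronize()  # wait for all kernels before stopping the clock
        return (time() - start) / n

    print('conv          :', bench(conv, x))
    print('transpose conv:', bench(deconv, x))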

Could you change the for loop to:

for data in loader:
    ...
    model(data)
    ...

and time it again?
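For completeness, a sketch of how that timed loop could look, assuming the loader, model, and device from the snippet above (note that torch.cuda.synchronize() is needed for the GPU timings to be meaningful):

    from time import time
    import numpy as np
    import torch

    times = []
    for data in loader:                  # plain DataLoader loop, no manual prefetching
        start = time()
        data = data.to(device)
        model(data)
        torch.cuda.synchronize()         # make sure the forward pass has finished
        times.append(time() - start)

    print(np.mean(times), '+-', np.std(times))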

I’m not sure about the OOM issue.


Thanks for the reply. I tried that before; I included the prefetching because it was one of the things I suspected was making the PyTorch version slow, and adding it made it slightly faster. As noted above, I now suspect the transpose convolutions are the culprit.

Maybe you can try torch.utils.bottleneck.
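For reference, torch.utils.bottleneck is run as a command-line wrapper around your script and summarizes both the Python (cProfile) and autograd profiler output; the script path below is just a placeholder:

    python -m torch.utils.bottleneck /path/to/your_script.py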
