PyTorch vs TensorFlow Speed Difference (Transpose Conv)


#1

I am trying to implement a simple autoencoder in PyTorch and (for comparison) in TensorFlow. After noticing some performance issues in PyTorch, I removed all the training code and still get ~40% more runtime for the PyTorch version. Here’s the basic timing setup:

    import numpy as np
    import torch
    from time import time
    from torch.utils.data import DataLoader

    device = torch.device('cuda')

    # dataset and ConvolutionalAE are defined in the full code (linked below)
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8,
                        pin_memory=True)
    iterator = iter(loader)
    model = ConvolutionalAE(384).to(device)

    # warm-up: a few forward passes before timing anything
    for i in range(5):
        x = next(iterator).to(device)
        model(x)

    # timed forward passes; the next batch is prefetched asynchronously
    # while the current one is being processed
    times = []
    x_prefetched = next(iterator).to(device)
    for _ in range(100):
        start = time()
        x = x_prefetched
        x_prefetched = next(iterator).to(device, non_blocking=True)
        model(x)
        times.append(time() - start)

    print(np.mean(times), '+-', np.std(times))

Some things I’ve tried:

  • The “prefetching” shown above didn’t really make a difference.
  • Pre-loading the data into CPU memory doesn’t make a difference either, so I am sure the run is GPU-bound (I see ~95% GPU load).
  • When I increase the input size to 128, I get a CUDA out-of-memory error, which doesn’t happen with the TensorFlow version.

The full code can be found here for the PyTorch version and here for the TF version.

Edit: When I run only the convolutional part (without the transpose convolutions), both versions perform very similarly. So is there anything specific about the transpose convolutions in PyTorch?
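
For reference, a minimal sketch of what I mean by “only the convolutional part” (the layer shapes here are placeholders, not my actual model):

    import torch.nn as nn

    # placeholder layer sizes, just to illustrate the encoder/decoder split
    encoder = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
        nn.ReLU(),
    ).to(device)

    decoder = nn.Sequential(
        nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
        nn.ReLU(),
        nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    ).to(device)

    z = encoder(x)    # timing only this part: both frameworks are on par
    out = decoder(z)  # with the transpose convolutions included, the gap appears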


#2

Could you change the for loop to:

    for data in loader:
        ...
        model(data)
        ...

and time it again?
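
For example, something like this (a sketch reusing the timing code from your snippet):

    times = []
    for data in loader:
        start = time()
        model(data.to(device))
        times.append(time() - start)

    print(np.mean(times), '+-', np.std(times))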

I’m not sure about the OOM issue.


#3

Thanks for the reply. I tried that before. I included the prefetching because that was one of the things I suspected was slowing down the PyTorch version; adding it made it slightly faster. As noted above, I now suspect the transpose convolutions to be the culprit.


#4

Maybe you can try torch.utils.bottleneck.
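
It runs over the whole script from the command line and prints both a cProfile summary and an autograd profiler summary (train.py below is just a placeholder for your script):

    python -m torch.utils.bottleneck train.py

That should show whether the transpose convolutions actually dominate the runtime.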