I am trying to implement a simple autoencoder in PyTorch and (for comparison) in TensorFlow. Since I noticed some performance issues in PyTorch, I removed all the training code and still get ~40% more runtime for the PyTorch version. Here's the basic setup I am timing:
```python
import numpy as np
from time import time

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8, pin_memory=True)
iterator = iter(loader)
model = ConvolutionalAE(384).to(device)

# warm-up: a few forward passes before timing
for i in range(5):
    x = next(iterator).to(device)
    model(x)

times = []
x_prefetched = next(iterator).to(device)
for _ in range(100):
    start = time()
    x = x_prefetched
    # fetch the next batch while the current forward pass runs
    x_prefetched = next(iterator).to(device, non_blocking=True)
    model(x)
    times.append(time() - start)

print(np.mean(times), '+-', np.std(times))
```
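In case the asynchronous CUDA launches skew these numbers, a synchronized variant of the measurement would look roughly like this (a sketch reusing `iterator`, `model`, and `device` from above):

```python
import numpy as np
import torch
from time import time

# Time only the forward pass, with explicit synchronization so the
# measurement includes the kernels that are launched asynchronously.
times = []
for _ in range(100):
    x = next(iterator).to(device, non_blocking=True)
    torch.cuda.synchronize()
    start = time()
    model(x)
    torch.cuda.synchronize()
    times.append(time() - start)

print(np.mean(times), '+-', np.std(times))
```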
Some things I’ve tried:
- The “prefetching” shown above didn’t really make a difference
- Pre-loading the data into CPU memory doesn't make a difference, so I'm fairly sure it's GPU-bound (I see ~95% GPU load).
- When I increase the input size to 128, I get a CUDA out-of-memory error, which doesn't happen with the TensorFlow version (see the memory-check sketch after this list).
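To compare memory usage between the two versions, I'd read out PyTorch's allocator stats roughly like this (a sketch reusing `loader`, `model`, and `device` from above; it only reports what PyTorch's caching allocator holds, not total driver memory):

```python
import torch

# Peak memory of one forward pass, as seen by PyTorch's caching allocator.
torch.cuda.reset_peak_memory_stats(device)
x = next(iter(loader)).to(device)
model(x)
torch.cuda.synchronize()

print(f"allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB")
```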
Edit: When I run only the convolutional part (without the transpose convolutions), both versions perform very similarly. So is there anything specific about transpose convolutions in PyTorch?
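For reference, a minimal sketch of how I'd benchmark a single transpose convolution in isolation (the layer shape below is made up, not my actual decoder; `device` as above):

```python
import torch
from time import time

# Hypothetical stand-in for one decoder layer, not the actual architecture.
deconv = torch.nn.ConvTranspose2d(384, 192, kernel_size=4, stride=2, padding=1).to(device)
x = torch.randn(64, 384, 12, 12, device=device)

# Warm-up, then timed runs with explicit synchronization.
for _ in range(5):
    deconv(x)
torch.cuda.synchronize()

start = time()
for _ in range(100):
    deconv(x)
torch.cuda.synchronize()
print((time() - start) / 100, 's per call')
```

(I haven't tried `torch.backends.cudnn.benchmark = True` yet; could the algorithm selection for the transpose convolutions be the difference here?)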