I am trying to implement a simple autoencoder in PyTorch and (for comparison) in TensorFlow. As I noticed some performance issues in PyTorch, I removed all the training code and still get ~40% more runtime for the PyTorch version. Here's the basic benchmarking setup:
from time import time

import numpy as np
import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8,
                    pin_memory=True)
iterator = iter(loader)
model = ConvolutionalAE(384).to(device)

# warm-up
for i in range(5):
    x = next(iterator).to(device)
    model(x)

times = []
x_prefetched = next(iterator).to(device)
for _ in range(100):
    start = time()
    x = x_prefetched
    x_prefetched = next(iterator).to(device, non_blocking=True)
    model(x)
    times.append(time() - start)
print(np.mean(times), '+-', np.std(times))
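A side note on the measurement itself: CUDA kernels launch asynchronously, so `time()` around `model(x)` can return before the GPU has actually finished the forward pass. A more reliable pattern is to call `torch.cuda.synchronize()` before reading the clock. Here is a minimal sketch with a tiny stand-in model (the real `ConvolutionalAE` is in the linked code); it falls back to plain timing on CPU:

```python
from time import time

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in for the autoencoder in the question: a tiny conv model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device)
x = torch.randn(4, 3, 32, 32, device=device)

times = []
for _ in range(10):
    if device.type == 'cuda':
        torch.cuda.synchronize()  # make sure all prior GPU work is done
    start = time()
    model(x)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for this forward pass to finish
    times.append(time() - start)

mean_ms = 1000 * sum(times) / len(times)
print(f'{mean_ms:.3f} ms per forward pass')
```

Without the synchronization, the loop mostly measures kernel launch overhead, which can make the numbers between frameworks hard to compare.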
Some things I’ve tried:
- The “prefetching” shown above didn’t really make a difference
- Preloading the data into CPU memory doesn't make a difference either, so I am sure it's GPU-bound (GPU utilization sits at ~95%).
- When I increase the input size to 128, I get a CUDA out-of-memory error, which doesn't happen with the TensorFlow version
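One thing worth checking regarding both the runtime and the OOM at larger inputs: the forward passes in the benchmark run with autograd enabled, so PyTorch keeps every intermediate activation alive for a backward pass that never happens. A small sketch of the difference (stand-in model; the effect on memory grows with input size):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
x = torch.randn(2, 3, 16, 16)

# Default: autograd records the graph and keeps activations alive.
y_grad = model(x)

# Inference-only: no graph is built, so intermediate buffers can be freed.
with torch.no_grad():
    y_nograd = model(x)

print(y_grad.requires_grad, y_nograd.requires_grad)  # True False
```

Wrapping the benchmark loop in `torch.no_grad()` would make it a fairer comparison against an inference-mode TensorFlow graph.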
The full code can be found here for the PyTorch version and here for the TF version.
Edit: When I only run the convolutional part (without the transpose convolutions), both versions perform very similarly. Is there something specific about transpose convolutions in PyTorch that could explain this?
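To narrow it down further, the two layer types can be benchmarked in isolation. A minimal sketch (the kernel size, stride, and channel counts here are hypothetical, not taken from the linked model): a stride-2 `ConvTranspose2d` with matching parameters is roughly the mirror of a stride-2 `Conv2d`, so timing each one separately on the same shapes should show whether the transpose path is the bottleneck.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

down = nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1)
up = nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1)

y = down(x)  # downsamples 32x32 -> 16x16
z = up(y)    # upsamples back to 32x32
print(y.shape, z.shape)
```

On GPU, each call would be wrapped with `torch.cuda.synchronize()` before and after the timed region, as in the timing sketch above.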