ResNet-50, FP32, PyTorch inference: ~6.5 ms on an RTX 3060 with a 64x64x3 input.
ResNet-50, FP32, PyTorch inference: ~6.5 ms on an RTX 3060 with a 224x224x3 input.
When I benchmark VGG16 at 224x224x3 I also get the same speed, even though VGG16 is roughly 4-5x larger.
Why are they all the same?
Your workload might be CPU-limited, e.g. caused by a generally slow CPU, heavy CPU-side processing, data loading, etc., which would starve your GPU. Without any details it's pure speculation.
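One quick way to test the CPU-bottleneck hypothesis is to vary the batch size: if the measured time per forward pass barely changes when the batch grows, the GPU is underutilized and per-launch CPU overhead dominates. A minimal sketch, assuming PyTorch is installed; it uses a small stand-in model (substitute resnet50/vgg16 from torchvision in practice) and falls back to CPU when no GPU is present:

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in model; the same check applies to resnet50 / vgg16.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).to(device).eval()

def timed_run(batch_size, iters=20):
    x = torch.randn(batch_size, 3, 64, 64, device=device)
    with torch.no_grad():
        for _ in range(10):                     # warmup
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()            # wait before reading the clock
    return (time.time() - start) / iters

t1 = timed_run(1)
t8 = timed_run(8)
# If t8 is close to t1, launch overhead dominates and the GPU is starved.
print(f"batch 1: {t1*1000:.2f} ms, batch 8: {t8*1000:.2f} ms")
```

If batch 8 takes nearly the same time as batch 1, the bottleneck is on the CPU side, which would also explain why ResNet-50 and VGG16 report the same number.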
import time
import numpy as np
import torch

def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=1000):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype == 'fp16':
        input_data = input_data.half()
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns + 1):
            start_time = time.time()
            output = model(input_data)
            torch.cuda.synchronize()  # wait for the GPU before reading the clock
            end_time = time.time()
            timings.append(end_time - start_time)
            if i % 100 == 0:
                print('Iteration %d/%d, avg batch time %.2f ms' % (i, nruns, np.mean(timings) * 1000))
    print("Input shape:", input_data.size())
    print("Output shape:", output.shape)
    print('Average batch time: %.2f ms' % (np.mean(timings) * 1000))
This is the code I am using for benchmarking.
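As an aside, `time.time()` plus `torch.cuda.synchronize()` works, but each sample also includes Python-side overhead. CUDA events record timestamps on the GPU's own timeline, which can give a cleaner per-batch number. A sketch, assuming a CUDA-capable setup (the function name and structure are illustrative, not from the original post):

```python
import torch

def gpu_time_ms(model, x, iters=100, nwarmup=10):
    """Average forward-pass time in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(nwarmup):        # warmup
            model(x)
        torch.cuda.synchronize()
        start.record()                  # timestamp on the GPU stream
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()        # wait until both events are recorded
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds
```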
Use a visual profiler to compare the timelines of the runs: check the kernel execution times, the kernel launches, and any CPU bottlenecks between them.
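Besides external tools such as Nsight Systems, the built-in `torch.profiler` can produce both a per-operator summary and a Chrome trace you can inspect visually. A minimal sketch with a stand-in model (swap in your resnet50/vgg16) and a CPU fallback:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Stand-in model; use the actual model you are benchmarking.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten()).to(device).eval()
x = torch.randn(1, 3, 64, 64, device=device)

with profile(activities=activities) as prof:
    with torch.no_grad():
        for _ in range(5):
            model(x)

summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(summary)
# prof.export_chrome_trace("trace.json")  # view in chrome://tracing or Perfetto
```

Gaps between kernels in the trace, or large self-CPU times relative to CUDA times, point to the CPU-side starvation suspected above.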