PyTorch performance

wjaskowski · May 16, 2017, 7:41am

I’ve recently been benchmarking PyTorch, Theano and TensorFlow against each other. Here is what I found:

For small conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=2 512 16, bs=128), all frameworks have roughly the same performance (±20%). PyTorch usually has the quickest forward pass and roughly equal backprop.
For larger conv nets (e.g., 96x96, f=64;k=3;s=1 f=128;k=3;s=2 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 f=256;k=3;s=1 512 512 16, bs=128), TensorFlow is quicker on the forward pass (ca. 10-30%) and much quicker (up to 80%) on backprop.

I checked this on Python 3.6, CUDA 8.0, cuDNN 5.1, Ubuntu 16.04 with both a Titan X and a 1080 Ti.

Has anybody had a similar experience?

smth · May 20, 2017, 6:53pm

For larger convnets, set the flag torch.backends.cudnn.benchmark = True, which helps. For example:

github.com/pytorch/examples/blob/master/imagenet/main.py#L95


        model = torch.nn.DataParallel(model).cuda()
else:
    model.cuda()
    model = torch.nn.parallel.DistributedDataParallel(model)


# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda()


optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)


# optionally resume from a checkpoint
if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
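
Since the excerpt above does not show the flag itself, here is a minimal sketch of where it would typically be set: once, before the model runs. The layer sizes loosely follow the small conv net from the first post; the 3 input channels, the batch size of 8 and the names `device`, `model` and `x` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Ask cuDNN to time its convolution algorithms for each new input shape
# and reuse the fastest one afterwards. This only matters when running
# on a GPU with cuDNN; on CPU the flag is simply ignored.
torch.backends.cudnn.benchmark = True

# Fall back to CPU so the sketch still runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small conv net (3 input channels is an assumption).
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2),
    nn.ReLU(),
).to(device)

# Fixed input shape: the case where benchmark mode is expected to help.
x = torch.randn(8, 3, 96, 96, device=device)
with torch.no_grad():
    out = model(x)
out_shape = tuple(out.shape)
```

The flag is global, so it applies to every cuDNN convolution in the process, not just to this model.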

wjaskowski · May 21, 2017, 12:21pm

Thank you! This improved the performance significantly. Now PyTorch is on par with TensorFlow (at most 15% slower for some models).

heilaw · May 25, 2017, 7:34pm

Is there any reason that the default value of torch.backends.cudnn.benchmark is False instead of True?

wjaskowski · May 25, 2017, 8:01pm

It takes more memory and requires a benchmark phase, which can be costly if you change the computation graph often.

heilaw · May 25, 2017, 8:11pm

However, the computation graph is built dynamically anyway in PyTorch. Why would changing the computation graph often cause a problem?

fmassa · May 25, 2017, 8:16pm

In benchmark mode, for each input size, cuDNN will run a set of computations to find the fastest algorithm for that specific case, and cache the result. This brings some overhead, and if your input dimensions change all the time, using benchmark will actually slow things down because of this overhead.
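
To make the caching behaviour concrete, here is a toy model in plain Python. It is not how cuDNN is implemented; `ToyAutotuner`, `algo_a` and `algo_b` are hypothetical names, and the "algorithms" are trivial stand-ins. It only illustrates the idea that every new input shape triggers a timing pass, while a repeated shape hits the cache.

```python
import time


def algo_a(shape):
    # Stand-in "algorithm": slow loop-based sum over the input size.
    return sum(range(shape[0] * shape[1]))


def algo_b(shape):
    # Stand-in "algorithm": closed form giving the same result.
    n = shape[0] * shape[1]
    return n * (n - 1) // 2


class ToyAutotuner:
    """Toy per-shape algorithm cache (an illustration, not cuDNN code)."""

    def __init__(self, candidates):
        self.candidates = candidates  # name -> callable(shape)
        self.cache = {}               # shape -> name of fastest candidate
        self.misses = 0               # number of timing passes performed

    def run(self, shape):
        if shape not in self.cache:
            # Cache miss: time every candidate once and remember the winner.
            self.misses += 1
            timings = {}
            for name, fn in self.candidates.items():
                start = time.perf_counter()
                fn(shape)
                timings[name] = time.perf_counter() - start
            self.cache[shape] = min(timings, key=timings.get)
        # Run the cached winner for this shape.
        return self.candidates[self.cache[shape]](shape)


candidates = {"a": algo_a, "b": algo_b}

# Same shape every time: one timing pass, then cache hits.
fixed = ToyAutotuner(candidates)
for _ in range(100):
    fixed.run((96, 96))

# A different shape every time: one timing pass per call.
varying = ToyAutotuner(candidates)
for i in range(100):
    varying.run((96 + i, 96))

print(fixed.misses, varying.misses)
```

In the fixed-shape loop the timing cost is paid once and amortized over all later calls; in the varying-shape loop it is paid on every call, which is the slowdown described above.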

wjaskowski · May 26, 2017, 9:33am

Wouldn't it be better to set benchmark=True by default and heuristically turn it off if there are too many cache misses?

fmassa · May 26, 2017, 11:19am

Not sure it would be better to come up with some heuristics. Maybe just better document the benchmark option?
But this is something that might change in the future, as for the moment PyTorch doesn’t give a way to choose which algorithms to use with cuDNN.

Nick_Brandaleone · June 16, 2017, 5:04pm

Are there any benchmarks comparing Torch and PyTorch? I am curious whether the performance is the same, or what sort of differences are inherent in the two platforms.

Thanks,

Nick (new Torch/PyTorch fan!)

smth · June 22, 2017, 4:51am

speed is the same, memory usage is much lower.

suke · January 11, 2018, 10:20am

When I set cudnn.benchmark = True in my test program, it fails with "RuntimeError: CUDNN_STATUS_INTERNAL_ERROR".
How can I resolve this problem?