Low GPU utilization when training an ensemble

I would like to train an ensemble of networks that all share the same architecture. Currently, I'm defining it this way:

import torch
import torch.nn as nn

def make_net(network_specs):
    ...  # builds and returns an nn.Module from network_specs

class Ensemble(nn.Module):
    def __init__(self, network_specs, ensemble_size):
        super().__init__()
        self.model = nn.ModuleList([make_net(network_specs) for _ in range(ensemble_size)])

    def forward(self, x):
        # x has shape (ensemble_size, batch, ...); member i processes x[i]
        return torch.stack([net(xi) for net, xi in zip(self.model, x)])

However, the backward pass through the stack operation doesn't seem to be parallelized across ensemble members (on the same GPU). GPU utilization is very low, around 15%.

Thinking that the dynamic graph might be the cause (similar thread here), I recently tried TorchScript (torch.jit). Performance still doesn't improve.

What am I doing wrong here? How can I improve the performance of my ensemble model?

Thanks.

Hi,

All the operations that run on the GPU are asynchronous. So if GPU usage is very low, it's most likely because your networks are not big enough to saturate the GPU.

If your network is big enough, you can try using larger batches, and check CPU and disk usage: maybe you are doing some data manipulation inside the training loop and that is the bottleneck.
Also, if you don't move a big enough chunk of data to the GPU with .cuda() before training, the transfers themselves can become the bottleneck, and CPU usage will be fairly high.
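For example, a quick way to rule host-to-device transfer out as the bottleneck is to use pinned memory and move each batch to the GPU once, before the actual training computation. This is only a sketch; `dataset`, `model`, and `train_step` are placeholders for your own code:

```python
import torch

# Pinned host memory lets the H2D copy run asynchronously;
# worker processes keep data preparation off the training thread.
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, pin_memory=True, num_workers=4
)

for x, y in loader:
    x = x.cuda(non_blocking=True)  # async copy thanks to pinned memory
    y = y.cuda(non_blocking=True)
    train_step(model, x, y)        # the heavy work should now be GPU-side
```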

Hi,

If the network is small, we would expect GPU utilization to increase as we grow the ensemble size. However, it doesn't: utilization stays flat as we scale the ensemble, and the total run time increases almost linearly with ensemble size.

Note that this is a reinforcement learning task (on simple environments), so data processing/transfer is not a bottleneck.

Here is the visualization of my network when the ensemble size is 4.

And here’s our profiling result for different ensemble sizes.

time_forward, time_backward: accumulated over 50 runs each.
Going from bootstrap size 4 to 128, both forward and backward time grow by roughly a factor of 28-30 (time_B128 / time_B4 ~ 28-30).

Bootstrap_size: 4:
-----------------------------------------
time_backward 0.07882976531982422
mean_time_backward 0.0015765953063964844
time_forward 0.05576205253601074
mean_time_forward 0.0011152410507202148
time_backward 0.07231521606445312
mean_time_backward 0.0014463043212890624
time_forward 0.05587363243103027
mean_time_forward 0.0011174726486206056
time_backward 0.07005977630615234
mean_time_backward 0.0014011955261230469
time_forward 0.05555391311645508
mean_time_forward 0.0011110782623291015
time_backward 0.07131695747375488
mean_time_backward 0.0014263391494750977
time_forward 0.055143117904663086
mean_time_forward 0.0011028623580932617
time_backward 0.06970882415771484
mean_time_backward 0.001394176483154297
time_forward 0.05509185791015625
mean_time_forward 0.001101837158203125
time_backward 0.0810239315032959
mean_time_backward 0.001620478630065918
time_forward 0.05518746376037598
mean_time_forward 0.0011037492752075195
time_backward 0.07718276977539062
mean_time_backward 0.0015436553955078126
time_forward 0.05403590202331543
mean_time_forward 0.0010807180404663085

Bootstrap_size: 32:
-----------------------------------------
time_backward 0.48969507217407227
mean_time_backward 0.009793901443481445
time_forward 0.4311997890472412
mean_time_forward 0.008623995780944825
time_backward 0.4772953987121582
mean_time_backward 0.009545907974243165
time_forward 0.516700029373169
mean_time_forward 0.01033400058746338
time_backward 0.4743640422821045
mean_time_backward 0.00948728084564209
time_forward 0.5470066070556641
mean_time_forward 0.01094013214111328
time_backward 0.5156633853912354
mean_time_backward 0.010313267707824708
time_forward 0.5515599250793457
mean_time_forward 0.011031198501586913
time_backward 0.48656153678894043
mean_time_backward 0.009731230735778808
time_forward 0.5587642192840576
mean_time_forward 0.011175284385681153
time_backward 0.48267650604248047
mean_time_backward 0.009653530120849609
time_forward 0.549140214920044
mean_time_forward 0.01098280429840088
time_backward 0.493422269821167
mean_time_backward 0.00986844539642334
time_forward 0.546377420425415
mean_time_forward 0.0109275484085083

Bootstrap_size: 128:
-----------------------------------------
time_backward 2.0336191654205322
mean_time_backward 0.040672383308410644
time_forward 2.0258209705352783
mean_time_forward 0.040516419410705565
time_backward 2.0157926082611084
mean_time_backward 0.04031585216522217
time_forward 1.716789960861206
mean_time_forward 0.03433579921722412
time_backward 1.9942104816436768
mean_time_backward 0.039884209632873535
time_forward 1.6753108501434326
mean_time_forward 0.033506217002868655
time_backward 2.0784974098205566
mean_time_backward 0.04156994819641113
time_forward 1.6769888401031494
mean_time_forward 0.033539776802062986
time_backward 1.9966001510620117
mean_time_backward 0.03993200302124023
time_forward 1.6629443168640137
mean_time_forward 0.033258886337280275
time_backward 1.9680683612823486
mean_time_backward 0.039361367225646975
time_forward 1.679962158203125
mean_time_forward 0.0335992431640625
time_backward 2.00929856300354
mean_time_backward 0.0401859712600708
time_forward 1.664689302444458
mean_time_forward 0.03329378604888916
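Since GPU kernels are launched asynchronously (as noted above), wall-clock timings are only meaningful if the code synchronizes before reading the clock. A simplified sketch of how timings like the ones above can be collected; the loss function and the assumption that `model`, `x`, and `target` already live on the GPU are placeholders, not the actual training setup:

```python
import time
import torch

def time_forward_backward(model, x, target, n_runs=50):
    # Accumulate forward and backward wall-clock time over n_runs passes.
    loss_fn = torch.nn.MSELoss()
    time_forward, time_backward = 0.0, 0.0
    for _ in range(n_runs):
        torch.cuda.synchronize()
        t0 = time.time()
        out = model(x)
        torch.cuda.synchronize()          # wait for the forward kernels to finish
        time_forward += time.time() - t0

        loss = loss_fn(out, target)
        model.zero_grad()
        torch.cuda.synchronize()
        t0 = time.time()
        loss.backward()
        torch.cuda.synchronize()          # wait for the backward kernels to finish
        time_backward += time.time() - t0

    print("time_forward", time_forward)
    print("mean_time_forward", time_forward / n_runs)
    print("time_backward", time_backward)
    print("mean_time_backward", time_backward / n_runs)
```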

Hi,

The thing with GPUs is that they are very good at doing one very parallel task, but not many small parallel tasks :confused:
You can use the NVIDIA Visual Profiler (nvvp) if you want to look in more detail at how your code runs on the GPU. But you're most certainly going to see "low core usage" if your tasks are small.
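For an ensemble of small networks on a single GPU, the usual workaround is to fuse all members into a few large batched kernels instead of looping over `ensemble_size` small modules. Below is a rough sketch of the idea using a batched linear layer; it is an illustration only, not the original `make_net` architecture, and the shapes and initialization are placeholders:

```python
import torch
import torch.nn as nn

class BatchedLinear(nn.Module):
    """Applies ensemble_size independent linear layers in a single bmm call.
    Input:  (ensemble_size, batch, in_features)
    Output: (ensemble_size, batch, out_features)"""
    def __init__(self, ensemble_size, in_features, out_features):
        super().__init__()
        # Naive init for illustration; use a proper scheme in practice.
        self.weight = nn.Parameter(torch.randn(ensemble_size, in_features, out_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(ensemble_size, 1, out_features))

    def forward(self, x):
        return torch.bmm(x, self.weight) + self.bias

class BatchedMLP(nn.Module):
    """A whole ensemble of 2-layer MLPs expressed as two batched matmuls."""
    def __init__(self, ensemble_size, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            BatchedLinear(ensemble_size, in_dim, hidden_dim),
            nn.ReLU(),
            BatchedLinear(ensemble_size, hidden_dim, out_dim),
        )

    def forward(self, x):
        # x: (ensemble_size, batch, in_dim) -> (ensemble_size, batch, out_dim)
        return self.net(x)
```

With this layout the whole ensemble forward pass is a handful of `bmm` calls whose batch dimension is the ensemble, so run time should grow sublinearly with ensemble size until the GPU saturates. On recent PyTorch versions a similar effect can be obtained without rewriting layers, using `torch.vmap` together with `torch.func.stack_module_state`.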


Hey Alban, is it also the same for a single GPU? Let's say I have a for loop over the ensemble elements, each of them computing f: x -> y. Are they going to run asynchronously even if they run on the same GPU?