Model loading times

Does the performance of loading a model here seem correct?

The initial call to load a 26MB model on line 86 is ~100 times slower than loading the 26MB checkpoint on line 87, and 5.45 seconds is a lot longer than I would expect.

Total time: 5.4545 s
Function: load_model at line 81

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    81                                            @profile
    82                                            def load_model(dirname, device):
    82                                            """ load model from disk """
    83         1         90.0      90.0     0.0       device = torch.device(device)
    84         1         26.0      26.0     0.0       modelfile = os.path.join(dirname, 'model.py')
    85         1          7.0       7.0     0.0       weights = os.path.join(dirname, 'weights_%s.tar' % weights)
    86         1    5379756.0 5379756.0     98.6      model = torch.load(modelfile, map_location=device)
    87         1      72966.0   72966.0     1.3       model.load_state_dict(torch.load(weights, map_location=device))
    88         1       1648.0    1648.0     0.0       model.eval()
    89         1          1.0       1.0     0.0       return model

I’m using PyTorch 1.2 with device="cuda" and have confirmed the same performance with 1.3.1. Adding in calls to synchronize shows that all of the time is actually spent in torch.load.

    86         1    5395545.0 5395545.0     98.6      model = torch.load(modelfile, map_location=device)
    87         1         76.0      76.0      0.0      torch.cuda.synchronize(device=device)
    88         1      72403.0   72403.0      1.3      model.load_state_dict(torch.load(weights, map_location=device))
    89         1         52.0      52.0      0.0      torch.cuda.synchronize(device=device)
    90         1       1640.0    1640.0      0.0      model.eval()
    91         1         21.0      21.0      0.0      torch.cuda.synchronize(device=device)

Am I doing something wrong here or is this typical?

I never use the first approach, as it might break in several ways.
That being said, it could be slower due to some additional parsing, I guess. As far as I know, the file structure has to be identical if you load the model directly, which points to some Python parsing (but it’s an assumption).
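
For context, here is a minimal sketch of the two approaches being compared. The Net class and the file names are placeholders, not the actual model from the profile above:

import torch
import torch.nn as nn

class Net(nn.Module):
    """Tiny stand-in module, only to illustrate the two loading styles."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Approach 1: pickle the whole module (what torch.load(modelfile) does above).
# Loading this file depends on the original class/file structure being importable.
torch.save(Net(), "model_full.pt")
model_a = torch.load("model_full.pt", map_location=device)

# Approach 2 (recommended): save only the state_dict, rebuild the module in code,
# move it to the target device and then load the weights.
torch.save(Net().state_dict(), "weights.tar")
model_b = Net().to(device)
model_b.load_state_dict(torch.load("weights.tar", map_location=device))
model_b.eval()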

Thanks @ptrblck, instantiating the model directly and calling model.to(device) takes a similar amount of time though?

Total time: 5.33244 s
Function: load_model at line 85

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    85                                           @profile
    86                                           def load_model(dirname, device, weights=None):
    94         1         17.0     17.0      0.0      weights = os.path.join(dirname, 'weights_%s.tar' % weights)
    95         1         81.0     81.0      0.0      device = torch.device(device)
    96                                           
    97                                               #modelfile = os.path.join(dirname, 'model.py')
    98                                               #model = torch.load(modelfile, map_location=device)
    99                                           
   100         1          8.0      8.0      0.0      config = os.path.join(dirname, 'config.toml')
   101         1     105144.0 105144.0      2.0      model = Model(toml.load(config))
   102         1    5154339.0 5154339.0     96.7      model.to(device)
   103                                           
   104         1      71250.0  71250.0      1.3      model.load_state_dict(torch.load(weights, map_location=device))
   105         1       1600.0   1600.0      0.0      model.eval()
   106         1          1.0      1.0      0.0      return model

Okay, so model loading is a red herring. It’s always the first copy to the GPU that is very expensive, regardless of size.

import time
import torch

t0 = time.perf_counter()
torch.tensor((10), device="cuda")
time.perf_counter() - t0
> 5.2702017779956805
t0 = time.perf_counter()
torch.tensor((10), device="cuda")
time.perf_counter() - t0
> 0.00011218398867640644
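
For completeness, a synchronized version of the same measurement would look roughly like this (just a sketch, assuming a CUDA device is available):

import time
import torch

t0 = time.perf_counter()
x = torch.tensor(10, device="cuda")   # first CUDA op: pays context init plus the copy
torch.cuda.synchronize()              # wait for all queued GPU work to finish
print(time.perf_counter() - t0)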

Is this initial cost unavoidable?

Timing without synchronization might give you wrong answers, but you mentioned you are already synchronizing the code.

That being said, the CUDA context has to be initialized by the first CUDA operation, which cannot be avoided.
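
If that one-time cost matters for perceived load time, one workaround (a sketch; warm_up_cuda is a made-up helper) is to trigger the context creation once at program startup, so the model loading itself no longer pays for it:

import torch

def warm_up_cuda(device="cuda"):
    """Pay the one-time CUDA context initialization cost up front."""
    if torch.cuda.is_available():
        torch.zeros(1, device=device)           # first CUDA op triggers context creation
        torch.cuda.synchronize(device=device)   # make sure initialization has finished

warm_up_cuda()
# Later calls such as torch.load(..., map_location="cuda") or model.to("cuda")
# then only pay for the actual copy, as in the second timing above.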