One thing to note is that you will need a torch.cuda.synchronize() before calling time.time() to make sure all pending CUDA kernels in the stream have finished. You can also use torch.cuda.Event's elapsed_time to measure. See discussion here.
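For illustration, here is a minimal sketch of both approaches (the model and input are placeholders):

```python
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
x = torch.randn(64, 1024, device="cuda")    # placeholder input

# Approach 1: wall-clock timing with explicit synchronization
torch.cuda.synchronize()          # wait for all pending kernels before starting the clock
start = time.time()
y = model(x)
torch.cuda.synchronize()          # wait for the forward pass to actually finish
print(f"time.time(): {(time.time() - start) * 1000:.3f} ms")

# Approach 2: CUDA events measure elapsed time on the GPU itself
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
y = model(x)
end_event.record()
torch.cuda.synchronize()          # both events must have completed before reading elapsed_time
print(f"elapsed_time: {start_event.elapsed_time(end_event):.3f} ms")  # returns milliseconds
```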
If you are looking for the most performant solution, DistributedDataParallel should be the way to go. [example]
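As a rough sketch of what a DistributedDataParallel setup looks like (single node, launched with torchrun; the model and data here are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(32, 1024, device=local_rank)        # placeholder batch
    target = torch.randn(32, 10, device=local_rank)

    loss = torch.nn.functional.mse_loss(ddp_model(x), target)
    loss.backward()   # gradients are all-reduced across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # run with: torchrun --nproc_per_node=NUM_GPUS script.py
```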
When I uncomment the line self.tools.SomeFunc(self.model, self.features) and set the flag to True, I receive the following error:
Looks like self.model is a DataParallel instance? If so, DataParallel does not have the first_term attribute. If this attribute is on the model instance you passed to DataParallel, you can access the original model instance through self.model.module (see DataParallel code here), which should have the first_term attribute.
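A minimal illustration of the wrapping, assuming a model with a first_term attribute standing in for yours:

```python
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.first_term = 1.0  # hypothetical attribute standing in for yours

    def forward(self, x):
        return self.linear(x)

model = torch.nn.DataParallel(MyModel())

# model.first_term  -> AttributeError: 'DataParallel' object has no attribute 'first_term'
print(model.module.first_term)  # .module is the original MyModel instance
```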