Training forward pass and loss calculation time per batch differs for the same model in two repositories

The two repositories are almost identical (as far as I have been able to tell), yet there is a stark difference in running time (measured with time.time()): in one repo a batch takes around 3 seconds, in the other around 15 seconds. The model, loss function, and dataset are the same, and I checked that the model, loss, and data are on the GPU in the slow case. The only difference I can see is in how the loss and model are registered as class attributes via self.add_module("_loss", loss). I have added the code below. Any idea why I see this difference?

This is how the faster model is implemented:

class ModelAndLoss(nn.Module):
    def __init__(self, args, model, loss):
        super(ModelAndLoss, self).__init__()
        self.add_module("_model", model)
        self.add_module("_loss", loss)
        self._parallel = False
        if 'parallel' in args.device:
            self._parallel = True

    @property
    def loss(self):
        return self._loss

    @property
    def model(self):
        return self._model

    def forward(self, example_dict):
        # -------------------------------------
        # Run forward pass
        # -------------------------------------
        output_dict = self._model(example_dict)

        # -------------------------------------
        # Compute losses
        # -------------------------------------
        loss_dict = self._loss(output_dict, example_dict)

        # -------------------------------------
        # Return losses and outputs
        # -------------------------------------
        return loss_dict, output_dict


def configure_model_and_loss(args, is_coreset_only=False):
    # ----------------------------------------------------
    # Dynamically load model and loss class with parameters
    # passed in via "--model_[param]=[value]" or "--loss_[param]=[value]" arguments
    # ----------------------------------------------------
    log_addition = " [Coreset Only]" if is_coreset_only else ""
    with logger.LoggingBlock(f"Model and Loss{log_addition}", emph=True):

        # ----------------------------------------------------
        # Model
        # ----------------------------------------------------
        kwargs = typeinf.kwargs_from_args(args, "model")
        kwargs["args"] = args
        model = typeinf.instance_from_kwargs(args.model_class, kwargs)

        # ----------------------------------------------------
        # Training loss
        # ----------------------------------------------------
        loss = None
        if args.loss is not None:
            kwargs = typeinf.kwargs_from_args(args, "loss")
            kwargs["args"] = args
            loss = typeinf.instance_from_kwargs(args.loss_class, kwargs)

        # ----------------------------------------------------
        # Model and loss
        # ----------------------------------------------------
        model_and_loss = ModelAndLoss(args, model, loss)

    return model_and_loss

Did you synchronize the code before starting and stopping the host timers on both machines?
If not, your profile output would be invalid since CUDA operations are executed asynchronously.
In the best case your timers would only capture the dispatching and kernel launches; in the worst case they would include an arbitrary subset of previously queued operations.
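To see the effect in isolation, here is a small standalone sketch (the matmul and its shape are arbitrary, just something that keeps the GPU busy); the host-side timer returns long before the GPU work is actually finished:

import time

import torch

x = torch.randn(8192, 8192, device="cuda")

start = time.time()
y = x @ x                                 # the kernel is only launched here, not finished
launch_time = time.time() - start

torch.cuda.synchronize()                  # block until the matmul has actually completed
total_time = time.time() - start

print(f"launch: {launch_time * 1000:.2f} ms, actual: {total_time * 1000:.2f} ms")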

Thank you for your reply. Can you explain the synchronization a bit further? How should it be done?
What I did was take time.time() before and after the forward pass, and before and after the loss calculation, in both repositories, and compute the elapsed time from the difference.
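Roughly, the pattern looked like this (a simplified sketch; model_and_loss and example_dict stand in for the actual objects in my training loop):

import time

start = time.time()
output_dict = model_and_loss.model(example_dict)             # forward pass
forward_time = time.time() - start

start = time.time()
loss_dict = model_and_loss.loss(output_dict, example_dict)   # loss computation
loss_time = time.time() - start

print(f"forward: {forward_time:.3f} s, loss: {loss_time:.3f} s")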

You would need to call torch.cuda.synchronize() before starting and stopping the host timers. Alternatively, use CUDA events, or use the torch.utils.benchmark utilities, which also add warmup iterations and the needed synchronizations for you.
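For illustration, a minimal sketch of all three options (model_and_loss and example_dict stand in for your own objects; warmup iterations are omitted for brevity):

import time

import torch
import torch.utils.benchmark as benchmark

# Option 1: explicit synchronization around a host-side timer
torch.cuda.synchronize()                  # finish all previously queued GPU work
start = time.time()
loss_dict, output_dict = model_and_loss(example_dict)
torch.cuda.synchronize()                  # wait until the forward/loss kernels are done
print(f"forward + loss: {time.time() - start:.3f} s")

# Option 2: CUDA events, timed on the device itself
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
loss_dict, output_dict = model_and_loss(example_dict)
end_event.record()
torch.cuda.synchronize()                  # make sure end_event has actually been recorded
print(f"forward + loss: {start_event.elapsed_time(end_event):.1f} ms")

# Option 3: torch.utils.benchmark, which adds warmup and synchronization for you
timer = benchmark.Timer(
    stmt="model_and_loss(example_dict)",
    globals={"model_and_loss": model_and_loss, "example_dict": example_dict},
)
print(timer.timeit(20))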