Soft Ensembling in PyTorch using GPU

What is an efficient way to do soft ensembling during inference in PyTorch? Assume the models are loaded on the CPU in a list named models, and I want to speed things up with the GPU.
This is my naive implementation. I guess it is not very efficient to move each model on and off the GPU for every batch, and it does not work anyway: even after I execute model = model.cpu() and torch.cuda.empty_cache(), the model still seems to occupy GPU memory, and after a few iterations the code runs out of memory.

        for i, batch in enumerate(dev_iter):
            pred_logits = None
            for model in models:
                model = model.cuda()
                pred_logits_per_model = model(batch)
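                # Sum the softmax probabilities of all models (soft ensembling)
                probs = F.softmax(pred_logits_per_model, -1)
                if pred_logits is None:
                    pred_logits = probs
                else:
                    pred_logits = pred_logits + probs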

                model = model.cpu()

            loss = lossfunction(pred_logits, batch.stance_label)
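
For comparison, here is a minimal sketch of one way the loop could be structured. It assumes all models fit in GPU memory at the same time, so each is moved to the device once instead of per batch. The forward passes run under torch.no_grad(), since without it autograd keeps each model's activations alive for a backward pass that never happens during inference. The helper soft_ensemble, the Linear stand-in models and the random batch are hypothetical placeholders, and the sketch assumes each model takes a plain tensor rather than the batch object used above.

```python
import torch
import torch.nn.functional as F


def soft_ensemble(models, inputs, device):
    """Return the mean of the models' softmax outputs for one batch."""
    inputs = inputs.to(device)
    prob_sum = None
    with torch.no_grad():  # inference only: no autograd graph is stored
        for model in models:
            probs = F.softmax(model(inputs), dim=-1)
            prob_sum = probs if prob_sum is None else prob_sum + probs
    return prob_sum / len(models)


# Hypothetical stand-ins for the real models and one batch of inputs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move every model to the device once and switch it to evaluation mode.
models = [torch.nn.Linear(8, 3).to(device).eval() for _ in range(4)]
batch = torch.randn(5, 8)

avg_probs = soft_ensemble(models, batch, device)
```

If the models do not all fit on the GPU together, the same function can still be used while moving models on and off the device around each call; whether that actually frees enough memory in a given setup is not something this sketch verifies.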