What is the correct way to use a Dataset (DS) and DataLoader (DL) in predict?

I have some code for a predict method (shown below) that I got to work after some trial and error. Is there some way to simplify it? I run the prediction in batches because otherwise it uses too much memory. Still, I do not understand why I need the innermost loop. I tried replacing it with batch_preds = self.model.forward(xb), but that fails. Why? Is there a better way to do this? Here is the code:

def predict(self, X):
    if self.gpuid is not None:
        device = torch.device(f"cuda:{self.gpuid}")
    else:
        device = torch.device("cuda")
    self.model.to(device)
    self.model.eval()
    with torch.no_grad():
        X = torch.tensor(X).float().to(device)
        predict_ds = TensorDataset(X)
        if self.predict_by_batch:
            predict_dl = DataLoader(predict_ds, batch_size=self.batch_size)
            preds = []
            for xb in predict_dl:
                for x in xb:
                    batch_preds = self.model.forward(x)
                    batch_preds = batch_preds.to('cpu')
                    preds.extend(list(batch_preds.numpy()))
            preds = np.asarray(preds)
        else:
            preds = self.model.forward(X).to('cpu').numpy()
    return np.squeeze(preds)

I should add that it seems to run slowly.

I am really new to PyTorch and did not know where to start. For my problem, all of the data fits into memory, so simply indexing into the data to get each batch worked. That runs very fast.
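For reference, here is a minimal sketch of what batching by simple indexing might look like. The function name predict_in_slices, the nn.Linear stand-in model, and the random input are all made up for illustration and are not from the code above. It assumes the model accepts a single float tensor of shape (batch, features).

```python
import numpy as np
import torch
from torch import nn


def predict_in_slices(model, X, batch_size=256, device="cpu"):
    """Run `model` over X in contiguous slices and return a NumPy array.

    Hypothetical helper: assumes X is non-empty and that the model
    takes one float tensor of shape (batch, features).
    """
    model.to(device)
    model.eval()
    # Keep the full array on the CPU; only one slice at a time goes to `device`.
    X = torch.as_tensor(X, dtype=torch.float32)
    outputs = []
    with torch.no_grad():
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size].to(device)
            outputs.append(model(xb).cpu())
    return torch.cat(outputs).numpy()


# Stand-in model and data, for illustration only.
model = nn.Linear(4, 1)
X = np.random.rand(10, 4)
preds = predict_in_slices(model, X, batch_size=3)
```

Unlike the predict method above, this moves each slice to the device separately rather than the whole of X at once, which is what actually limits device memory use.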

The inner loop should not be necessary. What kind of error are you seeing?
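One likely cause, assuming the model expects a single tensor: a DataLoader over a TensorDataset yields each batch as a sequence with one entry per tensor in the dataset, even when there is only one tensor. So xb is a one-element list, and self.model.forward(xb) receives a list rather than a tensor. The inner loop in the posted code iterates over that one-element list, so x there is already the whole batch, not a single sample. A small sketch of unpacking the batch instead (the nn.Linear model and random data are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.rand(10, 4)
loader = DataLoader(TensorDataset(X), batch_size=4)

# Each batch is a sequence holding one tensor, not a bare tensor.
first = next(iter(loader))
first_shape = tuple(first[0].shape)

# Placeholder model, for illustration only.
model = nn.Linear(4, 1)
model.eval()

chunks = []
with torch.no_grad():
    for (xb,) in loader:  # unpack the one-element batch
        # Calling model(xb) runs forward() and any registered hooks.
        chunks.append(model(xb).cpu())
preds = torch.cat(chunks).numpy()
```

With the unpacking in the for statement, the inner loop is no longer needed, and the per-batch outputs can be concatenated once at the end instead of extending a Python list row by row.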