Efficient on-GPU random tensor perturbation


I’ve been using PyTorch for neuroevolution, and while it has been successful, I have run into some performance bottlenecks.

Specifically, I need to derive a number of perturbations of a base tensor stack from a pseudo-random number generator and evaluate whether each perturbation improves the loss. The critical part is being able to retrieve the random seed used for a perturbation and re-derive the tensor perturbation from it.
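To make that requirement concrete, a minimal sketch of the seed round trip might look like the following. The helper name `derive_perturbation` and the `std` parameter are made up for illustration, and the sketch assumes that a per-call `torch.Generator` is acceptable in place of the global seed. Reproducibility is only assumed for the same device and PyTorch build.

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def derive_perturbation(seed, shape, std, device):
    """Re-create the noise tensor for `seed` directly on `device` (hypothetical helper)."""
    gen = torch.Generator(device=device)
    gen.manual_seed(seed)
    # Sampling happens on `device`; only the integer seed comes from Python.
    return torch.randn(shape, generator=gen, device=device) * std


first = derive_perturbation(1234, (3, 4), 0.1, device)
again = derive_perturbation(1234, (3, 4), 0.1, device)  # same seed: same tensor
other = derive_perturbation(1235, (3, 4), 0.1, device)  # different seed

print(torch.equal(first, again))
print(torch.equal(first, other))
```

Storing only the integer seed per candidate is then enough to rebuild its perturbation later, as long as the shape, scale, and device stay the same.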

So far, I have done this by deep-copying the original tensors, applying a random perturbation to produce new perturbed tensors, and running a forward-propagation pass.

The approximate code for that procedure is:

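import secrets
from collections import OrderedDict
from copy import deepcopy

import torch
import torch.nn as nn

# active_net is the nn.Module containing the ANN model
base_state_dict = deepcopy(active_net.main.state_dict())

fixed_rand_seed = secrets.randbelow(< operations to conform to pytorch seed sizing>)
# seed the global generator so the perturbation can be re-derived later
torch.manual_seed(fixed_rand_seed)

new_state_dict = OrderedDict()

for _name, _tensor in base_state_dict.items():
    # the noise is sampled on the default (CPU) device, then copied to the GPU
    perturbation_tensor = torch.normal(0.0, tensor_av*diameter, size=_tensor.shape).cuda()
    new_state_dict[_name] = _tensor + perturbation_tensor

# load the perturbed weights before evaluating them
active_net.main.load_state_dict(new_state_dict)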


loss_fn = nn.CrossEntropyLoss(ignore_index=-1)
valid_loss = 0.0  # running sum of per-sample losses

for img, lbl in dataloader:
    img = img.cuda()
    lbl = lbl.cuda()

    predict = active_net.forward(img)
    loss = loss_fn(predict, lbl)
    valid_loss += loss.item() * img.size(0)

# average once, after all batches have been accumulated
valid_loss = valid_loss / len(dataloader.sampler)

However, this causes a lot of transfers between the CPU and the GPU, which do not seem easy to remove: I draw each seed from a cryptographically secure generator, which runs in CPU-executed Python, before setting it as the fixed random seed in PyTorch. Unfortunately, these transfers rapidly become a bottleneck in my computational experiments.

Is there a way to achieve the same result more efficiently, without that GPU-CPU-GPU transfer and data re-loading on the GPU?
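One direction that might avoid the round trip, sketched under assumptions rather than as a tested solution: sample the noise with a device-side `torch.Generator` and apply it to the weights in place, so that only the integer seed crosses from Python to the GPU. The helper `perturb_`, the `std` argument, and the `nn.Linear` stand-in for `active_net.main` are hypothetical. The sketch perturbs parameters only, not buffers, and undoing a perturbation by subtracting the same noise restores the weights only up to floating-point rounding.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


@torch.no_grad()
def perturb_(module, seed, std, sign=1.0):
    """Add (sign=+1) or subtract (sign=-1) the noise derived from `seed`, in place."""
    dev = next(module.parameters()).device
    gen = torch.Generator(device=dev)
    gen.manual_seed(seed)
    for param in module.parameters():
        # The noise is created on the parameter's device; nothing is copied from the CPU.
        noise = torch.randn(param.shape, generator=gen, device=dev, dtype=param.dtype)
        param.add_(noise, alpha=sign * std)


net = nn.Linear(4, 2).to(device)  # hypothetical stand-in for active_net.main
before = [p.detach().clone() for p in net.parameters()]

perturb_(net, seed=42, std=0.05)  # candidate weights: evaluate the loss here
changed = any(not torch.equal(p, b) for p, b in zip(net.parameters(), before))

perturb_(net, seed=42, std=0.05, sign=-1.0)  # rejected candidate: undo the noise
restored = all(
    torch.allclose(p, b, atol=1e-6) for p, b in zip(net.parameters(), before)
)
print(changed, restored)
```

If exact restoration matters, keeping one copy of the base weights on the GPU and copying from it with `copy_` would avoid the rounding drift, at the cost of extra device memory.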