Should I use detach() when cloning intermediate tensors during training?

The task at hand is to classify a given “set” of images as positive or negative (binary classification with “sets” of images as input). For this purpose, I am training a VGG16 architecture to generate a representation for each image, combining the representations of multiple images, and using a cross-entropy loss to predict the set label as positive or negative.

For the training data, I only have positive sets of images available. Therefore, I decided to create each negative set by combining randomly sampled images from different positive sets. This is done in each mini-batch (online generation of negative sets while training).

Procedure for generating negative sets:

  1. positive_items = vgg16(positive_images)
  2. positive_items_copy = positive_items.clone()
  3. negative_items = positive_items_copy[torch.randperm(positive_items_copy.shape[0])]

Finally, I just use the positive and negative items to calculate the loss (binary cross-entropy in this case), i.e. predicting whether each set of images is positive or negative.

My question is:
Am I correct to simply clone the representations of the positive images (step 2 in the code above) and then randomly shuffle them to get a negative set of representations, without using .detach()? I expect the loss function to back-propagate gradients from both the positive and the negative examples to the weights of VGG16, so I think this is the right way to do it.
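To make the question concrete, here is a minimal sketch of how I would check that clone() without detach() keeps the negative items connected to the network. The small nn.Linear is only a hypothetical stand-in for VGG16, and the shapes, the toy loss, and the variable names are assumptions for illustration, not my real training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for VGG16: any differentiable encoder behaves the same way here.
encoder = nn.Linear(8, 4)
positive_images = torch.randn(6, 8)  # 6 fake "images" with 8 features each

positive_items = encoder(positive_images)
positive_items_copy = positive_items.clone()  # clone() stays in the autograd graph
perm = torch.randperm(positive_items_copy.shape[0])
negative_items = positive_items_copy[perm]

# Toy loss that depends only on the negative items.
negative_loss = negative_items.pow(2).sum()
negative_loss.backward()

# If the graph is intact, the encoder weights receive a non-zero gradient.
grad_reaches_encoder = (
    encoder.weight.grad is not None
    and bool(encoder.weight.grad.abs().sum() > 0)
)
print(negative_items.requires_grad, grad_reaches_encoder)
```

As far as I understand, indexing with a permutation tensor already returns a new tensor that is part of the graph, so the clone() in step 2 may not be strictly necessary, but it should not hurt gradient flow either.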

Another way (which I think is wrong) would be to clone the positive items with detach and then set requires_grad=True (as shown below). I think this would prevent the errors from the negative examples from updating the weights of the VGG16 network.

  1. positive_items = vgg16(positive_images)
  2. positive_items_copy = positive_items.detach().clone()
  3. positive_items_copy.requires_grad_(True)
  4. negative_items = positive_items_copy[torch.randperm(positive_items_copy.shape[0])]
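A minimal sketch of why I suspect the detach variant is wrong. Again, nn.Linear is only a hypothetical stand-in for VGG16, and the shapes, the toy loss, and the names are assumptions. The detached copy becomes a new leaf tensor, so backward() stops there and the encoder never sees the gradient from the negative items.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for VGG16.
encoder = nn.Linear(8, 4)
positive_images = torch.randn(6, 8)

positive_items = encoder(positive_images)
detached_copy = positive_items.detach().clone()  # cut off from the encoder's graph
detached_copy.requires_grad_(True)               # now a fresh leaf tensor
negative_items = detached_copy[torch.randperm(detached_copy.shape[0])]

# Toy loss that depends only on the negative items.
negative_items.pow(2).sum().backward()

copy_is_leaf = detached_copy.is_leaf                # graph starts at the copy
copy_has_grad = detached_copy.grad is not None      # gradient lands on the copy
encoder_untouched = encoder.weight.grad is None     # nothing reaches the encoder
print(copy_is_leaf, copy_has_grad, encoder_untouched)
```

In this sketch, setting requires_grad=True only makes autograd track the copy itself; it does not reconnect the copy to the encoder that produced it.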

I know this question is much longer than usual, but any help would be really appreciated. Thanks in advance!