Omitting backprop for random batch samples in a specified part of a NN


Sorry if a question like this has been asked before. I could not find anything, but I am also not certain of the correct terminology. I would like to implement the following for a multiple-instance-learning (MIL) task:

I have a network that consists of three parts: a feature extraction CNN; a fully connected attention network that predicts a weight for each bag instance, with the weights then used to compute a weighted average of the instance-wise feature vectors; and a classifier network that operates on this bag-level representation. This setup benefits from very large batch sizes. I would like to make a forward pass with a very large batch, update the weights of the attention network and the classifier network with gradients from all bag instances/samples, but update the feature extractor CNN using only a small fraction of the samples, e.g. 10%.
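To make the setup concrete, here is a minimal sketch of what I have in mind, assuming PyTorch. The three tiny modules, the function name `bag_forward`, and the parameter `grad_fraction` are hypothetical stand-ins for illustration, not my actual model. The idea is to run only a random subset of instances through the feature extractor with autograd enabled and the rest under `torch.no_grad()`, so that (as far as I understand) no activations are stored for the latter, while the attention and classifier networks still see every instance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the three parts of the model.
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (num_instances, 4)
)
attention = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
classifier = nn.Linear(4, 2)


def bag_forward(bag, grad_fraction=0.1):
    """Forward one bag; only a random fraction of instances is tracked
    by autograd inside the feature extractor."""
    n = bag.shape[0]
    n_grad = max(1, int(round(grad_fraction * n)))
    perm = torch.randperm(n)
    grad_idx, nograd_idx = perm[:n_grad], perm[n_grad:]

    # Instances that will contribute gradients to the feature extractor.
    feats_grad = feature_extractor(bag[grad_idx])
    # Remaining instances: no graph is built, so no activations are kept.
    with torch.no_grad():
        feats_nograd = feature_extractor(bag[nograd_idx])

    # Instance order does not matter for the attention-weighted average.
    feats = torch.cat([feats_grad, feats_nograd], dim=0)
    weights = torch.softmax(attention(feats), dim=0)       # (n, 1)
    bag_repr = (weights * feats).sum(dim=0, keepdim=True)  # (1, 4)
    return classifier(bag_repr)                            # (1, 2)


bag = torch.randn(20, 1, 8, 8)  # one bag with 20 instances
logits = bag_forward(bag)
logits.sum().backward()
```

In this sketch the attention and classifier parameters receive gradients that depend on all 20 instances, while the feature extractor is only updated through the 2 instances that were run with autograd enabled.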

Is it possible to do this in a way that reduces the GPU memory requirements? If so, could you point me in the right direction? In my naive thinking, I would simply need something like dropout applied in the backward path of a layer placed between the feature extractor CNN and the fully connected networks, which acts as an identity function in the forward pass.
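A rough sketch of the kind of layer I mean, again assuming PyTorch. The class name `GradDropout` and the `keep_mask` argument are my own hypothetical names. It uses a custom `torch.autograd.Function` that returns its input unchanged in the forward pass and zeroes the gradient rows of unselected samples in the backward pass. I suspect this alone does not reduce memory, since the feature extractor's activations for all samples are still stored during the forward pass.

```python
import torch


class GradDropout(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, zeroes the
    gradient of every sample whose entry in keep_mask is False."""

    @staticmethod
    def forward(ctx, x, keep_mask):
        ctx.save_for_backward(keep_mask)
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        (keep_mask,) = ctx.saved_tensors
        # Reshape the per-sample mask so it broadcasts over feature dims.
        shape = (-1,) + (1,) * (grad_output.dim() - 1)
        mask = keep_mask.view(shape).to(grad_output.dtype)
        # No gradient is defined for the mask itself.
        return grad_output * mask, None


torch.manual_seed(0)
features = torch.randn(10, 4, requires_grad=True)  # stand-in for CNN output
keep_mask = torch.zeros(10, dtype=torch.bool)
keep_mask[torch.randperm(10)[:1]] = True           # keep 10% of the samples

out = GradDropout.apply(features, keep_mask)
out.sum().backward()
# features.grad is now nonzero only in the row selected by keep_mask.
```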

I am admittedly unsure how to approach this. Any pointers would be much appreciated!