I have two tensors of sizes (32, 512, 7, 7) and (32, 512), where 32 is the batch size and 512 is the number of channels. Let us call them A and B.
Now what I need to do is this:

for i in range(batch_size):
    for j in range(channel_size):
        A[i, j, :, :] = A[i, j, :, :] * B[i, j]

In other words, I want every element of the 7x7 matrix (for the i-th batch and j-th channel) to be multiplied by the corresponding scalar B[i, j].

Using for loops becomes highly inefficient as the batch size increases.
Is there another way of doing the same thing?
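For reference, a broadcast multiply gives the same result as the loop; a minimal sketch (small sizes for illustration, names A and B as above) that checks the equivalence:

```python
import torch

batch_size, channel_size = 4, 8
A = torch.randn(batch_size, channel_size, 7, 7)
B = torch.randn(batch_size, channel_size)

# Loop version from the question.
A_loop = A.clone()
for i in range(batch_size):
    for j in range(channel_size):
        A_loop[i, j, :, :] = A_loop[i, j, :, :] * B[i, j]

# Broadcast version: reshape B to (batch, channel, 1, 1) so it
# broadcasts over the two trailing 7x7 dimensions.
A_broadcast = A * B[:, :, None, None]

assert torch.allclose(A_loop, A_broadcast)
```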

Thanks @tom for your reply. Is there any other solution to this problem? I have shown a simple example here. My actual implementation is very similar, except that it requires an out-of-place computation (C = A * B[:, :, None, None]). The out-of-place version of the solution given by @tom turns out to be highly memory inefficient in my case for a batch size of 64. The error message is:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB (GPU 0; 23.65 GiB total capacity; 2.12 GiB already allocated; 20.16 GiB free; 2.20 GiB reserved in total by PyTorch)
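For what it's worth, with the shapes given the out-of-place product itself is tiny: the only new allocation is C, which has exactly A's shape, so a 256 GiB request suggests the broadcast produced a much larger shape than intended somewhere. A quick sanity check (sizes as in the question, assuming float32):

```python
import torch

batch_size, channel_size = 64, 512
A = torch.randn(batch_size, channel_size, 7, 7)
B = torch.randn(batch_size, channel_size)

# Out-of-place broadcast multiply: allocates one new tensor, C,
# with the same shape as A.
C = A * B[:, :, None, None]
assert C.shape == (batch_size, channel_size, 7, 7)

# Size of the new allocation: 64 * 512 * 7 * 7 * 4 bytes ~ 6.1 MiB.
print(C.numel() * C.element_size())  # 6422528
```

Printing the shapes of both operands right before the failing multiply should show which dimension is being broadcast unexpectedly.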