Efficient operation for subtracting all rows of one matrix from the other?

I have two tensors

a.shape = [4,6000,1000]
b.shape = [4,6000,1000]

And I would like to perform subtraction between all row pairs.
I.e., for

[[1,2,3]     -  [[3,4,5]
 [3,4,5]]        [6,7,8]]

[[[-2,-2,-2],[-5,-5,-5]],
 [[0,0,0],[-3,-3,-3]]]

Using unsqueeze I was able to do so, with the following scheme:
a.unsqueeze(1) - b.unsqueeze(2)
And it works as expected.
Unfortunately, with the specified shapes, I get
CUDA out of memory. Tried to allocate 724.37 GiB

Is there a way to do it, even if slower, in a vectorized manner?

If a and b have shape (k, n, m), then your output has shape (k, n, n, m). This means that, with a 4-byte dtype such as int32 or float32, you'll need to allocate at least 4*k*n*n*m bytes. In your case that comes to 4 * 4 * 6000 * 6000 * 1000 bytes, about 576 GB (roughly 536 GiB), which is far more than a single GPU typically provides. So you will probably need to compute the output lazily, one piece at a time, so that each piece fits in GPU memory.
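One way to do the lazy calculation is to process b in blocks of rows, so only a (k, chunk, n, m) slice of the result exists at any time. The following is a minimal sketch, not a tuned implementation: the function names pairwise_diff_bytes and pairwise_diff_chunks and the chunk parameter are made up for illustration, and it assumes you can consume or reduce each block before asking for the next one.

```python
import torch


def pairwise_diff_bytes(k, n, m, itemsize=4):
    """Bytes needed to materialize the full (k, n, n, m) result.

    itemsize=4 assumes a 4-byte dtype such as int32 or float32.
    """
    return itemsize * k * n * n * m


def pairwise_diff_chunks(a, b, chunk=16):
    """Yield (start, block) pairs covering a.unsqueeze(1) - b.unsqueeze(2).

    Each block has shape (k, c, n, m) with c <= chunk and equals
    rows start:start+c (along dim 1) of the full result, i.e.
    block[:, i, j] == a[:, j] - b[:, start + i].
    """
    n = b.shape[1]
    a_exp = a.unsqueeze(1)  # (k, 1, n, m); a view, no copy
    for start in range(0, n, chunk):
        b_blk = b[:, start:start + chunk].unsqueeze(2)  # (k, c, 1, m)
        yield start, a_exp - b_blk  # broadcasts to (k, c, n, m)


if __name__ == "__main__":
    # Small CPU example; the same code works on CUDA tensors.
    a = torch.randn(2, 5, 3)
    b = torch.randn(2, 5, 3)
    for start, block in pairwise_diff_chunks(a, b, chunk=2):
        # Use or reduce the block here, then let it be freed.
        print(start, tuple(block.shape))
```

For the shapes in the question, a chunk of 16 rows in a 4-byte dtype needs 4 * 4 * 16 * 6000 * 1000 bytes, about 1.5 GB per block, so pick chunk to match the memory you have free. This only helps if you do not need the whole result at once. If you do, it has to live somewhere with enough space, such as CPU RAM or disk.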