Relationship between GPU Memory Usage and Batch Size

Hi ptrblck! I’ve had a different experience (where I get OOM from the gradients) when I take many (random) outputs from a model and accumulate them inside a tensor. Based on your comment, I would’ve imagined this would behave comparably to using a larger batch size, where the gradients don’t take additional memory… Could you take a look at my question? A sketch of what I mean is below.
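
For context, here is a minimal sketch of the pattern I’m describing (the model and sizes are made up, not my actual code):

```python
import torch
import torch.nn as nn

# Toy setup -- stand-in for my real model and loop.
model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

outputs = []
for _ in range(100):
    out = model(x)          # each forward pass builds its own computation graph
    outputs.append(out)     # keeping `out` keeps that graph (activations) alive
acc = torch.stack(outputs)  # memory grows with the number of stored outputs

# By contrast, a single forward pass with a larger batch only grows the
# activations of that one pass; the parameter gradients stay the same size.
```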