TPU: Resource exhausted, although there seems to be enough memory

I am trying to train a ViT on a Cloud TPU using a PyTorch implementation. For small datasets everything works fine, but as soon as I use a larger dataset (ImageNet) I run into an error I do not understand:

2021-11-10 17:01:24.065497: W tensorflow/core/framework/] OP_REQUIRES failed at : Resource exhausted: Attempting to reserve 5.88G at the bottom of memory. That was not possible. There are 14.74G free, 0B reserved, and 4.62G reservable.

I am aware that memory requirements increase with larger images and batch sizes, but what I don't understand is the error message itself: how can there be 14.74G free with 0B reserved, but only 4.62G reservable? And what direction could I look into to resolve this issue?

Some details on my setup:

image_size: 224 x 224
batch_size: 128
model parameters: ~86M

TPU: v3-8
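For context, here is a rough back-of-envelope calculation of the static memory footprint implied by the numbers above. The fp32 precision and the Adam optimizer (two extra state tensors per parameter) are my assumptions, not stated in the setup:

```python
# Rough memory estimate for the setup above.
# Assumptions (not from the original setup): fp32 (4 bytes/value), Adam optimizer.
params = 86e6                    # ~86M model parameters
bytes_per_float = 4              # fp32

weights = params * bytes_per_float       # model weights
grads = weights                          # one gradient per weight
adam_states = 2 * weights                # Adam keeps exp_avg and exp_avg_sq
total = weights + grads + adam_states

batch_input = 128 * 3 * 224 * 224 * bytes_per_float  # one input batch (NCHW)

print(f"weights+grads+optimizer: {total / 2**30:.2f} GiB")
print(f"one input batch:         {batch_input / 2**30:.3f} GiB")
```

Under these assumptions the static tensors come to only about 1.3 GiB and a batch of inputs to well under 0.1 GiB, so the 5.88G allocation the runtime is attempting is presumably dominated by intermediate activations, which grow with both image size and batch size.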

Python 3.8.10
Ubuntu 20.04.2 LTS

Relevant packages:
torch-xla @