I am trying to train a ViT on a Cloud TPU using a PyTorch implementation. For small datasets everything works fine. As soon as I use a larger dataset (ImageNet) I run into an error I do not understand:
2021-11-10 17:01:24.065497: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at tpu_execute_op.cc:266 : Resource exhausted: Attempting to reserve 5.88G at the bottom of memory. That was not possible. There are 14.74G free, 0B reserved, and 4.62G reservable.
I am aware that the memory requirements increase with larger images and batch sizes, but what I don't understand is the error message itself. How can there be 14.74G free with 0B reserved but only 4.62G reservable, and what direction could I look into to resolve this issue?
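To give a sense of scale, here is my rough back-of-envelope estimate of where the memory goes. All of it is assumption (fp32, no gradient checkpointing, ViT-Base/16 dimensions, and a guessed per-layer activation multiplier), not profiler output:

```python
# Rough memory estimate for ViT-Base/16 at 224x224, batch 128.
# Everything below is a back-of-envelope assumption, not measured.

BYTES_FP32 = 4

params = 86e6                      # ~86M parameters
batch = 128
layers, heads, hidden = 12, 12, 768
patch = 16
tokens = (224 // patch) ** 2 + 1   # 196 patches + class token = 197

# Parameters + gradients + Adam moments (m and v): 4 fp32 copies of the weights.
weights_gb = params * BYTES_FP32 * 4 / 1e9

# Attention score matrices: batch x heads x tokens x tokens, per layer.
attn_gb = batch * heads * tokens**2 * BYTES_FP32 * layers / 1e9

# Token activations kept for backward: several batch x tokens x hidden
# tensors per layer (8 is a guessed multiplier).
acts_gb = batch * tokens * hidden * BYTES_FP32 * 8 * layers / 1e9

print(f"weights+optimizer : {weights_gb:.2f} GB")
print(f"attention maps    : {attn_gb:.2f} GB")
print(f"activations       : {acts_gb:.2f} GB")
```

So even a crude estimate lands in the same ballpark as the 5.88G allocation the runtime is failing on, which is why the "reservable" number confuses me more than the overall size.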
Some details on my setup:
Model/Training
image_size: 224 x 224
batch_size: 128
model parameters: ~86M
Hardware
TPU: v3-8
Environment
Python 3.8.10
Ubuntu 20.04.2 LTS
Relevant Packages
tf-nightly==2.6.0
torch==1.9.0
torch-xla @ https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.9-cp38-cp38-linux_x86_64.whl
torchvision==0.10.0