I am trying to train a ViT on a Cloud TPU using a PyTorch implementation. For small datasets everything works fine. As soon as I use a larger dataset (ImageNet) I run into an error I do not understand:
2021-11-10 17:01:24.065497: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at tpu_execute_op.cc:266 : Resource exhausted: Attempting to reserve 5.88G at the bottom of memory. That was not possible. There are 14.74G free, 0B reserved, and 4.62G reservable.
I am aware that the memory requirements increase with larger images and batch sizes, but what I don't understand is the error message itself. How can there be 14.74G free with 0B reserved but only 4.62G reservable, and what direction could I look into to resolve this issue?
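To give a sense of scale, here is my rough back-of-envelope estimate of where the memory goes. All of it is assumption (fp32, no gradient checkpointing, ViT-Base/16 dimensions, and a guessed per-layer activation multiplier), not profiler output:

```python
# Rough memory estimate for ViT-Base/16 at 224x224, batch 128.
# Everything below is a back-of-envelope assumption, not measured.

BYTES_FP32 = 4

params = 86e6                      # ~86M parameters
batch = 128
layers, heads, hidden = 12, 12, 768
patch = 16
tokens = (224 // patch) ** 2 + 1   # 196 patches + class token = 197

# Parameters + gradients + Adam moments (m and v): 4 fp32 copies of the weights.
weights_gb = params * BYTES_FP32 * 4 / 1e9

# Attention score matrices: batch x heads x tokens x tokens, per layer.
attn_gb = batch * heads * tokens**2 * BYTES_FP32 * layers / 1e9

# Token activations kept for backward: several batch x tokens x hidden
# tensors per layer (8 is a guessed multiplier).
acts_gb = batch * tokens * hidden * BYTES_FP32 * 8 * layers / 1e9

print(f"weights+optimizer : {weights_gb:.2f} GB")
print(f"attention maps    : {attn_gb:.2f} GB")
print(f"activations       : {acts_gb:.2f} GB")
```

So even a crude estimate lands in the same ballpark as the 5.88G allocation the runtime is failing on, which is why the "reservable" number confuses me more than the overall size.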
Some details on my setup:
Model/Training
image_size: 224 x 224
batch_size: 128
model parameters: ~86M
Hardware
TPU: v3-8
Environment
Python 3.8.10
Ubuntu 20.04.2 LTS
Relevant Packages
tf-nightly==2.6.0
torch==1.9.0
torch-xla @ https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.9-cp38-cp38-linux_x86_64.whl
torchvision==0.10.0