Libprotobuf error causes crash in the middle of training

I’m training an object detection model in PyTorch on Ubuntu 16.04 with a Titan X Pascal GPU. In the middle of training (after tens of thousands of iterations, e.g. around 70k), the run crashes with the message:

[libprotobuf FATAL google/protobuf/wire_format.cc:830] CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
Command terminated by signal 6

I’ve searched, but couldn’t find anyone experiencing similar errors. I’m using tensorboard-pytorch to visualize the training artifacts. Any ideas on how to resolve this issue?

Can you check whether, by any chance, you are running out of memory or disk space?
It looks like one of those errors where it tried to write X bytes but could only write fewer than X due to some unknown constraint.
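In case it helps, here is a minimal sketch of how one might log GPU memory and free disk space every few hundred iterations to catch a slow leak before the crash (assumes Python 3 and a recent PyTorch; `log_resource_usage` and its `log_dir` argument are hypothetical names):

```python
import shutil
import torch

def log_resource_usage(log_dir="."):
    """Print rough GPU memory and disk usage so a slow leak shows up before the crash."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024 ** 2  # MiB held by live tensors
        reserved = torch.cuda.memory_reserved() / 1024 ** 2    # MiB held by the caching allocator
        print(f"GPU memory: {allocated:.0f} MiB allocated, {reserved:.0f} MiB reserved")
    total, used, free = shutil.disk_usage(log_dir)
    print(f"Disk ({log_dir}): {free / 1024 ** 3:.1f} GiB free of {total / 1024 ** 3:.1f} GiB")
```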

It’s training now, and the GPU is only using 7293MiB / 12189MiB. Since the crash happens almost at random, it’s difficult to pin down. I’m currently scaling the shortest image side to 800 pixels, but could try reducing that to 600 pixels to see whether the issue goes away. My batch size is already 1 image.
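For reference, resizing by the shortest side is what torchvision’s `Resize` does when given a single integer, so the change described above would be a one-line tweak; this is just a sketch, with the 800/600 values taken from the post and the rest of the transform pipeline assumed:

```python
from torchvision import transforms

# A single int passed to Resize scales the shorter edge to that size while
# preserving aspect ratio; dropping from 800 to 600 cuts the pixel count by roughly 44%.
train_transform = transforms.Compose([
    transforms.Resize(600),  # was transforms.Resize(800)
    transforms.ToTensor(),
])
```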

I just noticed that I had a zombie process taking up some GPU memory, so that may have been the culprit. After killing it, my GPU memory usage is down to 4228MiB / 12189MiB. Hopefully that was the issue.

The error turned out to be related to tensorboard after all. I removed the code that writes the tfevents files, and training completed just fine.
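For anyone who hits the same crash, here is a minimal sketch of guarding the tfevents writing behind a flag so it can be switched off without deleting the logging code. The `SummaryWriter` calls are standard tensorboard-pytorch API (the import path assumes the `tensorboardX` module name used by later releases), while `train_step`, `train_loader`, and the flag name are hypothetical:

```python
from tensorboardX import SummaryWriter  # tensorboard-pytorch's writer

USE_TENSORBOARD = False  # flip back to True once the crash is understood

writer = SummaryWriter("runs/exp1") if USE_TENSORBOARD else None

for step, (images, targets) in enumerate(train_loader):  # train_loader: hypothetical DataLoader
    loss = train_step(images, targets)                    # train_step: hypothetical training step
    if writer is not None:
        # Keep all add_* calls on the main thread; the protobuf CHECK failure above
        # complains about the event message being modified during serialization.
        writer.add_scalar("train/loss", loss, step)

if writer is not None:
    writer.close()  # flush any buffered events before exiting
```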
