Instance segmentation training runs a variable number of epochs and hits an IndexError

I am following the Finetuning Instance Segmentation tutorial with my own data.

I am at the point in the tutorial where the model is trained for 10 epochs. All the code up to this point seems to work, but when I run this section the results are inconsistent. The first time, it ran for one epoch and then raised an error. On the second attempt it raised the same error, but before finishing a single epoch. On the third attempt it ran for two epochs and then raised the error again. In the first two failures this line read “…process 0”; in the third attempt:

IndexError: Caught IndexError in DataLoader worker process 1.

In all three attempts, the following line is always the same:

IndexError: too many indices for tensor of dimension 1

This is the stacktrace:

Epoch: [0]  [ 0/19]  eta: 0:13:06  lr: 0.000282  loss: 8.1070 (8.1070)  loss_classifier: 0.4619 (0.4619)  loss_box_reg: 0.3113 (0.3113)  loss_mask: 7.3135 (7.3135)  loss_objectness: 0.0185 (0.0185)  loss_rpn_box_reg: 0.0019 (0.0019)  time: 41.3855  data: 3.8544  max mem: 0
Epoch: [0]  [10/19]  eta: 0:04:21  lr: 0.003058  loss: 1.5146 (2.5890)  loss_classifier: 0.3459 (0.3226)  loss_box_reg: 0.2827 (0.2941)  loss_mask: 0.7114 (1.8825)  loss_objectness: 0.0403 (0.0862)  loss_rpn_box_reg: 0.0034 (0.0036)  time: 29.0771  data: 0.3538  max mem: 0

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-15-973f272f5759> in <module>()
      4 for epoch in range(num_epochs):
      5     # train for one epoch, printing every 10 iterations
----> 6     train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
      7     # update the learning rate
      8     lr_scheduler.step()

5 frames

/content/engine.py in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
     24         lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
     25 
---> 26     for images, targets in metric_logger.log_every(data_loader, print_freq, header):
     27         images = list(image.to(device) for image in images)
     28         targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

/content/utils.py in log_every(self, iterable, print_freq, header)
    199         ])
    200         MB = 1024.0 * 1024.0
--> 201         for obj in iterable:
    202             data_time.update(time.time() - end)
    203             yield obj

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
   1197             else:
   1198                 del self._task_info[idx]
-> 1199                 return self._process_data(data)
   1200 
   1201     def _try_put_index(self):

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1223         self._try_put_index()
   1224         if isinstance(data, ExceptionWrapper):
-> 1225             data.reraise()
   1226         return data
   1227 

/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
    427             # have message field
    428             raise self.exc_type(message=msg)
--> 429         raise self.exc_type(msg)
    430 
    431 

IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataset.py", line 330, in __getitem__
    return self.dataset[self.indices[idx]]
  File "<ipython-input-1-4d5623c80b35>", line 62, in __getitem__
    area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
IndexError: too many indices for tensor of dimension 1

I can’t share my exact image, but this one is close.
And this is one of my masks.

This error is raised if the boxes tensor has only a single dimension while you are indexing it in dim1:

import torch

boxes = torch.randn(4)  # 1D tensor: indexing with boxes[:, 3] fails
out = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
> IndexError: too many indices for tensor of dimension 1

boxes = torch.randn(1, 4)  # 2D tensor of shape (1, 4)
out = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]) # works

You could check the shape of boxes inside the Dataset and make sure it has at least 2 dimensions.
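As a minimal sketch of that check (the check_boxes helper and its error message are illustrative additions, not part of the tutorial), failing early with the sample index makes the offending image/mask much easier to find than the DataLoader worker traceback:

import torch

def check_boxes(boxes, idx):
    # Hypothetical helper: ensure boxes is a 2D (N, 4) tensor before indexing dim1.
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    if boxes.ndim != 2 or boxes.size(-1) != 4:
        raise ValueError(f"Sample {idx}: expected boxes of shape (N, 4), got {tuple(boxes.shape)}")
    return boxes

# Inside __getitem__, right before the failing line:
# boxes = check_boxes(boxes, idx)
# area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])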

Thanks, @ptrblck. The masks I created had some noise in them, for lack of a better term, so it was hard to build the bounding boxes correctly in the Dataset class. Each mask file is supposed to contain four masks/bounding boxes. Changing the index selection to “[-4:]” fixed the problem.

from PIL import Image
import numpy as np

mask = Image.open(mask_path)

# Convert from image object to array
mask = np.array(mask)

obj_ids = np.unique(mask)

obj_ids = obj_ids[-4:]  # Previously I had [6:] to avoid the noise; that was causing my problem.
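As a side note, here is a sketch of an alternative that doesn’t hard-code the last four ids: assuming the noise shows up as object ids covering only a handful of pixels, you could filter by mask area instead (min_pixels is a hypothetical threshold, not something from the tutorial, and would need tuning to your data):

from PIL import Image
import numpy as np

mask = np.array(Image.open(mask_path))  # mask_path as in the tutorial's __getitem__

obj_ids = np.unique(mask)
obj_ids = obj_ids[1:]  # drop the background id (0), as the tutorial does

# Keep only ids whose masks cover enough pixels to be real objects.
min_pixels = 50  # hypothetical threshold
obj_ids = np.array([i for i in obj_ids if (mask == i).sum() >= min_pixels])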