ZeroDivisionError when trying to train MaskRCNN

I am trying to fine-tune MaskRCNN for instance segmentation following this tutorial. My custom dataset and model build seem to work, but I run into this ZeroDivisionError when training. Any hints on how to debug are appreciated. Happy to include more code; I did not want to overwhelm with irrelevant …


ZeroDivisionError Traceback (most recent call last)
in ()
4 for epoch in range(num_epochs):
5 # train for one epoch, printing every 10 iterations
----> 6 train_one_epoch(model, optimizer, data_loader, device, epoch,print_freq=100)
7 # update the learning rate
8 lr_scheduler.step()

1 frames
/content/utils.py in log_every(self, iterable, print_freq, header)
216 total_time_str = str(datetime.timedelta(seconds=int(total_time)))
217 print('{} Total time: {} ({:.4f} s / it)'.format(
--> 218 header, total_time_str, total_time / len(iterable)))
219
220

ZeroDivisionError: float division by zero

Based on the stack trace, I guess your log_every method raises the ZeroDivisionError because len(iterable) is zero.
Check what iterable is (I would guess it might be the DataLoader) and try to debug why its length is zero. If that’s expected, skip this logging method in your code or adapt the logging mechanism to avoid dividing by zero.
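
As a minimal sketch of the "avoid dividing by zero" option, the summary line can be built by a small helper that skips the per-iteration average when nothing was iterated. The helper name format_total_time and the "(no iterations)" wording are assumptions for illustration; they are not part of utils.py.

```python
import datetime


def format_total_time(header, total_time, num_iterations):
    """Build the summary line without dividing by zero.

    Hypothetical helper: mirrors the format string from the traceback,
    but is not part of the reference utils.py.
    """
    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
    if num_iterations == 0:
        # Nothing was iterated, so a per-iteration average is undefined.
        return "{} Total time: {} (no iterations)".format(header, total_time_str)
    return "{} Total time: {} ({:.4f} s / it)".format(
        header, total_time_str, total_time / num_iterations)


print(format_total_time("Epoch: [0]", 12.0, 4))
print(format_total_time("Epoch: [0]", 0.0, 0))
```

This only hides the symptom. An empty iterable during training usually means the dataset or DataLoader is empty, which is still worth tracking down.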

Thanks. Your comment jibes with another error I get when I run the DataLoader on my custom dataset with shuffle=True (not an issue with shuffle=False). Do I need to set a non-null default value for annotations, so that RandomSampler doesn’t run on null annotations? I'm a bit out of my depth here, so I'd appreciate any other pointers …

import urllib.request

import cv2
import numpy as np
import torch
from PIL import Image


class SkinEczemaDataset(torch.utils.data.Dataset):
    # annotations are the raw-json file exported through Labelbox's 'export labels'
    def __init__(self, annotations, transforms=None):
        #load the raw-json structure that contains both image and mask metadata
        self.annotations = annotations 
        self.transforms = transforms
        
    # as alluded to here - https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
    # getitem should return a PIL image and a target dict consisting of boxes/labels/
    def __getitem__(self, idx):
        # load images and masks
        img_uri = self.annotations[idx]['Labeled Data'] # get image URI from raw json metadata
        resp = urllib.request.urlopen(img_uri)  #get image array
        image_array = np.asarray(bytearray(resp.read()), dtype="uint8") #convert to numpy array
        image = cv2.imdecode(image_array, cv2.IMREAD_COLOR) #convert numpy array to opencv2 image
        img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  #change color map to RGB
        img = Image.fromarray(img) #translate from opencv to PIL image. what pytorch dataloader expects

        mask_uri = self.annotations[idx]['Label']['objects'][0]['instanceURI']
        #mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        resp = urllib.request.urlopen(mask_uri)
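        mask_image = np.asarray(bytearray(resp.read()), dtype="uint8")
        # decode the downloaded bytes into a 2-D array (assumption: the mask
        # can be read as a single-channel image); `mask` was undefined before
        mask = cv2.imdecode(mask_image, cv2.IMREAD_GRAYSCALE)
        obj_ids = np.unique(mask)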
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.annotations)

The error message from using this dataset with a DataLoader with shuffle=True is below …

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-91f1554f7cdb> in <module>()
     27 data_loader = torch.utils.data.DataLoader(
     28     dataset, batch_size=2, shuffle=True, num_workers=2,
---> 29     collate_fn=utils.collate_fn)
     30 
     31 data_loader_test = torch.utils.data.DataLoader(

1 frames
/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples, generator)
    101         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
    102             raise ValueError("num_samples should be a positive integer "
--> 103                              "value, but got num_samples={}".format(self.num_samples))
    104 
    105     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

I don’t know why only shuffle=True would raise the error, as it seems that your dataset might be empty.
When the DataLoader is created, a RandomSampler or SequentialSampler is created if shuffle is True or False, respectively, and no custom sampler is passed, as seen here.

In both cases the length of the data_source will be used (in your case the dataset) here or here so I would assume that both use cases fail if the dataset is empty.
Could you check print(len(dataset)) before passing it to the DataLoader and make sure it has a valid length?
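
A small sketch of that check, written as a fail-early helper so the problem surfaces before the DataLoader is built. The name require_non_empty is hypothetical, and plain lists stand in for the dataset here; anything that implements __len__ would work the same way.

```python
def require_non_empty(dataset, name="dataset"):
    """Return len(dataset), raising early if it is zero.

    Hypothetical helper for illustration; works on any object with __len__.
    """
    n = len(dataset)
    if n == 0:
        raise ValueError(
            "{} is empty (len == 0); check how it was built or split".format(name))
    return n


# Plain lists stand in for the train/test datasets in this sketch.
print(require_non_empty([10, 20, 30], name="dataset"))

try:
    require_non_empty([], name="dataset_test")
except ValueError as err:
    print(err)
```

Calling it on both dataset and dataset_test right before the DataLoader lines would point at the empty one by name.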

Thanks for helping me think through this. Adding print statements identified a copy-paste error in my code, unrelated to the dataset length.

Good to hear you’ve isolated the issue! Could you share some details about what went wrong? The errors you were seeing were quite confusing and hard to narrow down.

Candidly, the error was a double assignment of the datasets (dataset and dataset_test). The code that overrode mine (below the ‘bad code’ comment) was left over from the PyTorch instance segmentation reference code … dangers of copy-paste!

from engine import train_one_epoch, evaluate
import utils
import transforms as T


def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

# use our dataset and defined transformations
dataset = SkinEczemaDataset(annotations=labels, transforms=get_transform(train=True))
dataset_test = SkinEczemaDataset(annotations=labels, transforms=get_transform(train=False))

# split the dataset in train and test set
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
#bad code below that was not deleted
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders

data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=2,
    collate_fn=utils.collate_fn)


data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=2,
    collate_fn=utils.collate_fn)
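
One way a leftover fixed-size split like indices[:-50] can go wrong is sketched below. The thread does not state the dataset size, so the 40-item example is an assumption: with 50 or fewer items, the training slice is empty, which would match num_samples=0 on the shuffled training loader. The split_indices helper and its test_fraction parameter are hypothetical and only illustrate a size-relative alternative.

```python
import random


def split_indices(num_items, test_fraction=0.2, seed=1):
    """Shuffle indices and split them by fraction (hypothetical helper).

    Guarantees at least one item on each side of the split.
    """
    if num_items < 2:
        raise ValueError("need at least 2 items to split")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    num_test = max(1, int(round(num_items * test_fraction)))
    num_test = min(num_test, num_items - 1)
    return indices[:-num_test], indices[-num_test:]


# Assumed size for illustration: a dataset with only 40 items.
indices = list(range(40))
print(len(indices[:-50]))  # fixed-size train slice is empty
print(len(indices[-50:]))  # fixed-size test slice takes every item

train_idx, test_idx = split_indices(40)
print(len(train_idx), len(test_idx))
```

The resulting index lists could be passed to torch.utils.data.Subset in place of the hard-coded slices.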