Proper type casting with ResNet and cross entropy

I am having trouble with tensor dtypes when training a model with a pretrained ResNet18 backbone to do localization (classification and bounding box prediction of images). The backbone feeds 2 heads, both of which are simple fully connected layers. The first head has 120 output features, one for each class in the dataset; the other has 4 output features, one for each coordinate of the bounding box. The loss on the output of the first head is cross entropy, and for the second head I use mean squared error.

When a batch of images, labels, and bounding boxes is loaded using a dataloader, the labels and bounding boxes both have dtype int64. In the backpropagation step of the training loop, I get the following error: “RuntimeError: Found dtype Long but expected Float”. I figured this was due to the dtypes of the ground truth labels and bounding boxes, so I cast the tensors to float32 using .float().

However, if I do this cast before the calculation of the loss from the first head using cross entropy, I get the following error: “RuntimeError: expected scalar type Long but found Float”. This makes sense, as cross entropy is categorical, so it expects an integer type and not a float type. However, if I move the cast to after the losses have been calculated, I get the original error again: “RuntimeError: Found dtype Long but expected Float”. There must be something I am misunderstanding about backprop or the model, as I cannot see where a tensor with dtype Long is being used.

I have attached the training loop and the model I am using below, together with the attributes needed to understand the training loop. If more code is needed to provide context I will gladly supply it, but figured I should not make the post too long.

Model used:

from einops import reduce
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class ResNet18Model(nn.Module):
   def __init__(self, pretrained=True, freeze_backbone=True):
       super().__init__()

       weights = None
       if pretrained:
           weights = ResNet18_Weights.IMAGENET1K_V1

       self.backbone = nn.Sequential(*list(resnet18(weights=weights).children())[:-2])

       if freeze_backbone:
           for parameters in self.backbone.named_parameters():
               parameters[1].requires_grad = False

       self.classification_head = nn.Linear(512, 120)
       self.box_head = nn.Linear(512, 4)

   def forward(self, x):

       backbone_features = self.backbone(x)

       backbone_features = reduce(backbone_features, 'b c h w -> b c', reduction='mean')

       classification = self.classification_head(backbone_features)

       box = self.box_head(backbone_features)

       return classification, box

Context for training loop:

self.loss_classification = F.cross_entropy
self.loss_localization = F.mse_loss
self.optimizer = SGD(model.parameters(), lr=0.001)
self.learning_rate_scheduler = base_lr_scheduler  # def base_lr_scheduler(t, T, lr): return lr

Training loop that produces error:

   def train(self):
       for epoch in range(self.epochs):
           self.model.train()
           for x_batch, class_batch, box_batch in self.train_loader:

               box_batch = torch.squeeze(box_batch)
               # Update learning rate
               self.optimizer.param_groups[0]['lr'] = self.learning_rate_scheduler(
                   self.current_batch_index, self.total_batches, lr=self.optimizer.param_groups[0]['lr'])

               # Forward pass
               prediction_class, prediction_box = self.model(x_batch)

               print('Prediction class shape:', prediction_class.shape)
               print('Prediction box shape:', prediction_box.shape)
               print('Prediction class type:', prediction_class.dtype)
               print('Prediction box type:', prediction_box.dtype)

               print('Batch class shape:', class_batch.shape)
               print('Batch box shape:', box_batch.shape)
               print('Batch class type:', class_batch.dtype)
               print('Batch box type:', box_batch.dtype)

               loss_class = self.loss_classification(prediction_class, class_batch)
               loss_box = self.loss_localization(prediction_box, box_batch)
               total_loss = loss_class + loss_box

               print('Loss class type', loss_class.dtype)
               print('Loss box type', loss_box.dtype)
               print('Total loss type', total_loss.dtype)

               class_batch = class_batch.float()
               box_batch = box_batch.float()

               print('Batch class type after .float():', class_batch.dtype)
               print('Batch box type after .float():', box_batch.dtype)

               # Backprop
               total_loss.backward()

               # Update model parameters
               self.optimizer.step()
               self.optimizer.zero_grad()

The code above returns the following when training the model:

Prediction class shape: torch.Size([64, 120])
Prediction box shape: torch.Size([64, 4])
Prediction class type: torch.float32
Prediction box type: torch.float32
Batch class shape: torch.Size([64])
Batch box shape: torch.Size([64, 4])
Batch class type: torch.int64
Batch box type: torch.int64
Loss class type torch.float32
Loss box type torch.float32
Total loss type torch.float32
Batch class type after .float(): torch.float32
Batch box type after .float(): torch.float32
Traceback (most recent call last):
 File "localization_test.py", line 33, in <module>
   dog_localizer.train()
 File "/home/nmunch/computer_vision_project/dog_localization_project/dog_localization_utilities/dog_localization_utilities/localization.py", line 113, in train
   total_loss.backward()
 File "/home/nmunch/environments/computer_vision/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
   torch.autograd.backward(
 File "/home/nmunch/environments/computer_vision/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Long but expected Float

I feel like there is something obvious I am missing so I hope any of you can help me. Thanks in advance!

I don’t know why the backward pass raises the error while the forward pass seems to be fine. Could you add the missing code pieces to make it a minimal and executable code snippet?

Thank you for the interest! I have made a code snippet with all the code needed to reproduce my problem. However, I am unsure how to create sample data that is easy for you to use. I am working with the Stanford Dogs dataset (Stanford Dogs dataset for Fine-Grained Visual Categorization), where the annotations that contain the bounding boxes for each image are loaded in my custom dataset class BoundingBoxDataSet.

In the code snippet below I have removed the loading part and manually made batches that match the shapes and types I actually get when using my loader. The same problem occurs. Hopefully this is fine for you to run.


from torchvision.transforms import v2
import torch.nn.functional as F
from torch.optim import SGD
import torch
from torch.nn import Softmax
from einops import reduce
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class ResNet18Model(nn.Module):
    def __init__(self, pretrained=True, freeze_backbone=True):
        super().__init__()

        weights = None
        if pretrained:
            weights = ResNet18_Weights.IMAGENET1K_V1

        self.backbone = nn.Sequential(*list(resnet18(weights=weights).children())[:-2])

        if freeze_backbone:
            for parameters in self.backbone.named_parameters():
                parameters[1].requires_grad = False

        self.classification_head = nn.Linear(512, 120)
        self.box_head = nn.Linear(512, 4)

    def forward(self, x):

        backbone_features = self.backbone(x)

        backbone_features = reduce(backbone_features, 'b c h w -> b c', reduction='mean')

        classification = self.classification_head(backbone_features)

        box = self.box_head(backbone_features)

        return classification, box

class DogLocalizer:
    def __init__(self, data_transformer, model, optimizer, loss_classification, loss_localization, learning_rate_scheduler, batch_size, epochs):

        self.data_transformer = data_transformer
        self.model = model
        self.optimizer = optimizer
        self.loss_classification = loss_classification
        self.loss_localization = loss_localization
        self.learning_rate_scheduler = learning_rate_scheduler
        self.batch_size = batch_size
        self.epochs = epochs

        self.current_batch_index = 1
        self.batches_per_epoch = None
        self.total_batches = 10

        # Data loaders
        self.train_loader = None
        self.validation_loader = None
        self.test_loader = None

    def classification_accuracy(self, scores, batch_labels):
        score_to_probability = Softmax(dim=1)
        prediction = torch.argmax(score_to_probability(scores), dim=1)
        return (prediction == batch_labels).float().mean()

    def train(self):
        for epoch in range(self.epochs):
            self.model.train()
            
            x_batch = torch.rand((3,3,224,224), dtype=torch.float32)
            class_batch = torch.randint(low = 0, high = 120, size = (3,))
            box_batch = torch.randint(low = 1, high = 224, size = (3,1,4))

            box_batch = torch.squeeze(box_batch)

            # Update learning rate
            self.optimizer.param_groups[0]['lr'] = self.learning_rate_scheduler(
                self.current_batch_index, self.total_batches, lr=self.optimizer.param_groups[0]['lr'])

            # Forward pass
            prediction_class, prediction_box = self.model(x_batch)

            loss_class = self.loss_classification(prediction_class, class_batch)
            loss_box = self.loss_localization(prediction_box, box_batch)
            total_loss = loss_class + loss_box

            class_batch = class_batch.float()
            box_batch = box_batch.float()

            # Backprop
            total_loss.backward()

            # Update model parameters
            self.optimizer.step()
            self.optimizer.zero_grad()

def base_lr_scheduler(t, T, lr):
    return lr

cross_entropy = F.cross_entropy
L2_error = F.mse_loss
model = ResNet18Model()
optimizer = SGD(model.parameters(), lr=0.001)

batch_size = 3
epochs = 1

resnet_mean = [0.485, 0.456, 0.406]
resnet_std = [0.229, 0.224, 0.225]

resnet_transform = v2.Compose([v2.Resize((256, 256), antialias=True), v2.CenterCrop(224), v2.ToDtype(
    torch.float32, scale=True), v2.Normalize(mean=resnet_mean, std=resnet_std)])

dog_localizer = DogLocalizer(data_transformer=resnet_transform, model=model, optimizer=optimizer,
                             loss_classification=cross_entropy, loss_localization=L2_error, learning_rate_scheduler=base_lr_scheduler, batch_size=batch_size, epochs=epochs)

dog_localizer.train()


In case you think the error comes from the loader somehow (I think this is unlikely as the error also occurs when I manually create batches of the same shapes and dtypes), here is the complete code that requires the stanford dogs dataset:


from torchvision.transforms import v2
import torch.nn.functional as F
from torch.optim import SGD
import torch
from torch.utils.data import random_split, DataLoader
from torch.nn import Softmax
from torchvision.datasets import ImageFolder
from os.path import splitext
import xml.etree.ElementTree as ET
from torchvision import tv_tensors
from einops import reduce
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class BoundingBoxDataSet(ImageFolder):
    def __init__(self, image_root, transform, target_transform=None, is_valid_file=None):
        super().__init__(root=image_root, transform=transform, target_transform=target_transform, is_valid_file=is_valid_file)

    def __getitem__(self, index: int):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (sample, target) where target is class_index of the target class.
        """
        image_path, target = self.samples[index]

        annotation_path = splitext(image_path.replace('images', 'annotation'))[0]

        annotations = ET.parse(annotation_path)
        annotations_root = annotations.getroot()

        bounding_box_coordinates = annotations_root.find('object').find('bndbox')
        xmin = int(bounding_box_coordinates.find('xmin').text)
        ymin = int(bounding_box_coordinates.find('ymin').text)
        xmax = int(bounding_box_coordinates.find('xmax').text)
        ymax = int(bounding_box_coordinates.find('ymax').text)

        image = tv_tensors.Image(self.loader(image_path))

        bounding_box = tv_tensors.BoundingBoxes([xmin, ymin, xmax, ymax],
                                                format=tv_tensors.BoundingBoxFormat.XYXY,
                                                canvas_size=image.shape[-2:])

        if self.transform is not None:
            image, bounding_box = self.transform(image, bounding_box)
        if self.target_transform is not None:
            target = self.target_transform(target)

        return image, target, bounding_box


class ResNet18Model(nn.Module):
    def __init__(self, pretrained=True, freeze_backbone=True):
        super().__init__()

        weights = None
        if pretrained:
            weights = ResNet18_Weights.IMAGENET1K_V1

        self.backbone = nn.Sequential(*list(resnet18(weights=weights).children())[:-2])

        if freeze_backbone:
            for parameters in self.backbone.named_parameters():
                parameters[1].requires_grad = False

        self.classification_head = nn.Linear(512, 120)
        self.box_head = nn.Linear(512, 4)

    def forward(self, x):

        backbone_features = self.backbone(x)

        backbone_features = reduce(backbone_features, 'b c h w -> b c', reduction='mean')

        classification = self.classification_head(backbone_features)

        box = self.box_head(backbone_features)

        return classification, box

class DogLocalizer:
    def __init__(self, image_path, data_transformer, model, optimizer, loss_classification, loss_localization, learning_rate_scheduler, batch_size, epochs):

        self.image_path = image_path
        self.data_transformer = data_transformer
        self.model = model
        self.optimizer = optimizer
        self.loss_classification = loss_classification
        self.loss_localization = loss_localization
        self.learning_rate_scheduler = learning_rate_scheduler
        self.batch_size = batch_size
        self.epochs = epochs

        self.current_batch_index = 1
        self.batches_per_epoch = None

        # Data loaders
        self.train_loader = None
        self.validation_loader = None
        self.test_loader = None

    def setup_data_loaders(self, rename=True, seed=12345, data_split=[0.7, 0.2, 0.1], verbose=True):

        dataset = BoundingBoxDataSet(self.image_path, transform=self.data_transformer)

        if rename:
            # Rename classes to remove annoying characters
            def rename(name):
                return ' '.join(' '.join(name.split('-')[1:]).split('_'))
            for i, class_name in enumerate(dataset.classes):
                dataset.classes[i] = rename(class_name)

        # same split each time
        manual_seed_generator = torch.Generator().manual_seed(seed)

        train_data, validation_data, test_data = random_split(dataset, data_split, generator=manual_seed_generator)

        train_loader = DataLoader(train_data, batch_size=self.batch_size, shuffle=True, num_workers=2, drop_last=True)
        validation_loader = DataLoader(validation_data, batch_size=self.batch_size,
                                       shuffle=True, num_workers=2, drop_last=True)
        test_loader = DataLoader(test_data, batch_size=self.batch_size, shuffle=False, num_workers=2, drop_last=True)

        self.train_loader = train_loader
        self.validation_loader = validation_loader
        self.test_loader = test_loader
        self.batches_per_epoch = len(train_loader)
        self.total_batches = len(train_loader)*self.epochs

    def classification_accuracy(self, scores, batch_labels):
        score_to_probability = Softmax(dim=1)
        prediction = torch.argmax(score_to_probability(scores), dim=1)
        return (prediction == batch_labels).float().mean()

    def train(self):
        for epoch in range(self.epochs):
            self.model.train()
            for x_batch, class_batch, box_batch in self.train_loader:

                box_batch = torch.squeeze(box_batch)

                print(class_batch.argmax())
                # Update learning rate
                self.optimizer.param_groups[0]['lr'] = self.learning_rate_scheduler(
                    self.current_batch_index, self.total_batches, lr=self.optimizer.param_groups[0]['lr'])

                # Forward pass
                prediction_class, prediction_box = self.model(x_batch)

                loss_class = self.loss_classification(prediction_class, class_batch)
                loss_box = self.loss_localization(prediction_box, box_batch)
                total_loss = loss_class + loss_box

                class_batch = class_batch.float()
                box_batch = box_batch.float()

                # Backprop
                total_loss.backward()

                # Update model parameters
                self.optimizer.step()
                self.optimizer.zero_grad()


image_path = '/home/nmunch/computer_vision_project/data/images' # Path to images folder in stanford dogs dataset

def base_lr_scheduler(t, T, lr):
    return lr

cross_entropy = F.cross_entropy
L2_error = F.mse_loss
model = ResNet18Model()
optimizer = SGD(model.parameters(), lr=0.001)

batch_size = 3
epochs = 1

resnet_mean = [0.485, 0.456, 0.406]
resnet_std = [0.229, 0.224, 0.225]

resnet_transform = v2.Compose([v2.Resize((256, 256), antialias=True), v2.CenterCrop(224), v2.ToDtype(
    torch.float32, scale=True), v2.Normalize(mean=resnet_mean, std=resnet_std)])

dog_localizer = DogLocalizer(image_path=image_path, data_transformer=resnet_transform, model=model, optimizer=optimizer,
                             loss_classification=cross_entropy, loss_localization=L2_error, learning_rate_scheduler=base_lr_scheduler, batch_size=batch_size, epochs=epochs)

dog_localizer.setup_data_loaders(data_split=[0.899, 0.1, 0.001], verbose=False)
dog_localizer.train()


Hi Niels!

The short story is that you should cast only your ground-truth bounding boxes
(box_batch, the target passed to mse_loss()) to float32 before computing
your losses, leaving class_batch (the target passed to cross_entropy())
as int64.

In normal usage,* mse_loss() expects its target (as well as its input) to be
float32 (or float64), while cross_entropy() expects its target to be int64.
So if you cast neither class_batch nor box_batch to float32, mse_loss()
breaks, while if you cast both to float32, cross_entropy() breaks.

Cast just box_batch and you should be good.
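
As a rough illustration of that ordering, here is a minimal sketch. It assumes two stand-in linear heads on random 512-dimensional features instead of the ResNet18Model from the thread, and random targets with the same shapes and dtypes as in the posted snippet; the name features is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the two heads (assumption: not the thread's ResNet18Model).
classification_head = nn.Linear(512, 120)
box_head = nn.Linear(512, 4)

features = torch.rand(3, 512)                  # pretend pooled backbone features
class_batch = torch.randint(0, 120, (3,))      # int64 class indices
box_batch = torch.randint(1, 224, (3, 1, 4))   # int64 boxes, shaped like the loader output

# Cast ONLY the box targets to float32, and do it before computing the losses.
box_batch = box_batch.squeeze(1).float()

loss_class = F.cross_entropy(classification_head(features), class_batch)  # target stays int64
loss_box = F.mse_loss(box_head(features), box_batch)                      # target is float32
total_loss = loss_class + loss_box
total_loss.backward()
```

The sketch uses squeeze(1) rather than torch.squeeze(box_batch) so that the batch dimension is kept even when a batch contains a single sample.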

*) The confusing point – which is probably a minor design bug – is that
mse_loss() will still work on its forward pass if either its input or its target
is of type int64, but will then fail on the backward pass. (Go figure …)
This is why you didn’t get the “Found dtype Long” error until you called
total_loss.backward(). (And to make matters even more inconsistent,
mse_loss() will fail on the forward pass if both its input and target are
int64. (Go figure …))
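
If you want to check how your own PyTorch version treats these dtype combinations, a small probe along the following lines might help. This is only a sketch: try_mse is a hypothetical helper, and the outcome for the int64 target may differ between PyTorch versions, so no expected output is claimed for that case.

```python
import torch
import torch.nn.functional as F

prediction = torch.rand(3, 4, requires_grad=True)  # float32, like the box head output
target = torch.randint(1, 224, (3, 4))             # int64, like the uncast box_batch

def try_mse(pred, tgt):
    """Report which stage, if any, raises a RuntimeError for this dtype pair."""
    try:
        loss = F.mse_loss(pred, tgt)
    except RuntimeError:
        return "forward failed"
    try:
        loss.backward()
    except RuntimeError:
        return "backward failed"
    return "ok"

print("float32 input, int64 target:  ", try_mse(prediction, target))
print("float32 input, float32 target:", try_mse(prediction, target.float()))
```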

Best.

K. Frank

Hello KFrank

Thank you for the very thorough response. I tried casting only box_batch to float32, before calculating the losses, and that seemed to do the trick!

Best, Niels