Invalid argument with batch size 256 and a different error with batch size 512

Error with batch size 256:

Traceback (most recent call last):
  File "main_supcon.py", line 265, in <module>
    main()
  File "main_supcon.py", line 247, in main
    loss = train(train_loader, model, criterion, optimizer, epoch, opt)
  File "main_supcon.py", line 205, in train
    loss = criterion(features1,features2)
  File "/DATA/rani.1/miniconda3/envs/sim_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/rani.1/SupSim/losses.py", line 106, in forward
    sim_ij = torch.diag(similarity_matrix, self.batch_size)
RuntimeError: invalid argument 2: invalid size at /opt/conda/conda-bld/pytorch_1623448233824/work/aten/src/THC/THCStorage.cpp:26

With batch size 512, the error is:

 denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
RuntimeError: The size of tensor a (1024) must match the size of tensor b (672) at non-singleton dimension 1

@ptrblck Can you help with this?

Can you give more information with respect to the model?

A short snippet of code that reproduces your error would be very helpful for people trying to help you!


The model I am using is:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Net(nn.Module):
    """backbone + projection head"""
    def __init__(self, name='resnet50', head='mlp', feat_dim=128):
        super(Net, self).__init__()
        self.encoder = []
        # replace the first conv for small inputs; drop the fc and maxpool layers
        for layer_name, module in resnet50(pretrained=True).named_children():  # loop variable renamed so it no longer shadows the 'name' argument
            if layer_name == 'conv1':
                module = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
            if not isinstance(module, nn.Linear) and not isinstance(module, nn.MaxPool2d):
                self.encoder.append(module)
        # encoder
        self.encoder = nn.Sequential(*self.encoder)
        if head == 'linear':
            # resnet50 features are 2048-dim ('dim_in' was undefined in the original)
            self.head = nn.Linear(2048, feat_dim)
        elif head == 'mlp':
            self.head = nn.Sequential(nn.Linear(2048, 512, bias=False), nn.BatchNorm1d(512),
                                      nn.ReLU(inplace=True), nn.Linear(512, feat_dim, bias=True))
        else:
            raise NotImplementedError('head not supported: {}'.format(head))

    def forward(self, x):
        feat = self.encoder(x)
        feat = torch.flatten(feat, start_dim=1)
        feat = F.normalize(self.head(feat), dim=-1)
        return feat

I am using the mlp head, so the output has shape [batch_size, 128], since feat_dim=128.
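
As a quick sanity check of that shape (a sketch, with a hypothetical random 32x32 input):

model = Net(head='mlp', feat_dim=128)
x = torch.randn(4, 3, 32, 32)   # hypothetical batch of 4 small images
print(model(x).shape)           # torch.Size([4, 128])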

I am passing two such outputs of the above model, each of shape [batch_size, 128] (features1 and features2 in the traceback), through the loss function given below:

class ContrastiveLoss(nn.Module):  # class name and __init__ signature assumed; the post starts mid-__init__
    def __init__(self, batch_size, temperature=0.5):
        super().__init__()
        self.batch_size = batch_size
        self.register_buffer("temperature", torch.tensor(temperature))
        self.register_buffer("negatives_mask",
                             (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float())

    def forward(self, emb_i, emb_j, labels=None, mask=None):
        """emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
        z_i, z_j as per the SimCLR paper."""
        z_i = F.normalize(emb_i, dim=1)
        z_j = F.normalize(emb_j, dim=1)

        # (2 * batch_size, 2 * batch_size) pairwise cosine similarities
        representations = torch.cat([z_i, z_j], dim=0)
        similarity_matrix = F.cosine_similarity(representations.unsqueeze(1),
                                                representations.unsqueeze(0), dim=2)

        # positive pairs sit on the diagonals offset by +/- batch_size
        sim_ij = torch.diag(similarity_matrix, self.batch_size)
        sim_ji = torch.diag(similarity_matrix, -self.batch_size)
        positives = torch.cat([sim_ij, sim_ji], dim=0)

        nominator = torch.exp(positives / self.temperature)
        denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)

        loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
        loss = torch.sum(loss_partial) / (2 * self.batch_size)
        return loss
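
For reference, the batch-size-512 error can be reproduced in isolation with a partial last batch (a sketch; the class name ContrastiveLoss above and the 336-sample batch are assumptions for illustration):

criterion = ContrastiveLoss(batch_size=512)
emb_i = torch.randn(336, 128)   # a last batch smaller than batch_size
emb_j = torch.randn(336, 128)
loss = criterion(emb_i, emb_j)
# RuntimeError: The size of tensor a (1024) must match the size of tensor b (672) ...
# negatives_mask was built as 1024x1024, but similarity_matrix is only 672x672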

I guess there is no issue with the model code.

What I would suggest is to print the size of each intermediate variable, because the error is about sizes.

I feel there is a size mismatch with the similarity matrix; look into torch.diag() specifically.
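
For example (a sketch; features1, features2, and criterion are the names from the traceback above):

# just before the loss call in train():
print("features1:", features1.shape, "features2:", features2.shape)
print("loss was built for batch_size =", criterion.batch_size)
# on the last step of the epoch these will disagree, e.g. 336 vs 512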

I feel the same, but I guess the dimension after the cosine similarity is what causes the error.
The program even starts running, but stops at the last step of the epoch, where the dimensions are as follows:
Representation: torch.Size([672, 128])
Similarity matrix: torch.Size([672, 672])

Before the last step of an epoch, the dimensions are as follows:
Representation: torch.Size([1024, 128])
Similarity matrix: torch.Size([1024, 1024])

672 is causing the problem.
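
(For what it's worth, 672 is consistent with a partial final batch, since the loss sees 2 x last_batch_size representations. A hypothetical remainder would give exactly that:

batch_size = 512
last_batch = 336              # hypothetical: len(train_dataset) % batch_size
print(2 * last_batch)         # 672 -> the 672x672 similarity matrix above
)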

672 itself is not the issue; it just means the total number of inputs is not a multiple of 1024, so the last batch is smaller. If you removed those 672 items it would run, I feel; you also got a representation matrix and a similarity matrix for them.
The issue, I feel, is in sim_ij and sim_ji, i.e. in torch.diag, because 672 is less than 1024.

Are you getting the output of sim_ij and sim_ji in the last step of the epoch?

No, I am not getting sim_ij and sim_ji for the last step. Also, their dim is torch.Size([512]).

So you should go through the torch.diag operation:
https://pytorch.org/docs/stable/generated/torch.diag.html

It returns a diagonal of the matrix. For example, a 3x3 matrix only has 5 diagonals (offsets ranging from -2 to 2), so in the case of a 672x672 matrix, how could it return the diagonals expected for a 1024-sized batch?

So look into it and run the code step by step.
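
A quick sketch of those offset limits:

import torch

m = torch.randn(3, 3)
print(torch.diag(m, 0))    # main diagonal: 3 elements
print(torch.diag(m, 2))    # largest valid offset: 1 element
print(torch.diag(m, -2))   # most negative valid offset: 1 element
# an offset of 3 has no elements in a 3x3 matrix; on the PyTorch build in the
# traceback, torch.diag with such an offset is what raised "invalid size"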


But the problem changes when the batch size changes, so removing a fixed number of inputs won't work.

Try using drop_last=True in the DataLoader?
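
For example (a sketch with a dummy dataset whose length is not a multiple of the batch size):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1200, 3, 32, 32))   # 1200 is hypothetical
for drop_last in (False, True):
    loader = DataLoader(dataset, batch_size=512, drop_last=drop_last)
    print(drop_last, [batch[0].shape[0] for batch in loader])
# False -> [512, 512, 176]: the partial last batch causes the size errors above
# True  -> [512, 512]:      every batch matches the loss's batch_size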