How to use binary cross entropy with logits with a binary target and a 3D output

Pritesh_Gohil · August 7, 2019, 1:38pm

My batch size is 5.
My network output is computed by the following code: output = F.upsample(per_frame_logits, t, mode='linear')

The shape of the output is torch.Size([5, 2, 64]).
The shape of the target is torch.Size([5]) (e.g. [1.0, 0.0, 0.0, 1.0, 1.0]).

Then I pass it to the following loss function: loss = F.binary_cross_entropy_with_logits(output, target)

I get the following ValueError:

raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([5])) must be the same as input size (torch.Size([5, 2, 64]))
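
For anyone who wants to reproduce this without the full model, here is a minimal sketch of the mismatch. The random tensor is only a stand-in for the upsampled I3D output (an assumption, not the real network), but the shapes are the ones reported above.

```python
import torch
import torch.nn.functional as F

# Stand-in for the upsampled per-frame logits: [batch_size, nb_classes, frames]
output = torch.randn(5, 2, 64)
# One binary label per video: [batch_size]
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])

try:
    F.binary_cross_entropy_with_logits(output, target)
    raised = False
except ValueError as err:
    # The loss requires input and target to have identical shapes
    raised = True
    print(err)
```

The loss compares input and target element by element, so a [5] target cannot be matched against a [5, 2, 64] input.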

ptrblck · August 7, 2019, 1:45pm

Could you explain your current use case a bit?
It seems your model outputs a batch of 5 samples, each containing the logits (unnormalized scores, not probabilities) for 2 classes for a tensor of length 64.
Also, it seems you are dealing with a multi-label classification, i.e. each sample might contain none, one, or both classes. Is this correct?

If so, what do the targets correspond to? It seems you are just passing a binary target for each sample in the batch, which would point to a vanilla binary classification.

Pritesh_Gohil · August 7, 2019, 2:20pm

I’m a beginner with PyTorch and am implementing the I3D network for binary classification. The input to the network is an RGB video (64 frames at a time), and each video has a single label, which is 0 (failure) or 1 (success).

I set my batch size to 5 just to check whether my network and code are working. (I would call it a debug run.)

Each video has a single label, so I guess I should change the number of network outputs from 2 to 1.

ptrblck · August 7, 2019, 2:30pm

So the output shape would correspond to [batch_size, nb_classes, frames]?
I’m not that familiar with the i3d model, but I assume the temporal (and spatial) dimensions were reduced somehow?

The current output format would correspond to a frame-wise multi-label classification.

In that case, you could use nn.BCEWithLogitsLoss (or nn.BCELoss + sigmoid) with a single output channel. Alternatively, you could keep the two output channels and use nn.CrossEntropyLoss.
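
A minimal sketch of both options for the frame-wise case, assuming random stand-in tensors in place of real model outputs and labels (the variable names are only for illustration):

```python
import torch
import torch.nn as nn

batch_size, frames = 5, 64

# Option 1: a single output channel with nn.BCEWithLogitsLoss.
# The target is a float tensor with the same shape as the logits.
logits_1ch = torch.randn(batch_size, 1, frames)
target_float = torch.randint(0, 2, (batch_size, 1, frames)).float()
loss_bce = nn.BCEWithLogitsLoss()(logits_1ch, target_float)

# Option 2: two output channels with nn.CrossEntropyLoss.
# The target holds integer class indices and has no channel dimension.
logits_2ch = torch.randn(batch_size, 2, frames)
target_idx = torch.randint(0, 2, (batch_size, frames))
loss_ce = nn.CrossEntropyLoss()(logits_2ch, target_idx)

print(loss_bce.item(), loss_ce.item())
```

Note the difference in target format: float values shaped like the logits for the BCE variant, and long class indices of shape [batch_size, frames] for the cross-entropy variant.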

Pritesh_Gohil · August 7, 2019, 2:51pm

Below is the link to the author’s I3D network. In their case, they perform frame-wise multi-label classification.
GitHub - piergiaj/pytorch-i3d (I3D network for the Charades dataset)

I’m using Visual-Tactile dataset and

I3D was designed on the Kinetics dataset, and I didn’t change the default architecture in the file “pytorch_i3d.py” from the above link.

I’m also new to this, but according to the author the network takes 64 input frames, so I converted each video to 64 frames. I do not have to do multi-label classification.

I probably need to change the final layer, since I don’t want multi-label classification.

I still have the dimension issue, because the target and output clearly do not have the same shape, which the loss function expects.

ptrblck · August 8, 2019, 11:29pm

If you would like to classify each video sequence (64 frames) to a single class (binary classification), your output and target should both have the shape [batch_size, 1].
To achieve this you would need to reduce the model’s output, e.g. using an nn.Linear layer as the final classifier.
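
A rough sketch of that idea, assuming the per-frame logits keep the [5, 2, 64] shape from the first post. The classifier head here is hypothetical and not part of the I3D repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, nb_classes, frames = 5, 2, 64

# Stand-in for the model's per-frame logits: [batch_size, nb_classes, frames]
per_frame_logits = torch.randn(batch_size, nb_classes, frames)
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])  # one label per video

# Hypothetical final classifier: flatten classes and frames, map to one logit
classifier = nn.Linear(nb_classes * frames, 1)
video_logits = classifier(per_frame_logits.flatten(start_dim=1))  # [batch_size, 1]

# Give the target the same [batch_size, 1] shape as the output
loss = F.binary_cross_entropy_with_logits(video_logits, target.unsqueeze(1))
print(video_logits.shape, loss.item())
```

The extra layer is trained together with the rest of the model, so it learns how to combine the per-frame scores into one prediction per video.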

Pritesh_Gohil · August 11, 2019, 5:46pm

Thank you @ptrblck
Now it’s working. Yes, that’s a possible solution; I tried it and it worked. But instead of disturbing the I3D architecture, I converted the output of the network into [batch_size, 1] by max-pooling with the dimension of 1 and then squeezed the output, which makes my target and output shapes the same.
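
The exact code is not shown in the thread, so the following is only one possible reading of this approach. It assumes the network was changed to a single output channel (per-frame logits of shape [5, 1, 64]) and uses F.adaptive_max_pool1d to pool over the frames:

```python
import torch
import torch.nn.functional as F

# Assumption: single-channel per-frame logits, [batch_size, 1, frames]
per_frame_logits = torch.randn(5, 1, 64)
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])  # [batch_size]

# Max-pool over the temporal dimension down to length 1: [batch_size, 1, 1]
pooled = F.adaptive_max_pool1d(per_frame_logits, 1)

# Drop the two trailing singleton dimensions so the shape matches the target.
# Squeezing explicit dims keeps the batch dimension even if batch_size is 1.
video_logits = pooled.squeeze(-1).squeeze(-1)  # [batch_size]

loss = F.binary_cross_entropy_with_logits(video_logits, target)
print(video_logits.shape, loss.item())
```

With max-pooling, each video's prediction is driven by its highest-scoring frame, and no extra trainable layer is needed.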