How to use binary cross entropy with logits with a binary target and a 3D output

I have a batch size of 5.
My network output is given by the following code: output = F.upsample(per_frame_logits, t, mode='linear')

Shape of output is = torch.Size([5, 2, 64])
Shape of target is = torch.Size([5]) (e.g. [1.0, 0.0, 0.0, 1.0, 1.0])

Then I pass it to the following loss function: loss = F.binary_cross_entropy_with_logits(output, target)

I get the following ValueError:

raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([5])) must be the same as input size (torch.Size([5, 2, 64]))
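
For anyone who wants to reproduce this, here is a minimal sketch of the shape mismatch. The random tensor stands in for the real network output (an assumption, since the actual model is not shown here); only the shapes matter.

```python
import torch
import torch.nn.functional as F

# Stand-in for the network output: [batch_size, nb_classes, frames]
output = torch.randn(5, 2, 64)
# One binary label per video: [batch_size]
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])

try:
    # Input and target shapes differ, so this call is expected to fail
    F.binary_cross_entropy_with_logits(output, target)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(mismatch_error)
```

The loss compares logits and targets element by element, which is why it insists on identical shapes.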

Could you explain your current use case a bit?
It seems your model outputs a batch of 5 samples, each containing the logits (raw, unnormalized scores, not probabilities) for 2 classes for a tensor of length 64.
Also, it seems you are dealing with a multi-label classification, i.e. each sample might contain none, one, or both classes. Is this correct?

If so, what do the targets correspond to? It seems you are just passing a binary target for each sample in the batch, which would point to a vanilla binary classification.

I’m a beginner with PyTorch and am implementing the i3d network for binary classification. I have RGB video (64 frames at a time) as input to the network, and each video has a single label, which is 0 (failure) or 1 (success).

I kept my batch size to 5 just to check if my network or code is working or not. (I would call it a debug run)

Each video has a single label. So I guess I should change the number of network outputs from 2 to 1.

So the output shape would correspond to [batch_size, nb_classes, frames]?
I’m not that familiar with the i3d model, but I assume the temporal (and spatial) dimensions were reduced somehow?

The current output format would correspond to a frame-wise multi-label classification.

In that case, you could use nn.BCEWithLogitsLoss (or nn.BCELoss + sigmoid) with a single output channel. Alternatively you could keep the two output channels and use nn.CrossEntropyLoss.
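
A rough sketch of the two options for one label per video, assuming the temporal dimension has already been reduced so the logits are [batch_size, 1] or [batch_size, 2]. The random logits are placeholders for real model outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])  # one label per video

# Option 1: single output channel + nn.BCEWithLogitsLoss.
# The target must be float and have the same shape as the logits.
logits_one = torch.randn(5, 1)  # placeholder for [batch_size, 1] logits
bce_loss = nn.BCEWithLogitsLoss()(logits_one, target.unsqueeze(1))

# Option 2: two output channels + nn.CrossEntropyLoss.
# The target holds class indices (dtype long) with shape [batch_size].
logits_two = torch.randn(5, 2)  # placeholder for [batch_size, 2] logits
ce_loss = nn.CrossEntropyLoss()(logits_two, target.long())

print(bce_loss.item(), ce_loss.item())
```

Note the different target conventions: float targets with the logits' shape for the BCE variant, integer class indices for the cross-entropy variant.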

Below is the link to the author’s i3d network. In their case they do frame-wise multi-label classification.
Link: i3d Network for charades dataset

I’m using Visual-Tactile dataset and

I3D was designed on the Kinetics dataset, and I didn’t change the default architecture from the above link having file “”.

I’m also new to this. But according to the author, the network takes 64 input frames, so I converted each video to 64 frames. I do not have to do multi-label classification, though.

Probably I need to change the final layer, since I don’t want multi-label classification.

I still have the dimension issue, because the target and output clearly do not have the same shape, which is what the loss function expects.

If you would like to classify each video sequence (64 frames) to a single class (binary classification), your output and target should both have the shape [batch_size, 1].
To achieve this you would need to reduce the model’s output, e.g. using an nn.Linear layer as the final classifier.
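
One way this could look, as a sketch only: a small head that flattens the [batch_size, 2, 64] per-frame logits and maps them to one logit per video. The `VideoHead` class and its parameter names are hypothetical and not part of the i3d code.

```python
import torch
import torch.nn as nn

class VideoHead(nn.Module):
    """Hypothetical classifier head: per-frame logits -> one logit per video."""

    def __init__(self, num_channels=2, num_frames=64):
        super().__init__()
        self.classifier = nn.Linear(num_channels * num_frames, 1)

    def forward(self, per_frame_logits):
        # [batch_size, channels, frames] -> [batch_size, channels * frames]
        flat = per_frame_logits.flatten(start_dim=1)
        return self.classifier(flat)  # [batch_size, 1]

head = VideoHead()
output = torch.randn(5, 2, 64)  # stand-in for the network output
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0]).unsqueeze(1)  # [5, 1]

video_logits = head(output)
loss = nn.BCEWithLogitsLoss()(video_logits, target)
loss.backward()  # gradients reach the new linear layer
print(video_logits.shape, loss.item())
```

With this, output and target both have the shape [batch_size, 1], so the loss call goes through.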

Thank you @ptrblck
Now it’s working. Yes, that’s a possible solution; I tried it and it worked. But instead of disturbing the i3d architecture, I converted the output of the network into [batch_size, 1] by max-pooling with the dimension of 1 and then squeezed the output, which makes my target and output shapes the same.
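
For later readers, a sketch of what such a reduction might look like. The exact pooling call is not given above, so this is one possible reading: it assumes the network emits a single channel, i.e. [batch_size, 1, 64], takes the maximum over the frame dimension, and squeezes the result to [batch_size].

```python
import torch
import torch.nn.functional as F

# Assumed network output with a single class channel: [batch_size, 1, frames]
output = torch.randn(5, 1, 64)
target = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])  # [batch_size]

# Max over the temporal (frame) dimension: [5, 1, 64] -> [5, 1]
pooled = output.max(dim=2).values
# Drop the channel dimension so the shape matches the target: [5, 1] -> [5]
pooled = pooled.squeeze(1)

loss = F.binary_cross_entropy_with_logits(pooled, target)
print(pooled.shape, loss.item())
```

Max-pooling over frames means a video's logit is driven by its single most confident frame; averaging over frames would be an alternative with different behavior.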
