Below is the link to the author's I3D network. In their case they do frame-wise multi-label classification.
[GitHub - piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d) (I3D network for the Charades dataset)
I'm using a Visual-Tactile dataset. I3D was designed on the Kinetics dataset, and I didn't change the default architecture in the file "pytorch_i3d.py" from the link above.
I'm also new to this, but according to the author the input to the network is 64 frames, so I have converted each video to 64 frames.
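For getting a fixed 64-frame clip from videos of varying length, one common approach is to pick evenly spaced frame indices (repeating frames when the video is shorter than 64). A minimal sketch, assuming you have the total frame count of each video (the function name is mine, not from the repo):

```python
import numpy as np

def sample_64_frames(num_frames, clip_len=64):
    """Pick `clip_len` evenly spaced frame indices from a video of
    `num_frames` frames; indices repeat if the video is shorter.
    (Assumption: one reasonable way to get the fixed 64-frame input.)"""
    idx = np.linspace(0, num_frames - 1, clip_len).round().astype(int)
    return idx

print(sample_64_frames(200)[:5])   # first few of the 64 chosen indices
```

This keeps temporal order and covers the whole video, which matters for I3D's temporal convolutions.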
Probably I need to change the final layer, since I don't want multi-label classification.
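Yes: the pretrained head outputs Kinetics' 400 classes, so it has to be swapped for your class count. In the author's code the final layer is a 1x1x1 3D convolution over the pooled features (and I believe `pytorch_i3d.py` wraps the swap in a `replace_logits` helper, but check the file). A stand-alone sketch of what that swap amounts to, with a dummy feature tensor in place of the I3D trunk and an assumed class count:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 5  # assumption: number of classes in your Visual-Tactile dataset

# Stand-in for the I3D trunk output just before the logits layer:
# (batch, 1024 channels, T' downsampled time steps, 1, 1).
features = torch.randn(2, 1024, 8, 1, 1)

# The head is a 1x1x1 Conv3d producing per-frame logits; replacing it
# for a different class count is just swapping this one layer.
logits_layer = nn.Conv3d(1024, NUM_CLASSES, kernel_size=1)

per_frame_logits = logits_layer(features).squeeze(3).squeeze(3)
print(per_frame_logits.shape)  # torch.Size([2, 5, 8])
```

Note the output is still per-frame (one logit vector per time step), which is exactly what causes the shape mismatch for single-label training.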
I still have a dimension issue: the target and output are clearly not the same shape, or not the shape the loss function expects.
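For single-label classification, one way to reconcile the shapes (a sketch under the assumption that the network returns per-frame logits of shape `(N, num_classes, T)`) is to average the logits over time and feed `nn.CrossEntropyLoss` raw `(N, C)` logits with integer class-index targets of shape `(N,)`, not one-hot vectors:

```python
import torch
import torch.nn as nn

# Assumed I3D output: (batch N, classes C, time steps T) per-frame logits.
N, C, T = 4, 5, 8
per_frame_logits = torch.randn(N, C, T)

# Average over time to get one prediction per clip.
clip_logits = per_frame_logits.mean(dim=2)   # (N, C)

# CrossEntropyLoss wants raw logits (N, C) and integer class indices (N,);
# passing one-hot or per-frame targets is a common source of shape errors.
targets = torch.randint(0, C, (N,))          # (N,)
loss = nn.CrossEntropyLoss()(clip_logits, targets)
print(clip_logits.shape, loss.item())
```

Printing `output.shape` and `target.shape` right before the loss call usually makes the mismatch obvious.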