How to make a time dimension of 2 in i3D

I have this code for extracting i3D features:

import torch
from torch.autograd import Variable
from pytorch_i3d import InceptionI3d  # assuming the pytorch_i3d module mentioned later in this thread

def train(train_loader, save_dirtrain):
    i3d = InceptionI3d(400, in_channels=3)
    i3d.train()    # switch to train mode

    for i, (inputs, target, _) in enumerate(train_loader):
        print('underscore', target[0])
        features = []
        input_var = [input.cuda() for input in inputs]
        target_var = target.cuda()
        for frame in inputs:    # iterates over single frames, each of shape (B, C, H, W)
            ip = Variable(torch.from_numpy(frame.numpy()).cuda(), volatile=True)
            ip = torch.unsqueeze(ip, dim=4)    # (B, C, H, W, 1)
            ip = ip.permute(0, 1, 4, 2, 3)     # (B, C, 1, H, W), so T is 1
            print('ip shape', ip.shape)
  , target), features.unsqueeze(2).data.cpu().numpy())

However, I get the following error:

\torch\nn\modules\", line 721, in forward
    return F.avg_pool3d(input, self.kernel_size, self.stride,
RuntimeError: input image (T: 1 H: 7 W: 7) smaller than kernel size (kT: 2 kH: 7 kW: 7)

Below is my custom dataloader:

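import os
import random

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class loadedDataset(Dataset):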
	def __init__(self, root_dir, transform=None):
		self.root_dir = root_dir
		self.transform = transform
		self.classes = sorted(os.listdir(self.root_dir))
		self.count = [len(os.listdir(self.root_dir + '/' + c)) for c in self.classes]
		self.acc_count = [self.count[0]]
		# cumulative number of videos up to and including each class
		for i in range(1, len(self.count)):
			self.acc_count.append(self.acc_count[i-1] + self.count[i])
		# self.acc_count = [self.count[i] + self.acc_count[i-1] for i in range(1, len(self.count))]

	def __len__(self):
		l = np.sum(np.array([len(os.listdir(self.root_dir + '/' + c)) for c in self.classes]))
		return l

	def __getitem__(self, idx):
		# find the first class whose cumulative count exceeds idx
		for i in range(len(self.acc_count)):
			if idx < self.acc_count[i]:
				label = i
				break

		class_path = self.root_dir + '/' + self.classes[label]

		if label:
			file_path = class_path + '/' + sorted(os.listdir(class_path))[idx-self.acc_count[label]]
		else:
			file_path = class_path + '/' + sorted(os.listdir(class_path))[idx]

		_, file_name = os.path.split(file_path)

		frames = []

		# print os.listdir(file_path)
		file_list = sorted(os.listdir(file_path))
		# print file_list

		# v: maximum translation in every step
		v = 2
		offset = 0
		for i, f in enumerate(file_list):
			frame = Image.open(file_path + '/' + f)
			# random horizontal shift, clamped to [-3v, 3v]
			offset += random.randrange(-v, v)
			offset = min(offset, 3 * v)
			offset = max(offset, -3 * v)
			frame = frame.transform(frame.size, Image.AFFINE, (1, 0, offset, 0, 1, 0))
			if self.transform is not None:
				frame = self.transform[0](frame)
			frames.append(frame)

		return frames, label, file_name

I believe I have to use a temporal stride of 2, but I am loading only one frame at a time. I tried changing the batch size to two, but that loads two video sequences at a time instead of just two frames. Any ideas would be appreciated. Thank you.

The issue is raised in the pooling layer as the spatial size of the input activation is too small for the kernel size, not the temporal dimension or batch size. Try to increase the spatial size of your inputs and see if this would fix the issue.

Can I increase the spatial size using torch.Tensor.expand? Is it a good idea?

Thank you~

No, unless your input has a size of 1 in these dimensions, since expand can only enlarge singleton dimensions. Based on the error message that’s not the case, so you might need to e.g. resize the input in case it’s an image.

The shape of my input is 1, 3, 224, 224 (BxCxHxW).
When I unsqueeze at dimension 4 to add the T dimension, the shape becomes 1, 3, 224, 224, 1.
Then I permute it with permute(0, 1, 4, 2, 3), which gives 1, 3, 1, 224, 224, hence the error that T is 1.
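To see which dimension the error message is complaining about, the size check can be sketched in plain Python. This is only an illustration using the standard pooling output-size formula (floor mode, no dilation); the helper names `pool_output_size` and `too_small_dims` are made up for this sketch and are not part of PyTorch.

```python
def pool_output_size(size, kernel, stride=1, padding=0):
    """Output length of one pooling dimension (floor mode, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def too_small_dims(input_thw, kernel_thw):
    """Return the names of the dimensions where the kernel does not fit."""
    names = ("T", "H", "W")
    return [
        name
        for name, size, kernel in zip(names, input_thw, kernel_thw)
        if pool_output_size(size, kernel) < 1
    ]


# Values taken from the error message: input (T: 1 H: 7 W: 7), kernel (kT: 2 kH: 7 kW: 7)
print(too_small_dims((1, 7, 7), (2, 7, 7)))  # ['T']
print(too_small_dims((2, 7, 7), (2, 7, 7)))  # []
```

With the values from the traceback, H and W fit the kernel exactly and only T is too short.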

By default, I don’t have the T dimension. Any ideas on how to generate it? Thank you so much for your support.

I think pytorch_i3d expects video input, but what I have are individual video frames, so my tensors are BCHW and not BCTHW. I will try to stack the frames. My code already resizes each image to 224 x 224 as shown below, but I still get the error:

    transform = (transforms.Compose([

Thanks for the update. Yes, it seems you are missing frames in the temporal dimension. Repeating the single frame should work, but I also wouldn’t know why you would use a model expecting a video input when you only have images.
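A minimal sketch of repeating a single frame along a new temporal dimension, assuming a (B, C, H, W) input as described above. The helper name `frame_to_clip` and the value `num_frames=16` are arbitrary choices for illustration. `expand` is usable here because the freshly inserted T dimension has size 1.

```python
import torch


def frame_to_clip(frame, num_frames):
    """Turn a (B, C, H, W) batch of frames into a (B, C, T, H, W) clip."""
    clip = frame.unsqueeze(2)  # (B, C, 1, H, W): T is inserted directly at dim 2
    # expand only enlarges the size-1 dimension; -1 keeps the other sizes unchanged
    return clip.expand(-1, -1, num_frames, -1, -1)


frame = torch.zeros(1, 3, 224, 224)
clip = frame_to_clip(frame, num_frames=16)
print(clip.shape)  # torch.Size([1, 3, 16, 224, 224])
```

Note that `expand` returns a view without copying data; if a later operation needs contiguous memory, calling `.contiguous()` on the result (or using `repeat` instead) would make a real copy.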

Oh, I more or less solved the error: it happened because I was iterating over each frame instead of over each group of frames that represents one video. To work around not having videos as inputs, I just stacked the frames (inputs). However, if a video has fewer frames than the (2, 7, 7) kernel requires, it still throws an error.
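One possible way to handle videos that are too short is to pad the frame list before stacking, for example by repeating the last frame. The sketch below is plain Python on a list; `pad_frames` and `min_frames` are hypothetical names, and the actual minimum number of frames depends on how much the network downsamples the temporal dimension before the (2, 7, 7) pooling layer.

```python
def pad_frames(frames, min_frames):
    """Repeat the last frame until the list has at least min_frames entries."""
    if not frames:
        raise ValueError("cannot pad an empty frame list")
    missing = max(0, min_frames - len(frames))
    return list(frames) + [frames[-1]] * missing


# Strings stand in for frame tensors here
print(pad_frames(["f0", "f1", "f2"], 5))  # ['f0', 'f1', 'f2', 'f2', 'f2']
print(len(pad_frames(["f0"] * 20, 5)))    # 20
```

After padding, a list of (B, C, H, W) tensors can be combined with `torch.stack(frames, dim=2)` to get a (B, C, T, H, W) input.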

This dataset is preprocessed as frames and the raw videos are not included. I guess that is for ease of download.