Help with mc3_18 model

Hi folks, I’m new to ML and PyTorch, so apologies in advance for some very beginner questions.

I’m trying to extract features from a video using this model, but I’m a bit confused about how to use it.
Reading the docs, it seems the model accepts input as (B, T, C, H, W), so this is what I’ve done to capture frames with OpenCV and convert them into a 5D tensor:

def preprocess_video_frames(video_path, frame_size=(224, 224)):
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ret, frame = capture.read()
        if not ret:
            break

        # Resize each frame to the desired frame_size
        frame = cv2.resize(frame, frame_size)
        # Normalize pixel values to the range [0, 1]
        frame = frame / 255.0
        #frame = np.transpose(frame, (2, 0, 1))
        # Append the preprocessed frame to the list
        frames.append(frame)

    capture.release()
    return frames

def load_and_preprocess_video(video_path):

    frame_list = preprocess_video_frames(video_path)
    video_frames = np.stack(frame_list) # Convert list into a 4D array
    video_tensor = torch.tensor(video_frames, dtype=torch.float32).permute(0, 3, 1, 2)

    # Extract features using the pre-trained I3D model
    with torch.no_grad():
        features = i3d_model(video_tensor)
    return features

I get the following error:

Given groups=1, weight of size [64, 3, 3, 7, 7], expected input[1, 303, 3, 224, 224] to have 3 channels, but got 303 channels instead

This is where I’m a bit confused: I thought this tensor would have the shape (B, T, C, H, W), and hence the 3-channel dimension would be correct.

Could someone shed some light on what I’m doing wrong?

Thank you

nn.Conv3d expects inputs as [N, C, D, H, W] as described in the docs.
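A quick shape check with random tensors and a standalone nn.Conv3d layer (using the same kernel size as in your error message, for illustration) shows the expected layout:

```python
import torch
import torch.nn as nn

# nn.Conv3d expects [N, C, D, H, W]:
# N = batch, C = channels, D = depth (frames), H/W = spatial dims.
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 7, 7))

# 1 clip, 3 channels, 16 frames, 112x112 frames
x = torch.randn(1, 3, 16, 112, 112)
out = conv(x)
print(out.shape)  # torch.Size([1, 64, 14, 106, 106])
```

Note that the channel dimension comes right after the batch dimension, before the temporal one, which is why an input laid out as (T, C, H, W) makes the layer treat T as channels.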

Thanks I was looking at this page: mc3_18 — Torchvision main documentation
I’m afraid I don’t understand well enough how to do this properly; my example seems to be way off. I’ll do some more digging on how to use the model for feature extraction.


From the link:

The inference transforms are available at MC3_18_Weights.KINETICS400_V1.transforms and perform the following preprocessing operations: Accepts batched (B, T, C, H, W) and single (T, C, H, W) video frame torch.Tensor objects.

In that case I might be wrong, and the model's transforms seem to expect a different format than native nn.Conv3d layers do. Could you post a minimal and executable code snippet reproducing the issue using random tensors?