Pretrained C3D model for video classification

Hi all,
I want to extract video features. Is there a pre-trained C3D network available?


Hi, Gkv,

There are more advanced I3D and P3D PyTorch implementations.

P3D: Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, ICCV 2017

I3D: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR 2017


Hi zhanghaoinf,
Thanks for your kind reply. I am very new to PyTorch. Can you provide some links that explain how to use these kinds of implementations in my PyTorch code? (I know how to load models using torchvision.models.)

I have tested P3D-Pytorch. It's pretty simple, and I3D should follow a similar process.

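  1. Pre-process: each frame in a clip is normalized by subtracting the per-channel mean and dividing by the per-channel std.
    An example:

import cv2
import numpy as np

# Per-channel statistics in BGR order, matching what cv2.imread returns
mean = np.array([104 / 255.0, 117 / 255.0, 123 / 255.0], dtype=np.float32)
std = np.array([0.225, 0.224, 0.229], dtype=np.float32)
frame = cv2.imread("a string to image path").astype(np.float32)  # uint8 -> float32 so in-place division works
frame /= 255.0 # [0, 255] -> [0, 1]
frame -= mean # subtract means
frame /= std # div std
frame = frame[:, :, (2, 1, 0)] # convert BGR image to RGB image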

  2. A clip is a stack of frames: each frame is H x W x 3, so the clip is T x H x W x 3.
    Since P3D requires input in the form 3 x T x H x W, perform:

clip = clip.permute(3, 0, 1, 2).contiguous()

  3. Load P3D model
    (P3D, Bottleneck, p3d_model_path can be found in the mentioned github.)

import torch
model = P3D(Bottleneck, [3, 8, 36, 3], modality='RGB')
p3d_weights = torch.load("a string to p3d model path")['state_dict']
model.load_state_dict(p3d_weights)  # copy the pre-trained weights into the model
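
Steps 1 and 2 can also be done for a whole clip at once. The following is only a rough sketch using NumPy, assuming the frames are already decoded as H x W x 3 BGR uint8 arrays (as cv2.imread returns them). The function name frames_to_clip, the 16-frame length and the 160 x 160 size are made up for illustration; the mean and std values are the ones from the example above.

```python
import numpy as np

# Per-channel statistics in BGR order, copied from the example above.
MEAN_BGR = np.array([104, 117, 123], dtype=np.float32) / 255.0
STD_BGR = np.array([0.225, 0.224, 0.229], dtype=np.float32)

def frames_to_clip(frames):
    """Turn T BGR uint8 frames (H x W x 3) into a float32 clip of shape 3 x T x H x W."""
    clip = np.stack(frames).astype(np.float32) / 255.0       # T x H x W x 3, values in [0, 1]
    clip = (clip - MEAN_BGR) / STD_BGR                       # normalize each channel
    clip = clip[..., ::-1]                                   # BGR -> RGB
    return np.ascontiguousarray(clip.transpose(3, 0, 1, 2))  # 3 x T x H x W

# Hypothetical input: 16 random frames standing in for decoded video frames.
frames = [np.random.randint(0, 256, (160, 160, 3), dtype=np.uint8) for _ in range(16)]
clip = frames_to_clip(frames)
print(clip.shape, clip.dtype)
```

If the model expects a batch dimension, torch.from_numpy(clip).unsqueeze(0) should give a 1 x 3 x T x H x W tensor.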


Thanks a lot for your fast reply. One more doubt, what are these modalities (RGB and Flow)?

The RGB modality refers to the frames as captured directly by the camera.
Flow refers to optical flow computed between adjacent frames, which reflects motion information.
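
To make "flow between adjacent frames" a bit more concrete, here is a toy sketch. It is not a real dense optical flow method: it only does a brute-force search for one global translation between two frames, whereas real optical flow gives a separate (dx, dy) vector for every pixel. The function name and the synthetic frames are made up for illustration.

```python
import numpy as np

def estimate_global_shift(prev, curr, max_shift=4):
    """Return the (dy, dx) translation that best maps prev onto curr."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(prev, shift=(dy, dx), axis=(0, 1))
            err = np.mean((shifted - curr) ** 2)   # how badly this shift explains curr
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

# Synthetic pair of grayscale frames: the second is the first moved
# 2 pixels down and 3 pixels left (with wrap-around at the borders).
rng = np.random.default_rng(0)
prev = rng.random((32, 32)).astype(np.float32)
curr = np.roll(prev, shift=(2, -3), axis=(0, 1))

dy, dx = estimate_global_shift(prev, curr)

# A flow field stores the horizontal and vertical displacement as two channels.
flow = np.empty((2,) + prev.shape, dtype=np.float32)
flow[0], flow[1] = dx, dy
print(dy, dx, flow.shape)
```

The two-channel result is why a flow input has 2 channels per frame where an RGB input has 3.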

So in P3D, the flow modality describes the motion in the video and the RGB modality provides the spatial features?

Yes, you are right.
One more thing: optical flow reflects motion patterns to some extent, but it is different from motion descriptors such as dense trajectories. There are several methods for extracting optical flow, such as fast-flow, brox-flow and TVL1. I haven't studied this part much. Here are some references:
[1] FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks [Flow-Net v2], CVPR'17
[2] FlowNet: Learning Optical Flow with Convolutional Networks [Flow-Net v1], ICCV'15
[3] Fast Optical Flow using Dense Inverse Search [fast-flow], ECCV'16
[4] TV-L1 Optical Flow Estimation [TVL1], Image Processing On Line'13
[5] High Accuracy Optical Flow Estimation Based on a Theory for Warping [brox-flow], ECCV'04

Hi zhanghaoinf,

Can the model run on a single GPU easily, or does it require multiple GPUs?

Hi, I ran into the same question as you, and I found a PyTorch implementation of C3D. The link is as follows:


I am interested in replicating the C3D paper by Du Tran. The original repository is in Caffe. Could you please share a PyTorch implementation of it?
I was able to find the following resources:

Repository containing models for video action recognition, including C3D, R2Plus1D, R3D, implemented using PyTorch (0.4.0)
Trained on UCF101 and HMDB51 datasets
Pytorch porting of C3D network, with Sports1M weights
Defining the C3D model as per the paper, not the complete implementation
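
One detail that tends to come up when porting C3D is the input size of the first fully connected layer. Below is a small shape walk-through in plain Python. It assumes the layer configuration as I understand it from the paper: 3x3x3 convolutions with stride 1 and padding 1, pool1 of 1x2x2, the other four pools 2x2x2, and a 16-frame 112 x 112 input. The helper names are made up. Caffe rounds pooling output sizes up, while PyTorch's MaxPool3d rounds down unless ceil_mode=True or extra padding is used, so the two give different sizes after the last pool.

```python
import math

def pool_out(size, kernel, stride, ceil_mode):
    """Output length of an unpadded pooling layer along one axis."""
    steps = (size - kernel) / stride
    return (math.ceil(steps) if ceil_mode else math.floor(steps)) + 1

def c3d_pool5_shape(t=16, h=112, w=112, ceil_mode=True):
    """Walk a T x H x W input through the five C3D pooling layers.

    The convolutions keep the shape unchanged (stride 1, padding 1),
    so only the pooling layers need to be tracked.
    """
    pools = [(1, 2, 2)] + [(2, 2, 2)] * 4   # pool1 keeps the temporal length
    for kt, kh, kw in pools:
        t = pool_out(t, kt, kt, ceil_mode)  # stride equals kernel size
        h = pool_out(h, kh, kh, ceil_mode)
        w = pool_out(w, kw, kw, ceil_mode)
    return t, h, w

t, h, w = c3d_pool5_shape()
fc6_inputs = 512 * t * h * w   # the last conv block has 512 output channels
print((t, h, w), fc6_inputs)
print(c3d_pool5_shape(ceil_mode=False))
```

With rounding up, the last pool leaves a 1 x 4 x 4 map; with rounding down it leaves 1 x 3 x 3, and Caffe-trained fully connected weights would then no longer fit.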


I am aware that developments have been made in the field of action recognition since 2015, but I am specifically interested in the C3D paper by FAIR.
Thanking you (all) in anticipation