torchvision.io video reader returns fewer frames than ffmpeg

Hi,
I’ve noticed that torchvision, as well as other libraries such as skvideo and OpenCV, retrieves fewer frames than ffmpeg.

Video example: https://drive.google.com/file/d/1DIRsDf1SrLOTGbVejoL-PEIlxDPP0LMC/view?usp=sharing

Context:
I’m re-encoding a dataset of YouTube videos to 25.0 FPS via ffmpeg.

The recording (.mkv) contains an audio stream and a video stream.
Both streams have the same duration (according to the metadata reported by ffprobe).
The audio stream’s actual duration matches the one stated in the metadata.

Extracting frames from the Unix command line with ffmpeg yields the expected number of frames (3688 for the example video above):

ffmpeg -i /media/jfm/Slave/SkDataset/videos/cello/1u3yHICR_BU.mkv  %05d.bmp
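To get the reference count without writing thousands of .bmp files, the same check can be scripted through ffprobe. This is only a minimal sketch: the helper names are made up for illustration, and it assumes ffprobe is on the PATH. It relies on ffprobe's `-count_frames` option, which decodes the stream and reports the result as `nb_read_frames`.

```python
import subprocess


def build_ffprobe_count_cmd(path):
    """Build an ffprobe command that decodes the first video stream
    and prints how many frames were actually read."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",               # first video stream only
        "-count_frames",                         # decode and count frames
        "-show_entries", "stream=nb_read_frames",
        "-of", "csv=p=0",                        # print the bare value
        path,
    ]


def parse_frame_count(ffprobe_output):
    """Turn ffprobe's text output (e.g. '3688\\n') into an int."""
    return int(ffprobe_output.strip().split(",")[0])


def count_frames_ffprobe(path):
    """Run ffprobe on `path` (hypothetical helper; needs ffprobe installed)."""
    result = subprocess.run(
        build_ffprobe_count_cmd(path),
        capture_output=True, text=True, check=True,
    )
    return parse_frame_count(result.stdout)
```

Calling `count_frames_ffprobe(PATH)` should then give a number that can be compared directly against what each Python reader returns.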

Extracting frames via imageio gives the same count as command-line ffmpeg.

from imageio import mimread
w = mimread(PATH, memtest=False)  # list of frames; len(w) is the frame count

Other libraries, such as the torchvision.io video reader, skvideo, or even OpenCV’s VideoCapture, return fewer frames than expected. I’ve tried debugging the torchvision video reader to see whether it skips frames with negative timestamps, but that does not seem to be the case, although the default seek point is 0. In any case, playing the video back shows no black frames, which suggests (I think) that the video stream contains only non-negative timestamps. That also makes sense, since the whole video has been re-encoded.

import skvideo.io
import skvideo.datasets
videodata = skvideo.io.vread(PATH)
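To make the comparison reproducible, the per-library counts can be collected in one place. The sketch below is an illustration rather than the exact code used in this thread: the function names are hypothetical, each library is imported lazily so that missing packages only fail when their reader is called, and it assumes a torchvision version that still ships `torchvision.io.read_video`.

```python
def count_opencv(path):
    import cv2  # imported lazily; requires opencv-python
    cap = cv2.VideoCapture(path)
    n = 0
    while True:
        ok, _frame = cap.read()
        if not ok:          # read() returns False when no frame is delivered
            break
        n += 1
    cap.release()
    return n


def count_torchvision(path):
    from torchvision.io import read_video
    video, _audio, _info = read_video(path, pts_unit="sec")
    return int(video.shape[0])  # tensor is (T, H, W, C)


def count_skvideo(path):
    import skvideo.io
    return int(skvideo.io.vread(path).shape[0])


def count_imageio(path):
    import imageio
    return len(imageio.mimread(path, memtest=False))


READERS = {
    "imageio": count_imageio,
    "torchvision": count_torchvision,
    "skvideo": count_skvideo,
    "opencv": count_opencv,
}


def missing_frames(counts, reference):
    """How many frames each reader is short of the reference reader."""
    ref = counts[reference]
    return {name: ref - n for name, n in counts.items() if name != reference}


def compare(path, reference="imageio"):
    counts = {name: fn(path) for name, fn in READERS.items()}
    return counts, missing_frames(counts, reference)
```

`compare(PATH)` returns the raw counts plus the shortfall of each reader relative to imageio, which is the reader that agrees with command-line ffmpeg here. Note that all of these load the whole video into memory or decode it fully, so it can be slow on long files.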

Any idea? An example where this issue happens is given above.
btw I think @fmassa is developing the video reader.

Unfortunately, I can’t comment on the mismatch of the extracted frames, but I don’t think it’s an “OpenCV vs. FFmpeg” issue, since OpenCV can use FFmpeg.
That being said, it seems as if torchvision.io uses pyav as the default reader (line of code), which wraps FFmpeg as seen here.
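One way to narrow this down might be to bypass torchvision and decode with PyAV directly, which would show whether the frames are already missing at the PyAV level. The following is a rough sketch under that assumption; the function names are made up for illustration and it requires the `av` package.

```python
def count_with_pyav(path):
    """Return (reported, decoded) frame counts for the first video stream."""
    import av  # PyAV, imported lazily

    with av.open(path) as container:
        stream = container.streams.video[0]
        reported = stream.frames  # from container metadata; 0 if unknown
        decoded = sum(1 for _ in container.decode(video=0))
    return reported, decoded


def describe_counts(reported, decoded):
    """Format the two counts for a quick side-by-side look."""
    if reported == 0:
        return f"decoded {decoded} frames; container reports no frame count"
    return (
        f"decoded {decoded} frames; container reports {reported} "
        f"(difference {reported - decoded})"
    )
```

If `count_with_pyav(PATH)` already decodes fewer frames than command-line ffmpeg writes out, the difference comes from PyAV or the way it drives FFmpeg rather than from torchvision's wrapper code.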

Hi,
I know (I’ve dug into the torchvision code, but not into PyAV’s).
It seems all the libraries use ffmpeg as a backend. That’s why my guess is that the problem is related to timestamps or to the way the libraries stop decoding. I noticed that the frame counter of the library that extracts the frames correctly reports the same number of frames as the libraries that come up short. So I think the other libraries internally stop decoding once they reach that “last” frame.
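The timestamp hypothesis can be checked on the decoded timestamps themselves. Below is a small sketch that takes a list of presentation times in seconds (for example, collected from `frame.time` while decoding with PyAV, which is an assumption about how they are obtained) and reports negative timestamps and gaps larger than one frame period. The function name and the returned keys are hypothetical, and a constant frame rate of 25 FPS is assumed because the dataset was re-encoded to that rate.

```python
def summarize_timestamps(timestamps, fps=25.0):
    """Summarize presentation timestamps (seconds) of decoded frames.

    Reports how many are negative and where consecutive frames are
    further apart than one frame period at the assumed constant `fps`.
    """
    ordered = sorted(timestamps)
    step = 1.0 / fps
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        # Number of frame slots skipped between two neighbouring frames.
        skipped = round((cur - prev) / step) - 1
        if skipped > 0:
            gaps.append((prev, cur, skipped))
    return {
        "frames": len(ordered),
        "negative": sum(1 for t in ordered if t < 0),
        "first": ordered[0] if ordered else None,
        "last": ordered[-1] if ordered else None,
        "gaps": gaps,
    }
```

If the summary shows no negative timestamps and no gaps, but the last timestamp falls well short of the stream duration, that would point towards the reader stopping early rather than skipping frames in the middle.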

Anyway thanks for having a look :slight_smile:

Thanks for the clarification, that’s interesting.
In other words: if you are directly using ffmpeg in the terminal, you are getting the expected number of frames, but using a wrapper (via torchvision.io) you get a reduced number of frames?

Yep, that’s the idea. I opened an issue on GitHub, but I don’t really have time to dig into this. Thanks anyway :slight_smile: