I posted an issue regarding the seek method of torchaudio.io.StreamReader in its GitHub repo. It seems that the repo is no longer actively monitored, so I would like to post the link and content here to get more attention. (GitHub link: The seek functionality of StreamReader on the video stream does not return the correct frame if the start_time_stamp of the video stream is nonzero. · Issue #3824 · pytorch/audio · GitHub)
It would be highly appreciated if this issue could be fixed in a future release of torchaudio.
Content of the issue
The issue is that the seek functionality of StreamReader does not seek to the correct position when the start_time_stamp of the video stream is non-zero. To reproduce the bug, use the code below and put the two attached videos in the same folder as the test script.
from typing import Dict

import torch
from torchaudio.io import StreamReader


class TorchaudioWrapper:
    # Maps the codec name reported by StreamReader to the CPU decoder to use.
    cpu_decoders: Dict[str, str] = {
        "av1": "libaom-av1",
        "hevc": "hevc",
        "h264": "h264",
        "prores": "prores",
    }

    def __init__(self, video_path: str, device: str = 'cpu') -> None:
        self.video = StreamReader(video_path)
        self.src_stream_info = self.video.get_src_stream_info(self.video.default_video_stream)
        self.src_format: str = self.src_stream_info.format
        self.codec = self.src_stream_info.codec
        if device == 'cpu':
            config = {
                'buffer_chunk_size': 50,
                'stream_index': None,
                'decoder': self.cpu_decoders[self.codec],
                "decoder_option": {"threads": str(0)},
                'filter_desc': "scale=sws_flags=accurate_rnd+full_chroma_int:dst_format=rgb24,format=rgb24",
            }
        else:
            raise ValueError(f'Invalid device: {device}. Torchaudio backend only supports "cpu".')
        # One frame per chunk, so each next() call yields a single frame.
        self.video.add_video_stream(1, **config)
        self.stream = self.video.stream()

    @property
    def fps(self) -> float:
        fps = self.src_stream_info.frame_rate
        return fps

    def __len__(self) -> int:
        num_frames = self.src_stream_info.num_frames
        return num_frames

    def seek(self, time_s: float, mode: str = 'precise') -> None:
        self.video.seek(time_s, mode)
        self.stream = self.video.stream()  # reset stream after seeking
        return

    def _iterate_stream(self):
        (frame, ) = next(self.stream)
        frame = torch.squeeze(frame, 0)
        return frame

    def __next__(self) -> torch.Tensor:
        frame = self._iterate_stream()
        return frame

    def seek_and_get(self, frame_index: int) -> torch.Tensor:
        start = frame_index / self.fps
        self.seek(start, 'precise')
        frame = self._iterate_stream()
        return frame


def main():
    # Test a video with zero start-time
    video_path = 'source.mp4'
    vr1 = TorchaudioWrapper(video_path)
    vr2 = TorchaudioWrapper(video_path)
    print(vr1.fps)
    for i in range(len(vr1)):
        frame2 = next(vr2)  # sequential decoding (reference)
        frame1 = vr1.seek_and_get(i)  # seek to the same frame by timestamp
        assert torch.allclose(frame1, frame2), f'Test failed at frame {i}!'
    print('Test succeeded!')

    # Test a video with non-zero start-time
    video_path = 'test.mp4'
    vr1 = TorchaudioWrapper(video_path)
    vr2 = TorchaudioWrapper(video_path)
    print(vr1.fps)
    for i in range(len(vr1)):
        frame2 = next(vr2)
        frame1 = vr1.seek_and_get(i)
        assert torch.allclose(frame1, frame2), f'Test failed at frame {i}!'
    print('Test succeeded!')


if __name__ == '__main__':
    main()
The start_time_stamp of source.mp4 is zero, and seeking to time_stamp = (frame_index / fps) returns the correct frame, so the first part of the test succeeds. We then generate test.mp4 from source.mp4 with the command ffmpeg -i tmp/test_bug/source.mp4 -output_ts_offset 0.033333 -c copy tmp/test_bug/test.mp4. test.mp4 is exactly the same as source.mp4 except that it has a non-zero start_time_stamp. However, the seek method of StreamReader returns a different frame for test.mp4, so the second part of the test script fails.
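To confirm that the two files really differ only in their start time, the value can be read with ffprobe. The snippet below is a rough sketch, not part of the original issue: the helper names build_ffprobe_cmd, parse_start_time and probe_start_time are made up for illustration, and it assumes ffprobe is installed and on the PATH.

```python
import subprocess
from typing import List


def build_ffprobe_cmd(video_path: str) -> List[str]:
    """Build an ffprobe command that prints only the first video stream's start_time."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=start_time",
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]


def parse_start_time(ffprobe_output: str) -> float:
    """Parse ffprobe's output; treat a missing or 'N/A' value as 0.0 (an assumption)."""
    text = ffprobe_output.strip()
    if not text or text == "N/A":
        return 0.0
    return float(text)


def probe_start_time(video_path: str) -> float:
    """Run ffprobe (must be installed) and return the stream start time in seconds."""
    result = subprocess.run(
        build_ffprobe_cmd(video_path),
        capture_output=True, text=True, check=True,
    )
    return parse_start_time(result.stdout)
```

Calling probe_start_time('source.mp4') and probe_start_time('test.mp4') should then show whether the offset passed to -output_ts_offset ended up in the stream's start time.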
I believe this behavior is not desired, because a typical user of a video will not be aware of the value of its start_time_stamp. The same call to the seek method may return two different frames, causing unexpected misalignment problems. I tried another video reader, Decord, and it handles this case correctly: it always returns the same frame whether the start_time_stamp is zero or not. However, Decord does not handle color transforms correctly and does not support GPU decoders.
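Until this is fixed, one possible workaround is to add the stream's start time to the requested timestamp before seeking. This is only a sketch: it assumes that seek interprets its argument as an absolute presentation timestamp, which is what the failing test suggests but which I have not confirmed in the torchaudio source. The helper name frame_index_to_seek_time and its start_time argument are hypothetical, and the start time has to be obtained separately (for example with ffprobe).

```python
def frame_index_to_seek_time(frame_index: int, fps: float, start_time: float = 0.0) -> float:
    """Convert a frame index to a seek timestamp in seconds.

    start_time is the stream's start timestamp in seconds. Adding it is an
    assumption about how seek interprets timestamps, not documented behavior.
    """
    if fps <= 0:
        raise ValueError(f"fps must be positive, got {fps}")
    if frame_index < 0:
        raise ValueError(f"frame_index must be non-negative, got {frame_index}")
    return start_time + frame_index / fps
```

In the wrapper above, seek_and_get would then call self.seek(frame_index_to_seek_time(frame_index, self.fps, start_time), 'precise') instead of using frame_index / self.fps directly. With start_time left at 0.0 the result is the same as the original computation.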
It would be highly appreciated if this issue could be fixed in a future release of torchaudio.