How to accelerate resizing in video data processing

Hello, I am working on a video-related project, and I have found that video data processing may be the main runtime bottleneck.
With a batch size of 128, where each sample is a sequence of 32 images, I have to resize 128 x 32 = 4096 images per batch. PIL.Image and OpenCV do not support batched resizing, so resizing frame by frame is very slow. torch.nn.functional.interpolate works on batches, but I need to convert the images to float32 first. As I understand it, resizing does not need such high precision; uint8 or float16 should be enough.
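To make the batched option concrete, here is a minimal sketch of resizing a whole uint8 batch with one torch.nn.functional.interpolate call. The helper name resize_uint8_batch and the (N, H, W, C) layout are assumptions for illustration, not part of any library. It computes in float16 only when the tensor is on a CUDA device and falls back to float32 on CPU, since half-precision interpolation on CPU may not be available in every PyTorch version.

```python
from typing import Tuple

import torch
import torch.nn.functional as F


def resize_uint8_batch(frames: torch.Tensor, size: Tuple[int, int]) -> torch.Tensor:
    """Resize a uint8 batch of shape (N, H, W, C) to (N, size[0], size[1], C).

    Hypothetical helper: the name and the channels-last layout are assumptions.
    """
    # Use float16 on GPU to halve the memory of the temporary float tensor;
    # fall back to float32 on CPU (assumption: CPU half support varies by version).
    compute_dtype = torch.float16 if frames.is_cuda else torch.float32

    # interpolate expects channels-first input: (N, C, H, W).
    x = frames.permute(0, 3, 1, 2).to(compute_dtype)
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)

    # Round and clamp before casting back so values stay in the uint8 range.
    x = x.round().clamp(0, 255).to(torch.uint8)
    return x.permute(0, 2, 3, 1).contiguous()


# Usage: flatten (batch, sequence) into one dimension, resize, then restore it.
clips = torch.randint(0, 256, (2, 4, 64, 48, 3), dtype=torch.uint8)
flat = clips.reshape(-1, 64, 48, 3)
resized = resize_uint8_batch(flat, (32, 24)).reshape(2, 4, 32, 24, 3)
```

With this pattern the DataLoader workers can hand over raw uint8 frames, and the float conversion happens only once per batch, ideally after the tensor has been moved to the GPU.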

If I increase num_workers, there is not enough shared memory (DDP mode, 8 x 10 = 80 workers). My machine has 100 GB of shm, and training already uses 70-90 GB of it.
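As rough arithmetic on the shared-memory pressure, the sketch below estimates the size of one batch in uint8 versus float32. The 256 x 256 RGB frame size is a hypothetical value (the actual frame size is not stated above), and batch_bytes is an illustrative helper, not a library function.

```python
def batch_bytes(batch_size, seq_len, height, width, channels=3, bytes_per_value=1):
    """Return the raw size in bytes of one batch of video frames."""
    return batch_size * seq_len * height * width * channels * bytes_per_value


GIB = 2 ** 30

# Hypothetical frame size: 256 x 256 RGB (an assumption, not from the post).
uint8_gib = batch_bytes(128, 32, 256, 256, bytes_per_value=1) / GIB
float32_gib = batch_bytes(128, 32, 256, 256, bytes_per_value=4) / GIB

print(uint8_gib)    # 0.75
print(float32_gib)  # 3.0
```

Every batch that the workers have prepared but the training loop has not yet consumed occupies shared memory, so the total scales with the number of batches in flight. Under these assumptions, passing uint8 frames out of the workers and converting to float afterwards would cut the per-batch footprint by a factor of four.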

Any suggestions? Thanks!

Since your server seems fairly large and you are already using DDP, you could take a look at NVIDIA DALI, which provides video operators.