How to combine an image tensor (4D) and a depth tensor (4D) to create a 5D tensor [batch size, channels, depth, height, width]?

Thank you for your response.

The idea is better illustrated in the figure below, which shows only the R channel:

The depth values are quantized and then used to convert the RGB images into RGB-D voxel representations. Currently, the depth tensor holds a single depth value per pixel. The values range from 0 to 400, so they can be quantized into 4 intervals. Do you have any ideas or advice on how I can use the depth tensor to produce this kind of input representation? Thanks again.
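For reference, here is a minimal sketch of one way this could be done in PyTorch. It rests on several assumptions, since the figure is not reproduced here: the image tensor is [B, C, H, W], the depth tensor is [B, 1, H, W], the 4 intervals have equal width over 0 to 400, and each pixel's color goes into the depth slice matching its quantized depth, with zeros in the other slices. The function name `rgb_to_voxels` and its parameters are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn.functional as F


def rgb_to_voxels(rgb, depth, num_bins=4, max_depth=400.0):
    """Scatter each pixel's color into the depth slice given by its quantized depth.

    rgb:   [B, C, H, W] image tensor
    depth: [B, 1, H, W] depth tensor, values assumed to lie in [0, max_depth]
    Returns a [B, C, num_bins, H, W] tensor.
    """
    # Equal-width quantization (an assumption): 0-100, 100-200, 200-300, 300-400.
    bin_width = max_depth / num_bins
    # Truncate to a bin index; clamp so that depth == max_depth falls in the last bin.
    idx = torch.clamp((depth / bin_width).long(), 0, num_bins - 1)  # [B, 1, H, W]

    # One-hot encode the bin index along a new depth axis.
    one_hot = F.one_hot(idx.squeeze(1), num_classes=num_bins)       # [B, H, W, D]
    mask = one_hot.permute(0, 3, 1, 2).unsqueeze(1).to(rgb.dtype)   # [B, 1, D, H, W]

    # Broadcast: [B, C, 1, H, W] * [B, 1, D, H, W] -> [B, C, D, H, W]
    return rgb.unsqueeze(2) * mask


# Usage with random data standing in for a real batch
rgb = torch.rand(2, 3, 8, 8)
depth = torch.randint(0, 401, (2, 1, 8, 8)).float()
voxels = rgb_to_voxels(rgb, depth)
print(voxels.shape)
```

With this layout exactly one depth slice is non-zero per pixel, so summing over the depth axis recovers the original image. If the figure intends something different (for example, filling every slice up to the pixel's depth), only the construction of `mask` would need to change.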