How is minibatch sampling implemented for off-policy RL with vectorized envs (assume the envs are stepped sequentially)? Let's say there are 5 envs. Every step the agent gets 5 transitions and pushes them to the buffer. During critic and policy updates, do we randomly sample N individual transitions (batch size N), or do we randomly sample N timesteps, which would effectively give us N*5 transitions?
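Ideally you sample N individual transitions at random, as long as they are all stored in the replay buffer. Which env or timestep a transition came from does not matter.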
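To make that concrete, here is a minimal sketch of a flat buffer, assuming NumPy and a hypothetical `ReplayBuffer` class (the names `add_batch` and `sample`, the dimensions, and the random data are all made up for illustration, not taken from any particular library). Each vectorized step writes `num_envs` rows, and sampling draws `batch_size` row indices uniformly, so a batch of 64 contains 64 transitions, not 64*5.

```python
import numpy as np


class ReplayBuffer:
    """Hypothetical flat ring buffer: one row per transition from any env."""

    def __init__(self, capacity, obs_dim, act_dim, seed=0):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros(capacity, dtype=np.float32)
        self.pos = 0    # next write position
        self.size = 0   # number of valid rows
        self.rng = np.random.default_rng(seed)

    def add_batch(self, obs, act, rew, next_obs, done):
        """Store one vectorized step; every array has leading dim num_envs.

        Assumes num_envs <= capacity.
        """
        n = len(rew)
        idx = (self.pos + np.arange(n)) % self.capacity  # wrap around
        self.obs[idx] = obs
        self.act[idx] = act
        self.rew[idx] = rew
        self.next_obs[idx] = next_obs
        self.done[idx] = done
        self.pos = (self.pos + n) % self.capacity
        self.size = min(self.size + n, self.capacity)

    def sample(self, batch_size):
        """Draw batch_size transitions uniformly (with replacement)."""
        idx = self.rng.integers(0, self.size, size=batch_size)
        return {
            "obs": self.obs[idx],
            "act": self.act[idx],
            "rew": self.rew[idx],
            "next_obs": self.next_obs[idx],
            "done": self.done[idx],
        }


# Usage with 5 envs and made-up random transitions.
num_envs, obs_dim, act_dim = 5, 3, 2
buf = ReplayBuffer(capacity=1000, obs_dim=obs_dim, act_dim=act_dim)
data_rng = np.random.default_rng(1)
for t in range(20):
    buf.add_batch(
        obs=data_rng.normal(size=(num_envs, obs_dim)),
        act=data_rng.normal(size=(num_envs, act_dim)),
        rew=data_rng.normal(size=num_envs),
        next_obs=data_rng.normal(size=(num_envs, obs_dim)),
        done=np.zeros(num_envs),
    )
batch = buf.sample(64)  # 64 transitions, mixed across envs and timesteps
```

With this layout the number of envs only changes how fast the buffer fills, not the size of the update batch.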
The order in which you feed the data really makes no difference to the model, unless your model carries a memory state (for example, a recurrent network). In that case you would definitely need to feed the data sequentially.
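For the memory-state case, one possible approach (a sketch under assumptions, not the only way to do it) is to keep the data time-major with shape `(T, num_envs, dim)` and sample contiguous windows, each taken from a single env so the timesteps stay in order. The function name `sample_sequences` and its arguments are hypothetical, and the sketch ignores episode boundaries and ring-buffer wraparound, which a real implementation would have to handle.

```python
import numpy as np


def sample_sequences(data, filled, seq_len, batch_size, rng):
    """Sample contiguous per-env windows from time-major storage.

    data:   array of shape (T, num_envs, dim)
    filled: number of valid timesteps at the start of `data`
    Returns an array of shape (batch_size, seq_len, dim).
    Assumption: windows may cross episode boundaries; this sketch
    does not mask or reset anything at `done` flags.
    """
    num_envs = data.shape[1]
    # Random start time and env for each sequence in the batch.
    starts = rng.integers(0, filled - seq_len + 1, size=batch_size)
    envs = rng.integers(0, num_envs, size=batch_size)
    # (batch_size, seq_len) grid of consecutive time indices.
    t_idx = starts[:, None] + np.arange(seq_len)[None, :]
    # Broadcasting envs over the time axis keeps each row in one env.
    return data[t_idx, envs[:, None]]


# Usage: entry (t, e) holds t * num_envs + e, so order is easy to check.
T, num_envs = 50, 5
data = np.arange(T * num_envs, dtype=np.float32).reshape(T, num_envs, 1)
rng = np.random.default_rng(0)
seqs = sample_sequences(data, filled=T, seq_len=8, batch_size=4, rng=rng)
```

Each sampled row steps through consecutive timesteps of one env, which is what a model with a memory state needs; the batch itself is still drawn at random.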