Tensor creation slow on CPU (from replay buffer)

Sample indices directly rather than sampling the stored transitions themselves, keep the buffer in NumPy arrays, or store the data as tensors in the first place. Please see this similar question and its answer: How to make the replay buffer more efficient?
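
As a rough illustration of the first two suggestions, here is a minimal sketch of a buffer that preallocates NumPy arrays and samples a batch by drawing indices. The class name `ReplayBuffer`, its fields, and the transition layout (observation, action, reward, next observation, done flag) are assumptions made for this example and are not taken from the question.

```python
import numpy as np


class ReplayBuffer:
    """Hypothetical ring buffer backed by preallocated NumPy arrays."""

    def __init__(self, capacity, obs_dim, seed=None):
        self.capacity = capacity
        # Allocate storage once, so adding a transition is only an array write.
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.pos = 0   # next slot to write
        self.size = 0  # number of valid transitions stored
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs, done):
        i = self.pos
        self.obs[i] = obs
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_obs[i] = next_obs
        self.dones[i] = done
        # Wrap around and overwrite the oldest entry once the buffer is full.
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Draw indices (with replacement) and gather each field with one
        # fancy-indexing call. The buffer must be non-empty.
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (
            self.obs[idx],
            self.actions[idx],
            self.rewards[idx],
            self.next_obs[idx],
            self.dones[idx],
        )


buf = ReplayBuffer(capacity=1000, obs_dim=4, seed=0)
for step in range(50):
    o = np.full(4, step, dtype=np.float32)
    buf.add(o, action=step % 2, reward=1.0, next_obs=o + 1, done=False)

obs_b, act_b, rew_b, next_obs_b, done_b = buf.sample(32)
```

Each returned batch is a single contiguous array, so it can be handed to `torch.from_numpy`, which wraps the array without copying it. That is usually much cheaper than building a tensor from a Python list of individual transitions. The third suggestion, storing tensors directly, would follow the same layout with preallocated tensors in place of the NumPy arrays.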