Tensor creation slow on CPU (from replay buffer)

Hello, I’m implementing Deep Q-learning and my code is slow due to the creation of Tensors from the replay buffer. Here’s how it goes:

I maintain a deque with a size of 10’000 and sample a batch from it every time I want to do a backward pass. The following line is really slow:

curr_graphs = torch.Tensor(list(state(*zip(*xp_samples.curr_state)).graph))

which I decomposed to see what was really taking the time:

zipped = zip(*xp_samples.curr_state)
new_s = state(*zipped)
listt = list(new_s.graph)
curr_graphs = torch.Tensor(listt)

This showed that the last line, i.e. the tensor creation, is taking almost all the computation. For context, xp_samples and curr_state are named tuples; in this snippet I unpack, zip, and unpack again to group the data by field name from curr_state.

My guess is that it has to assemble the data by chasing a lot of pointers through memory to create the Tensor, and so it loses time moving things around. What would be the fastest way to create a tensor from data sampled from a buffer that I maintain? Should I preallocate the contents of the deque so that it is contiguous in memory? I feel like that won’t speed up the process.
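For reference, here is a minimal standalone reproduction of what I think is happening (the graph shape and batch size are made up; in my real code each graph comes out of the named tuple as an array):

```python
import time

import numpy as np
import torch

# Hypothetical stand-in for one sampled batch: a list of per-sample arrays,
# like what list(state(*zip(*xp_samples.curr_state)).graph) produces.
batch = [np.random.rand(64, 64).astype(np.float32) for _ in range(256)]

t0 = time.perf_counter()
slow = torch.Tensor(batch)  # iterates over the Python list, copying element by element
t1 = time.perf_counter()
# np.stack makes one contiguous array; from_numpy wraps it without another copy
fast = torch.from_numpy(np.stack(batch))
t2 = time.perf_counter()

print(f"torch.Tensor(list):    {t1 - t0:.4f}s")
print(f"np.stack + from_numpy: {t2 - t1:.4f}s")
```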

Here are the details of the deque if that’s relevant:

class ReplayBuffer:
    def __init__(self, maxlen):
        self.buffer = deque(maxlen=maxlen)

    def add(self, new_xp):
        self.buffer.append(new_xp)

    def sample(self, batch_size):
        xps = random.choices(self.buffer, k=batch_size)
        return xp(*zip(*xps))

Sample indices directly, use NumPy arrays, or store the experiences as tensors in the first place. Creating a tensor from a long list of Python objects forces PyTorch to iterate over the list and copy each element individually, which is why that last line dominates. (Also note that random access into a deque is O(n), so random.choices over a large deque is itself slow.) Please see this similar question and its answer: How to make the replay buffer more efficient?
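As a sketch of the preallocated-array approach (assuming each state graph is a fixed-shape float32 array; the shapes and the single-field buffer are placeholders, so adapt it to your full experience tuple):

```python
import numpy as np
import torch

class ArrayReplayBuffer:
    """Replay buffer backed by one preallocated, contiguous NumPy array.

    Sketch only: stores just the graphs; a real buffer would hold
    parallel arrays for actions, rewards, next states, etc.
    """

    def __init__(self, maxlen, graph_shape):
        self.graphs = np.empty((maxlen,) + graph_shape, dtype=np.float32)
        self.maxlen = maxlen
        self.idx = 0   # next write position (circular)
        self.size = 0  # number of valid entries

    def add(self, graph):
        self.graphs[self.idx] = graph
        self.idx = (self.idx + 1) % self.maxlen
        self.size = min(self.size + 1, self.maxlen)

    def sample(self, batch_size):
        # Sample indices with replacement (like random.choices), then do one
        # fancy-index copy into a contiguous array; from_numpy adds no copy.
        idxs = np.random.randint(0, self.size, size=batch_size)
        return torch.from_numpy(self.graphs[idxs])

buf = ArrayReplayBuffer(maxlen=10_000, graph_shape=(64, 64))
for _ in range(100):
    buf.add(np.random.rand(64, 64))
curr_graphs = buf.sample(32)  # float32 tensor, shape (32, 64, 64)
```

This keeps the stored data contiguous and moves the per-element work out of the sampling hot path.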