Faster way to concat an autoregressive model's output tensors

Hi, I’m trying to implement an autoregressive model that is a variant of a memory-augmented neural network. (More precisely, I’m implementing the DKVMN model from this paper.) Given an input sequence of length 100, the model runs its read and write process 100 times and produces 100 outputs in order. To compute the BCE loss in a single call, I concatenate all 100 outputs into one tensor (turning 100 tensors of shape (batch_size, 1) into one of shape (batch_size, 100, 1)), and this seems to be a computational bottleneck. I tried two approaches: 1) saving the 100 tensors in a list and concatenating them afterwards, and 2) concatenating after each output, 99 times in total; both took a long time. Maybe this is just inherent to the model (and I simply have to wait), but other RNN or LSTM models don’t take this much time to train (maybe because they are highly optimized?). Is there any tip for speeding up training?
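For reference, a minimal sketch of the two approaches I described (`model_step` is a hypothetical stand-in for one read/write step of the network; the shapes follow the description above):

```python
import torch

batch_size, seq_len = 32, 100

def model_step(x_t):
    # Placeholder for the actual DKVMN read/write step;
    # returns a (batch_size, 1) prediction.
    return torch.sigmoid(torch.randn(batch_size, 1))

x = torch.randn(batch_size, seq_len)

# Approach 1: collect outputs in a list, stack once at the end.
outputs = [model_step(x[:, t]) for t in range(seq_len)]
preds = torch.stack(outputs, dim=1)        # (batch_size, seq_len, 1)

# Approach 2: grow the tensor with a cat at every step (99 cats total).
preds2 = model_step(x[:, 0]).unsqueeze(1)  # (batch_size, 1, 1)
for t in range(1, seq_len):
    preds2 = torch.cat([preds2, model_step(x[:, t]).unsqueeze(1)], dim=1)
```

I would expect approach 1 to be cheaper in principle, since approach 2 re-copies the growing tensor at every step, but in practice both were slow for me.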
Thanks in advance.

cat/stack is pretty fast; you’re likely measuring asynchronous CUDA calls incorrectly. Try the built-in profiler. Yes, RNNs are optimized (C++ kernels). The JIT may help you somewhat, but in general recurrent code is not fast.
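A minimal sketch of measuring this correctly, assuming `run_model` is the hypothetical forward pass you want to time. CUDA kernels launch asynchronously, so without a synchronize the timer only captures kernel launch overhead, and the real cost gets blamed on whatever op forces the first synchronization (often the final cat/stack):

```python
import torch

torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
out = run_model()            # hypothetical forward pass under test
end.record()

torch.cuda.synchronize()     # wait for all queued kernels to finish
print(f"elapsed: {start.elapsed_time(end):.1f} ms")

# The built-in profiler gives a per-op breakdown instead of one number:
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA]
) as prof:
    out = run_model()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

If cat/stack barely registers in the profiler table, the time is going into the per-step read/write ops themselves, not the concatenation.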