Speed issues, data in tensors?

Hi guys, three weeks ago I ported my “Dueling DDQN” system from Keras (with the Theano/TensorFlow backend) to PyTorch. The reason: speed issues. I was hoping to get faster run times with PyTorch. Currently I'm only using my CPU, so speed really matters to me.

I’m using simple test data:

  • 2000 episodes over 2000 data points
  • on every data point I do one training step on a random sample of batch size 32 with 5 inputs each (32x5)

I would assume this should be really fast… but it is not:

Keras/Theano backend: 3.38 s/episode << fast
Keras/TensorFlow backend: 12.71 s/episode << slower
PyTorch: 15.13 s/episode << slowest

Maybe one problem is that the data is currently spread out and copied between the deque, NumPy arrays, and PyTorch tensors? Does it make sense to copy ALL of the data (2000 x 5) into one tensor and take slices from it?

1.) So my first question is: which data has to live in tensors and which doesn't, and why?

2.) I already asked this on Stack Overflow, but got no answer:

# How does a (py)torch DDQN know which action it is updating?

The code is quite complex, but I can post parts of it if that helps to solve this. Just tell me what else you need ;)…

Many thanks!

DDQN module:

##################################################################################
import torch
import torch.nn as nn
##################################################################################

class DQNTorch(nn.Module):

    #////////////////////////////////////////////////

    def __init__(self, num_inputs, num_outputs, num_neurons, dueling):

        #////////////////////////////////////////////////

        super(DQNTorch, self).__init__()

        #////////////////////////////////////////////////

        self.feature   = None
        self.value     = None
        self.advantage = None
        self.dueling   = dueling

        #////////////////////////////////////////////////

        if(self.dueling):

            #////////////////////////////////////////////////
            # dueling, add in forward()
            #////////////////////////////////////////////////

            self.feature = nn.Sequential(
                nn.Linear(num_inputs, num_neurons),
                nn.ReLU(),
                nn.Linear(num_neurons, num_neurons),
                nn.ReLU()
                )

            self.value = nn.Sequential(
                nn.Linear(num_neurons, num_neurons),
                nn.ReLU(),
                nn.Linear(num_neurons, 1)
                )

            self.advantage = nn.Sequential(
                nn.Linear(num_neurons, num_neurons),
                nn.ReLU(),
                nn.Linear(num_neurons, num_outputs)
                )

        #////////////////////////////////////////////////

        else:

            #////////////////////////////////////////////////
            # 1 more layer as output
            #////////////////////////////////////////////////

            self.feature = nn.Sequential(
                nn.Linear(num_inputs, num_neurons),
                nn.ReLU(),
                nn.Linear(num_neurons, num_neurons),
                nn.ReLU(),
                nn.Linear(num_neurons, num_outputs)
                )

    #////////////////////////////////////////////////

    def forward(self, state):

        #////////////////////////////////////////////////
        # model(x) == model.forward(x)
        #////////////////////////////////////////////////

        res = None

        #////////////////////////////////////////////////

        if(self.dueling):

            #////////////////////////////////////////////////

            fea = self.feature(state)
            val = self.value(fea)
            adv = self.advantage(fea)

            #////////////////////////////////////////////////

            res = val + (adv - adv.mean(dim=1, keepdim=True)) # subtract the per-sample mean over the action dimension

        #////////////////////////////////////////////////

        else:

            #////////////////////////////////////////////////

            res = self.feature(state)

        #////////////////////////////////////////////////

        return(res)

##################################################################################

class DQNetwork(object):

    #////////////////////////////////////////////////
    """DQNetwork - Deep Q Network"""
    #////////////////////////////////////////////////

    def __init__(self, SYS, target):

        #////////////////////////////////////////////////
        """DQNetwork - Deep Q Network"""
        #////////////////////////////////////////////////

        self.class_name  = self.__class__.__name__

        #////////////////////////////////////////////////

        self.num_inputs  = SYS.NET['STATE_SIZE']
        self.num_neurons = SYS.NET['NEURONS']
        self.num_outputs = SYS.NET['ACTION_SIZE']
        self.batch_size  = SYS.NET["FIT_BATCH_SIZE"]
        self.learn_rate  = SYS.NET['LEARNING_RATE']
        self.dueling     = (SYS.NET['DUELING'] == 1)
        self.verbose     = SYS.NET["VERBOSE"]
        self.target      = target

        #////////////////////////////////////////////////
        # self.device    = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        #////////////////////////////////////////////////

        self.model       = DQNTorch(self.num_inputs, self.num_outputs, self.num_neurons, self.dueling)
        self.optimizer   = torch.optim.Adam(self.model.parameters(), lr=self.learn_rate)
        self.get_loss    = nn.MSELoss()

        #////////////////////////////////////////////////

        print(LINE)
        print(f"Init {self.class_name}({SYS.NET['NAME']}) with params: {self.__dict__}")

    #////////////////////////////////////////////////

    def fit(self, current_q, expected_q):

        #////////////////////////////////////////////////
        """fit the model aka minimize the loss"""
        #////////////////////////////////////////////////

        self.optimizer.zero_grad()

        #////////////////////////////////////////////////

        loss = self.get_loss(current_q, expected_q)

        #////////////////////////////////////////////////

        loss.backward()

        #////////////////////////////////////////////////

        self.optimizer.step()

        #////////////////////////////////////////////////

        res = loss.item()

        #////////////////////////////////////////////////

        return(res)

    #////////////////////////////////////////////////

    def predict(self, state):

        #////////////////////////////////////////////////
        """predict Q values (for actions) on ONE state or a BATCH of states"""
        #////////////////////////////////////////////////

        res = self.model(state)

        #////////////////////////////////////////////////

        return(res)
        
##################################################################################

MEMORY module and parts of the AGENT module:


##################################################################################
import random
from collections import deque

import numpy as np
import torch
##################################################################################

class MEMORY_D(object):

    #////////////////////////////////////////////////
    """memory in a deque"""
    #////////////////////////////////////////////////

    def __init__(self, max_size):

        #////////////////////////////////////////////////
        """memory"""
        #////////////////////////////////////////////////

        self.buffer = deque(maxlen=max_size) # buffer is a deque from collections

        #////////////////////////////////////////////////

    def add(self, obj):

        #////////////////////////////////////////////////

        self.buffer.append(obj)

    #////////////////////////////////////////////////

    def get_batch(self, batch_size):

        #////////////////////////////////////////////////

        res = random.sample(self.buffer, k=batch_size)

        #////////////////////////////////////////////////

        return(res)

    #////////////////////////////////////////////////

    def get_size(self):

        #////////////////////////////////////////////////

        return(len(self.buffer))

##################################################################################

class DQNAgent(object):

    #////////////////////////////////////////////////
    """DDQNAgent"""
    #////////////////////////////////////////////////

    def __init__(self, SYS):

        #////////////////////////////////////////////////
        """load env and init agent"""
        #////////////////////////////////////////////////

        SYS.AGT["NAME"] = self.__class__.__name__

        #////////////////////////////////////////////////

        self.MEM = MEMORY_D(max_size=SYS.AGT["MEMORY_MAX"])
        self.RAM = None # todo preallocate?

        #////////////////////////////////////////////////

        self.DQN = DQNetwork(SYS, target=False)
        self.TAR = DQNetwork(SYS, target=True)

        #////////////////////////////////////////////////

        self.UDTAU = SYS.AGT['TAR_UPD_TAU']
        self.GAMMA = SYS.AGT["GAMMA"]

        #////////////////////////////////////////////////

        self.state_size  = SYS.NET['STATE_SIZE']
        self.action_size = SYS.NET['ACTION_SIZE']

        #////////////////////////////////////////////////

        self.target_mode = SYS.AGT["TARGET_MODE"]
        self.batch_size  = SYS.AGT["MEM_BATCH_SIZE"]

        #////////////////////////////////////////////////

        self.update_every = SYS.AGT['TAR_UPD_EVERY']
        self.update_count = 0

        #////////////////////////////////////////////////
        # preallocate vars
        #////////////////////////////////////////////////

        self.state    = torch.zeros(1, self.state_size, dtype=torch.float32)
        self.action   = 0

        #////////////////////////////////////////////////

        self.c_states = torch.zeros(self.batch_size, self.state_size , dtype=torch.float32)
        self.actions  = torch.zeros(self.batch_size, 1               , dtype=torch.int64)
        self.rewards  = torch.zeros(self.batch_size, 1               , dtype=torch.float32)
        self.n_states = torch.zeros(self.batch_size, self.state_size , dtype=torch.float32)
        self.dones    = torch.zeros(self.batch_size, 1               , dtype=torch.int64)

        #////////////////////////////////////////////////

        self.DQN.model.eval()
        self.TAR.model.eval()

        #////////////////////////////////////////////////

        self.update_TAR_hard() # copy weights from DQN to TAR

        #////////////////////////////////////////////////

        print(LINE)
        print("Init", SYS.AGT["NAME"], "with params: {}".format(self.__dict__))
        
    #////////////////////////////////////////////////

    def train(self, SYS):

        #////////////////////////////////////////////////
        """train the agent on a MINIBATCH"""
        #////////////////////////////////////////////////

        loss = 0.0

        #////////////////////////////////////////////////

        if(self.get_mem_size() <= self.batch_size): return(0)

        #////////////////////////////////////////////////
        # possible bottleneck, using a list is ~ 3.3 x faster for random access!
        #////////////////////////////////////////////////

        self.RAM = self.MEM.get_batch(batch_size=self.batch_size)

        #////////////////////////////////////////////////

        self.c_states = torch.from_numpy(np.array([item[0] for item in self.RAM], dtype=np.float32))
        self.actions  = torch.from_numpy(np.array([item[1] for item in self.RAM], dtype=np.int64))
        self.rewards  = torch.from_numpy(np.array([item[2] for item in self.RAM], dtype=np.float32))
        self.n_states = torch.from_numpy(np.array([item[3] for item in self.RAM], dtype=np.float32))
        self.dones    = torch.from_numpy(np.array([item[4] for item in self.RAM], dtype=np.int64))

        #////////////////////////////////////////////////

        q_values = self.DQN.predict(self.c_states).gather(1, self.actions.unsqueeze(1)).squeeze(1)

        dqn_next = self.DQN.predict(self.n_states)

        #////////////////////////////////////////////////

        q_action = torch.argmax(dqn_next, dim=1)

        tar_next = self.TAR.predict(self.n_states).gather(1, q_action.unsqueeze(1)).squeeze(1)

        #////////////////////////////////////////////////

        q_target = (self.rewards + (self.GAMMA * tar_next * (1 - self.dones)))

        #////////////////////////////////////////////////

        self.DQN.model.train()

        #////////////////////////////////////////////////

        loss = self.DQN.fit(current_q=q_values, expected_q=q_target)

        #////////////////////////////////////////////////

        self.DQN.model.eval()

        #////////////////////////////////////////////////

        self.update_TAR() # copy DQN to TAR on every N episodes

        #////////////////////////////////////////////////

        return(loss)
        
##################################################################################

Hi,

I would advise getting all your samples into a single tensor, wrapping that in a TensorDataset, and using the built-in DataLoader to get samples for training. That will make your data loading as efficient as it can be. Use a couple of workers to make sure you feed the model as much data as possible.

Then you might want to check how multithreading is behaving. In particular, if you have a small network, it might be beneficial to set torch.set_num_threads(x) to a value x that is smaller than the number of CPU threads. You will need to experiment with that value, as it depends a lot on your CPU and workload.
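
Roughly something like this (a minimal sketch with placeholder data, not your actual tensors):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    # optional: limit the number of intra-op CPU threads; tune this for your machine
    torch.set_num_threads(2)

    # placeholder data: 2000 samples with 5 features each, plus dummy targets
    states  = torch.randn(2000, 5)
    targets = torch.randn(2000, 1)

    dataset = TensorDataset(states, targets)
    loader  = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

    for batch_states, batch_targets in loader:
        pass  # forward / loss / backward would go here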

Dear albanD, thank you for your feedback. I have already read a bit about TensorDataset/DataLoader; I will look at it again.

To be able to use my data in a DDQN reinforcement learner, I programmed my own environment (it uses NumPy). I do NOT use the CartPole/MountainCar environments or anything like that.

The big question is:

Can I increase the speed of PyTorch by a factor of more than 2-3x by loading the data into tensors? That's what I need to match the speed of Keras/Theano. I would have to rewrite the complete workflow/environment to do this, and I have my doubts. Do you have any experience with this?

See the following simple example:

MAIN LOOP (everything else removed)

    while not done:
        #////////////////////////////////////////////////
        SYS.ENV['STEPS'] += 1
        #////////////////////////////////////////////////
        action = SYS.AGT['CALL'].get_action(SYS, state)
        #////////////////////////////////////////////////
        next_state, reward, done, info = SYS.ENV['CALL'].step(action)
        #////////////////////////////////////////////////
        SYS.AGT['CALL'].remember(state, action, reward, next_state, done) # add data to memory
        #////////////////////////////////////////////////
        state = deepcopy(next_state) # copy
        #////////////////////////////////////////////////
        tqdm_e.update(1)
        #////////////////////////////////////////////////
        if(done): return(True)

The important part here is SYS.AGT['CALL'].get_action(SYS, state); it “predicts” an action from a state.

I can do this with the Keras/Theano backend:

    def get_action(self, SYS, state): # (Keras/Theano)
        #////////////////////////////////////////////////
        self.state = convert_1D2D(state, 1, self.state_size)
        #////////////////////////////////////////////////
        qvals = self.DQN.predict(self.state)
        #////////////////////////////////////////////////
        self.action = np.argmax(qvals[0])
        #////////////////////////////////////////////////
        return(self.action)

ONE simple predict per call: [00:01<00:00, 1635.84 item(s)/s]

And I can do this with PyTorch:

    def get_action(self, SYS, state): # (PyTorch)
        #////////////////////////////////////////////////
        self.state = torch.from_numpy(state).float().unsqueeze(0)
        #////////////////////////////////////////////////
        self.DQN.model.eval()
        #////////////////////////////////////////////////
        with torch.no_grad():
            qvals = self.DQN.predict(self.state).numpy() # this predict is forward or model(x)
        #////////////////////////////////////////////////
        self.action = np.argmax(qvals[0])
        #////////////////////////////////////////////////
        return(self.action)

ONE simple forward per call: [00:03<00:00, 643.33 item(s)/s]

1635÷643 = ~ 2.5

That means Keras with the Theano backend is about 2.5 times faster for this simple task. Both are using one CPU core at the moment. I would turn on multi-core support as the last step; before that, it must be clear which data is processed where and how. Thanks anyway for the hint.

Do you think I can make up the difference by using PyTorch Tensors for my data?

Can anyone tell me something about my question 2?

Lots of things can be at play here.
Is the .float() really useful? If so, that is going to be relatively expensive (depending on the size of your DQN).
Why do you call .eval() every time? Once should be enough, no?
How did you install PyTorch? Do you use MKL/MKL-DNN?

Dear albanD, thank you again for your feedback.

With the help of the following instructions I tested whether MKL is active.
https://gist.github.com/mingfeima/bdfb2db3928ca51b795622b29264ef11

Here are the results:

ldd libtorch.so
linux-vdso.so.1 (0x00007fff38544000)
libcudart-1b201d85.so.10.1 => /home/user/.local/lib/python3.7/site-packages/torch/lib/./libcudart-1b201d85.so.10.1 (0x000072a3e395a000)
libgomp-7c85b1e2.so.1 => /home/user/.local/lib/python3.7/site-packages/torch/lib/./libgomp-7c85b1e2.so.1 (0x000072a3e3730000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000072a3e36f0000)
librt.so.1 => /lib64/librt.so.1 (0x000072a3e36e6000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000072a3e36cc000)
libdl.so.2 => /lib64/libdl.so.2 (0x000072a3e36c6000)
libm.so.6 => /lib64/libm.so.6 (0x000072a3e357e000)
libc10_cuda.so => /home/user/.local/lib/python3.7/site-packages/torch/lib/./libc10_cuda.so (0x000072a3e3350000)
libnvToolsExt-3965bdd0.so.1 => /home/user/.local/lib/python3.7/site-packages/torch/lib/./libnvToolsExt-3965bdd0.so.1 (0x000072a3e3146000)
libc10.so => /home/user/.local/lib/python3.7/site-packages/torch/lib/./libc10.so (0x000072a3e2ef4000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000072a3e2cfa000)
libc.so.6 => /lib64/libc.so.6 (0x000072a3e2b34000)
/lib64/ld-linux-x86-64.so.2 (0x000072a42b45c000)

Note: all versions of PyTorch (with or without CUDA support) have Intel® MKL-DNN acceleration support enabled by default. https://software.intel.com/en-us/articles/getting-started-with-intel-optimization-of-pytorch

So it looks like PyTorch doesn’t use MKL in my distribution?

But this:

python3 -c 'import torch; a = torch.randn(10); print(a.to_mkldnn().layout)'
torch._mkldnn <<< result on my machine

does work?

I use Fedora 30 (Qubes OS) and installed PyTorch via pip. I will create a separate conda environment, install PyTorch with MKL, and test it, thanks for the hint. I know that Theano uses OpenBLAS, maybe that explains the speed difference?
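
For completeness, these checks should also tell me whether MKL/MKL-DNN were compiled in (assuming the torch.backends flags report what I think they do):

    import torch

    print(torch.backends.mkl.is_available())     # True if PyTorch was built with Intel MKL
    print(torch.backends.mkldnn.is_available())  # True if MKL-DNN support is compiled in
    print(torch.__config__.show())               # full build configuration, including BLAS/MKL details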

" .eval() every time"

The loop above is my bootstrap loop, so the label “MAIN LOOP (everything else removed)” is wrong here. It should read “BOOT LOOP (everything else removed)”.

I use it to fill the memory before I start training. So in this case you're right, I don't need .eval() there. I already checked that; removing it helps a bit, but not very much.

BOOT LOOP (everything else removed):

    while not done:
        #////////////////////////////////////////////////
        SYS.ENV['STEPS'] += 1
        #////////////////////////////////////////////////
        action = SYS.AGT['CALL'].get_action(SYS, state)
        #////////////////////////////////////////////////
        next_state, reward, done, info = SYS.ENV['CALL'].step(action)
        #////////////////////////////////////////////////
        SYS.AGT['CALL'].remember(state, action, reward, next_state, done) # add data to memory
        #////////////////////////////////////////////////
        state = deepcopy(next_state) # copy
        #////////////////////////////////////////////////
        tqdm_e.update(1)
        #////////////////////////////////////////////////
        if(done): return(True)

followed by the TRAIN LOOP:

        while not done:
            #////////////////////////////////////////////////
            SYS.ENV['STEPS'] += 1
            #////////////////////////////////////////////////
            action = SYS.AGT['CALL'].get_action(SYS, state)
            #////////////////////////////////////////////////
            next_state, reward, done, info = SYS.ENV['CALL'].step(action)
            #////////////////////////////////////////////////
            SYS.AGT['CALL'].remember(state, action, reward, next_state, done)
            #////////////////////////////////////////////////
            SYS.ENV['SCORE']  += reward
            SYS.ENV['LOSSES'] += SYS.AGT['CALL'].train(SYS)
            #////////////////////////////////////////////////
            state = deepcopy(next_state)

Inside the TRAIN LOOP:

  • in SYS.AGT['CALL'].get_action() I switch to eval()
  • in SYS.AGT['CALL'].train() I switch to train()

You can see that in DQNAgent.train() in the code of the first post. But as I can see, I have done it twice; that will be corrected.

So my next steps are as follows:

  • move all data to tensors (TensorDataset) and use DataLoader
  • check the usage of float()
  • check the usage of eval()/train()
  • install PyTorch via Conda and compare the runtimes with/without MKL

It will take some time (days) but I will report the results here.

Since I have seen in other threads that you are very well informed about these topics: can you answer my question 2 (from the first post)?

Thank you very much for your efforts to help me. :+1:

For your Stack Overflow question: the batch size is 8, and by using .gather(1, action_batch) you select the one action that was taken out of the 2 possible ones, so you get an output of size [8, 1].
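
As a toy example of the shapes involved (made-up numbers, not your exact tensors):

    import torch

    q_values     = torch.randn(8, 2)            # Q-values for a batch of 8 states and 2 actions
    action_batch = torch.randint(0, 2, (8, 1))  # the action that was actually taken in each transition
    selected     = q_values.gather(1, action_batch)
    print(selected.shape)                       # torch.Size([8, 1]): only Q(s, a) of the chosen action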

That part is clear to me, but I still don't understand how it works internally. Sorry :see_no_evil:

The code in the SO question is from this PyTorch tutorial; I didn't change it.

To make the question clearer:

# Compute Huber loss
loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

Here the loss is computed from two tensors with batch size 8 and one column. But we have to predict/learn Q-values for 2 actions [LEFT|RIGHT] in the CartPole env? How can we do loss.backward() without knowing which action the update is for?

Does gather() just return some values, or does it somehow internally save the indices for the following backward() call?

If so, what happens when I need another gather() call to collect something, like in my code from the AGENT module?

    def train(self, SYS):
        #////////////////////////////////////////////////
        """train the agent on a MINIBATCH"""
        #////////////////////////////////////////////////
        loss = 0.0
        #////////////////////////////////////////////////
        if(self.get_mem_size() <= self.batch_size): return(0)
        #////////////////////////////////////////////////
        self.RAM = self.MEM.get_batch(batch_size=self.batch_size)
        #////////////////////////////////////////////////
        self.c_states = torch.from_numpy(np.array([item[0] for item in self.RAM], dtype=np.float32))
        self.actions  = torch.from_numpy(np.array([item[1] for item in self.RAM], dtype=np.int64))
        self.rewards  = torch.from_numpy(np.array([item[2] for item in self.RAM], dtype=np.float32))
        self.n_states = torch.from_numpy(np.array([item[3] for item in self.RAM], dtype=np.float32))
        self.dones    = torch.from_numpy(np.array([item[4] for item in self.RAM], dtype=np.int64))
        #////////////////////////////////////////////////

        q_values = self.DQN.predict(self.c_states).gather(1, self.actions.unsqueeze(1)).squeeze(1)
        dqn_next = self.DQN.predict(self.n_states)

        #////////////////////////////////////////////////

        q_action = torch.argmax(dqn_next, dim=1)
        tar_next = self.TAR.predict(self.n_states).gather(1, q_action.unsqueeze(1)).squeeze(1)

        #////////////////////////////////////////////////
        q_target = (self.rewards + (self.GAMMA * tar_next * (1 - self.dones)))
        #////////////////////////////////////////////////
        self.DQN.model.train()
        #////////////////////////////////////////////////
        loss = self.DQN.fit(current_q=q_values, expected_q=q_target)
        #////////////////////////////////////////////////
        self.DQN.model.eval()
        #////////////////////////////////////////////////
        self.update_TAR() # copy DQN to TAR on every N episodes
        #////////////////////////////////////////////////
        return(loss)

It works fine, but I don't get how it works.

Thanks again!

The update rule for the DQN gets gradients flowing only for the index that was selected.
The backward pass of the gather function does just that: it passes the gradient of the output to the values that were gathered and 0 to everything else. The indices are saved during the forward pass as a “buffer” needed to compute the backward. We actually save a lot of them during the forward.
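
A tiny illustration (not from either codebase, just to show where the gradient goes):

    import torch

    q   = torch.randn(3, 2, requires_grad=True)  # toy Q-values: 3 samples, 2 actions
    idx = torch.tensor([[0], [1], [0]])          # the "chosen" action per sample
    out = q.gather(1, idx)                       # the forward pass saves idx for the backward
    out.sum().backward()

    print(q.grad)
    # only the gathered entries receive a gradient (here 1.0), every other entry gets 0:
    # tensor([[1., 0.],
    #         [0., 1.],
    #         [1., 0.]])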

Dear albanD, thank you for the explanation of the gather() function. I haven't fully understood it yet, but now I know where to look for more information. Links are welcome :wink:

Unfortunately, I'm back earlier than I thought, with bad news. I've installed PyTorch in a conda env called “test232”; MKL should work. But nothing has changed. I did some tests and would like to show you the results.

(test232) [user@localhost lib]$ python --version
Python 3.7.4

(test232) [user@localhost lib]$ which python
~/.conda/envs/test232/bin/python

(test232) [user@localhost /]$ python -c 'import torch; print(torch.__path__)'
['/home/user/.conda/envs/test232/lib/python3.7/site-packages/torch']

(test232) [user@localhost /]$ cd /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch
(test232) [user@localhost torch]$ cd lib
(test232) [user@localhost lib]$ ldd libtorch.so
	linux-vdso.so.1 (0x00007ffe21549000)
	libgomp.so.1 => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libgomp.so.1 (0x00007256b0ed9000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007256b0e99000)
	librt.so.1 => /lib64/librt.so.1 (0x00007256b0e8f000)
	libgcc_s.so.1 => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libgcc_s.so.1 (0x00007256b0e7b000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007256b0e75000)
	libmkl_intel_lp64.so => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libmkl_intel_lp64.so (0x00007256b0309000)
	libmkl_gnu_thread.so => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libmkl_gnu_thread.so (0x00007256aea2e000)
	libmkl_core.so => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libmkl_core.so (0x00007256aa70e000)
	libm.so.6 => /lib64/libm.so.6 (0x00007256aa5c8000)
	libc10.so => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./libc10.so (0x00007256aa36f000)
	libstdc++.so.6 => /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch/lib/./../../../../libstdc++.so.6 (0x00007256aa1fb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007256aa035000)
	/lib64/ld-linux-x86-64.so.2 (0x00007256b687c000)
	
(test232) [user@localhost lib]$ python -c 'import torch; a = torch.randn(10); print(a.to_mkldnn().layout)'
torch._mkldnn

BOOT LOOP

        while not done:
            #////////////////////////////////////////////////
            SYS.ENV['STEPS'] += 1
            #////////////////////////////////////////////////
            action = SYS.AGT['CALL'].get_action1(SYS, state)
            #////////////////////////////////////////////////
            next_state, reward, done, info = SYS.ENV['CALL'].step(action)
            #////////////////////////////////////////////////
            SYS.AGT['CALL'].remember(state, action, reward, next_state, done)
            #////////////////////////////////////////////////
            state = deepcopy(next_state)
            #////////////////////////////////////////////////
            tqdm_e.update(1)
            #////////////////////////////////////////////////
            if(SYS.AGT['CALL'].get_mem_size() >= request):
                tqdm_e.refresh()
                tqdm_e.close()
                return(True)

I've created several different get_action() functions. The old one (from above) is get_action1(); you already know it:

    def get_action1(self, SYS, state): # (PyTorch)
        #////////////////////////////////////////////////
        self.epsilon = SYS.AGT['EPSILON']
        self.action  = -1 # set to invalid value
        #////////////////////////////////////////////////
        self.state = torch.from_numpy(state).float().unsqueeze(0)
        #////////////////////////////////////////////////
        self.DQN.model.eval()
        #////////////////////////////////////////////////
        with torch.no_grad():
            qvals = self.DQN.predict(self.state).numpy()
        #////////////////////////////////////////////////
        self.action = np.argmax(qvals[0])
        #////////////////////////////////////////////////
        return(self.action)

and

    def get_action2(self, SYS, state):
        #////////////////////////////////////////////////
        self.epsilon = SYS.AGT['EPSILON']
        self.action  = -1 # set to invalid value
        #////////////////////////////////////////////////
        self.state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        #////////////////////////////////////////////////
        # self.DQN.model.eval()
        #////////////////////////////////////////////////
        with torch.no_grad():
            qvals = self.DQN.model(self.state)
        #////////////////////////////////////////////////
        self.action = torch.argmax(qvals, dim=1).item()
        #////////////////////////////////////////////////
        return(self.action)

I've increased my memory size to 32,000 to get stable measurements:

Keras/Theano backend
bootstrap examples: 100%|██████████| 32000/32000 [00:10<00:00, 3098.43 item(s)/s]

PyTorch without MKL using get_action1()
python: /usr/bin/python3
torch.path: /home/user/.local/lib/python3.7/site-packages/torch
bootstrap examples: 100%|██████████| 32000/32000 [00:15<00:00, 2085.01 item(s)/s]

PyTorch with MKL using get_action1()
python: /home/user/.conda/envs/test232/bin/python
torch.path: /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch
bootstrap examples: 100%|██████████| 32000/32000 [00:15<00:00, 2054.94 item(s)/s]

PyTorch with MKL using get_action2()
python: /home/user/.conda/envs/test232/bin/python
torch.path: /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch
bootstrap examples: 100%|██████████| 32000/32000 [00:13<00:00, 2319.04 item(s)/s]

A small speed increase but unfortunately it is still slower than Keras with the Theano backend.

To rule out data processing (the data lives in NumPy) as the bottleneck, I created get_action3():

self.state is now initialized to zeros only ONCE:

    def get_action3(self, SYS, state):
        #////////////////////////////////////////////////
        # self.state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        #////////////////////////////////////////////////
        # self.DQN.model.eval()
        #////////////////////////////////////////////////
        with torch.no_grad():
            qvals = self.DQN.model(self.state) # <<< self.state is now only initialized to zeros
        #////////////////////////////////////////////////
        self.action = torch.argmax(qvals, dim=1).item()
        #////////////////////////////////////////////////
        return(self.action)

PyTorch with MKL using get_action3()
python: /home/user/.conda/envs/test232/bin/python
torch.path: /home/user/.conda/envs/test232/lib/python3.7/site-packages/torch
bootstrap examples: 100%|██████████| 32000/32000 [00:13<00:00, 2439.39 item(s)/s]

To see how fast the loop runs when no get_action() function is called at all, I changed the corresponding line in the boot loop:

BOOT LOOP without get_action() call

        while not done:
            #////////////////////////////////////////////////
            SYS.ENV['STEPS'] += 1
            #////////////////////////////////////////////////
            action = 0 # no get_action() call
            #////////////////////////////////////////////////
            next_state, reward, done, info = SYS.ENV['CALL'].step(action)
            #////////////////////////////////////////////////
            SYS.AGT['CALL'].remember(state, action, reward, next_state, done)
            #////////////////////////////////////////////////
            state = deepcopy(next_state)
            #////////////////////////////////////////////////
            tqdm_e.update(1)
            #////////////////////////////////////////////////
            if(SYS.AGT['CALL'].get_mem_size() >= request):
                tqdm_e.refresh()
                tqdm_e.close()
                return(True)

That shows the maximum speed that my code/environment can deliver, including storing new data.

bootstrap examples: 100%|██████████| 32000/32000 [00:01<00:00, 26063.70 item(s)/s]

I’m really sorry to say that, but unfortunately it looks like PyTorch can’t keep up. :sob:

I haven’t fully understood it yet,

What is unclear? The main idea is that if some input is not used in the output, then the gradient of the output wrt that input is 0.

PyTorch can’t keep up.

Yes, unfortunately PyTorch was built more for large GPU training and experimentation than for CPU inference on small inputs (batch size 1).
The way we are going to make such inference more efficient is with the jit module and TorchScript.
Also, if you are only doing inference, you should run your loop inside a with torch.no_grad(): block to avoid the autograd overhead.
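
For example, tracing a small model for inference could look roughly like this (the network and shapes below are placeholders, not your actual DQN):

    import torch
    import torch.nn as nn

    # placeholder network, roughly the size of a small DQN
    net = nn.Sequential(nn.Linear(5, 24), nn.ReLU(), nn.Linear(24, 3))
    net.eval()

    example = torch.zeros(1, 5)
    traced  = torch.jit.trace(net, example)  # records the forward pass for this input shape

    with torch.no_grad():
        qvals  = traced(example)
        action = torch.argmax(qvals, dim=1).item()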

I have now created get_action4():

    def get_action4(self, SYS, state):
        #////////////////////////////////////////////////
        qvals = self.DQN.model(self.state)
        #////////////////////////////////////////////////
        self.action = torch.argmax(qvals, dim=1).item()
        #////////////////////////////////////////////////
        return(self.action)

and ran the boot loop inside torch.no_grad():

        with torch.no_grad():
            while not done:

bootstrap examples: 100%|██████████| 32000/32000 [00:13<00:00, 2426.59 item(s)/s]

That also does not help, sorry. Thank you for your help anyway, and I'm sorry that I took so much of your time. It is really sad, but under these circumstances I unfortunately have to stay with Keras/Theano. In my case, these “little differences” mean several hours of extra training time.

By the way, I experienced almost the same story with TensorFlow. I started with Keras/Theano, then spent weeks working with TensorFlow, various environments from Intel/conda-forge and so on. No matter what I tried, everything was slower than Theano.

And now the same with PyTorch, I never would have thought that…

Fortunately, I don't have to understand in detail how gather() and backward() work. :grinning:

I would guess that this is because Theano compiles the whole thing ahead of time, so it can do many optimizations that PyTorch cannot do :confused: That's the price to pay for flexibility.
I am surprised that TensorFlow is slower, though.

Unfortunately Theano is no longer maintained or developed, so I thought it would make sense to look for an alternative. But I agree with you, that could be the (or one) reason. Yes, I was also surprised, but I will have to look into it again in more detail.

I wish you all the best, many thanks again. Hopefully / maybe we’ll see each other again :wink:

Sure, happy to help.

And if you get a chance later, try TorchScript, which we use as the PyTorch runtime for inference. It can do fancy optimizations because it knows in advance what you're going to run. But it does not support training yet :confused:

Without training I can't use it, but thanks.

Bye bye and greetings from Berlin :grinning: