Conv module in forward can't alloc

Hi, here is the problem description:
I’m training an IMPALA agent on a GPU, with 113000 MB of memory allocated, and every time training passes about 1.1M frames I get this error:


It doesn’t show something like a CUDA out-of-memory error, just a RuntimeError: can’t alloc. The feature_extraction model contains a six-layer Conv2d net.
Is my model too big for scalable training? What can I do to reduce the memory usage?

This is a CPU memory allocation failure; the screenshot is not enough to determine the root cause.

@iffiX Thanks for identifying the problem! I’m new here and can only upload one image at a time :frowning:. Can you give me some hints on how to determine the root cause? I currently have 40 actors running and filling the experience buffer.

Is it possible for you to show all of your code?

Unfortunately, the code is too large; however, I can show you the code that creates the actors and the batch_and_learn run in each thread.
Here is the actor initialization:

And here is batch_and_learn in each thread:


(I can still only upload one image at a time :frowning:)

There doesn’t seem to be anything wrong in the part you showed. You could try decreasing the local/global replay buffer size in your implementation, or skip learning and rolling out entirely by appending randomly generated tensor data to your buffer, see when it overflows, and track down the distributed bug from there.
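Something along these lines, just an untested sketch — the buffer layout and sizes here are made up, adapt them to your own buffers — and watch the process memory with top/htop while it runs:

```python
# Untested sketch: stress the rollout buffers with random data only
# (no environments, no learner) and watch the process memory with top/htop.
# Buffer layout and sizes below are made up; adapt them to your own buffers.
import torch

NUM_BUFFERS = 40
UNROLL_LENGTH = 80
FRAME_SHAPE = (4, 84, 84)

buffers = {
    'frame': [torch.empty(UNROLL_LENGTH + 1, *FRAME_SHAPE, dtype=torch.uint8).share_memory_()
              for _ in range(NUM_BUFFERS)],
    'reward': [torch.empty(UNROLL_LENGTH + 1).share_memory_()
               for _ in range(NUM_BUFFERS)],
}

for step in range(1_000_000):
    index = step % NUM_BUFFERS
    for t in range(UNROLL_LENGTH + 1):
        # Write random data instead of real rollouts.
        buffers['frame'][index][t, ...] = torch.randint(0, 256, FRAME_SHAPE, dtype=torch.uint8)
        buffers['reward'][index][t] = torch.rand(())
    if step % 1000 == 0:
        print(f"filled {step} rollouts", flush=True)
```

If memory stays flat here, the leak is in the learning/sampling path rather than in the buffers themselves.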

If you would like a correct implementation of IMPALA as a reference, I have written and tested one before.
Doc here:
https://machin.readthedocs.io/en/latest/tutorials/unleash_distributed_power.html#impala
and source code here:


@Shizhe_Cai you could copy-paste your code between triple backticks instead of uploading a screenshot, something like this:

```
your entire code here
```

This would make it much more readable.

@iffiX Thank you, I will try reducing the buffer size first. @a_d My implementation is actually based on the TorchBeast implementation:
https://github.com/facebookresearch/torchbeast/blob/master/torchbeast/monobeast.py
All I’ve modified is adding some new models to train.

Sorry to bother you, but I still haven’t solved the problem. I tried halving the buffer size, but it still causes the alloc error when reaching about 1.1M frames.

The allocation error could come from a leak anywhere. I suggest you remove training and sampling completely, just stuff random data into your system, and test its stability.
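For tracking it down, a small helper like this rough sketch (using psutil; log_memory is a hypothetical name, not part of your framework), called once per learner iteration, would at least tell you whether the growth is in the main process or in the actor processes:

```python
# Rough sketch of a hypothetical helper: log the resident memory of this
# process and its children (the actor processes) to see when usage grows.
import os
import time

import psutil


def log_memory(tag: str = "") -> None:
    proc = psutil.Process(os.getpid())
    rss = proc.memory_info().rss
    # Sum the resident memory of all child processes (actors).
    children = sum(c.memory_info().rss for c in proc.children(recursive=True))
    print(f"[{time.strftime('%H:%M:%S')}] {tag} "
          f"rss={rss / 2**20:.0f} MiB children={children / 2**20:.0f} MiB",
          flush=True)


# e.g. call log_memory(f"frames={frames}") once per learner iteration
```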

Hi iffi,
I think I’ve located the problem, but I don’t know how to fix it. As you know, the IMPALA architecture has actors and learners. I’m using the TorchBeast implementation called monobeast, which runs the actors in a number of CPU processes and does the learning on the GPU. This pattern has some drawbacks, which are also pointed out in the TorchBeast paper: it requires a large amount of shared memory between processes, it runs model evaluation and environment.step on the CPU, and there are some unnecessary copies of tensors. So as the frame count grows, more and more memory is eaten by tensors that are no longer used. That’s why, no matter what buffer size and number of actors I use, the run always fails around 1.1M frames: there is no room left for the actor model to do its forward pass.
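For context, monobeast preallocates its rollout buffers as shared-memory tensors, roughly like the sketch below (paraphrased from memory rather than copied from torchbeast; the keys and shapes are illustrative):

```python
# Rough sketch of how monobeast-style shared rollout buffers are preallocated
# (paraphrased, not the exact torchbeast code; keys and shapes are illustrative).
import torch


def create_buffers(unroll_length, num_buffers, obs_shape, num_actions):
    T = unroll_length
    specs = dict(
        frame=dict(size=(T + 1, *obs_shape), dtype=torch.uint8),
        reward=dict(size=(T + 1,), dtype=torch.float32),
        done=dict(size=(T + 1,), dtype=torch.bool),
        policy_logits=dict(size=(T + 1, num_actions), dtype=torch.float32),
        action=dict(size=(T + 1,), dtype=torch.int64),
    )
    buffers = {key: [] for key in specs}
    for _ in range(num_buffers):
        for key in buffers:
            # Each buffer lives in shared memory so actor processes can write
            # into it and the learner can read it without extra copies.
            buffers[key].append(torch.empty(**specs[key]).share_memory_())
    return buffers


# e.g. buffers = create_buffers(unroll_length=80, num_buffers=40,
#                               obs_shape=(4, 84, 84), num_actions=15)
```

Since these are fixed-size and recycled through the free/full queues, the buffers themselves shouldn’t be what keeps growing over time.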

I wonder how I can free up that memory. I’ve tried del output and del loss; the limit increased to about 1.3M frames, but that’s it. Here is the part that fills the buffer, followed by the part that withdraws from the buffer.
```python
def act(i: int, free_queue: mp.SimpleQueue, full_queue: mp.SimpleQueue,
        model: torch.nn.Module, buffers: Buffers,
        episode_state_count_dict: dict, train_state_count_dict: dict,
        initial_agent_state_buffers, flags):

    try:
        log.info('Actor %i started.', i)
        timings = prof.Timings()

        gym_env = create_env(flags)

        if flags.num_input_frames > 1:
            gym_env = FrameStack(gym_env, flags.num_input_frames)

        if 'procgen' in flags.env:
            env = ProcGenEnvironment(gym_env, flags.start_level, flags.num_levels, flags.distribution_mode)
        else:
            seed = i ^ int.from_bytes(os.urandom(4), byteorder='little')
            gym_env.seed(seed)
            env = Environment(gym_env, fix_seed=flags.fix_seed, env_seed=flags.env_seed)

        env_output = env.initial()
        agent_state = model.initial_state(batch_size=1)
        agent_output, unused_state = model(env_output, agent_state)

        while True:
            index = free_queue.get()
            if index is None:
                break

            # Write old rollout end.
            for key in env_output:
                buffers[key][index][0, ...] = env_output[key]
            for key in agent_output:
                buffers[key][index][0, ...] = agent_output[key]
            for i, tensor in enumerate(agent_state):
                initial_agent_state_buffers[index][i][...] = tensor

            # Update the episodic state counts
            episode_state_key = tuple(env_output['frame'].view(-1).tolist())
            if episode_state_key in episode_state_count_dict:
                episode_state_count_dict[episode_state_key] += 1
            else:
                episode_state_count_dict.update({episode_state_key: 1})
            buffers['episode_state_count'][index][0, ...] = \
                torch.tensor(1 / np.sqrt(episode_state_count_dict.get(episode_state_key)))

            # Reset the episode state counts when the episode is over
            if env_output['done'][0][0]:
                for episode_state_key in episode_state_count_dict:
                    episode_state_count_dict = dict()

            # Update the training state counts
            train_state_key = tuple(env_output['frame'].view(-1).tolist())
            if train_state_key in train_state_count_dict:
                train_state_count_dict[train_state_key] += 1
            else:
                train_state_count_dict.update({train_state_key: 1})
            buffers['train_state_count'][index][0, ...] = \
                torch.tensor(1 / np.sqrt(train_state_count_dict.get(train_state_key)))

            # delete output
            del agent_output

            # Do new rollout
            for t in range(flags.unroll_length):
                timings.reset()

                with torch.no_grad():
                    agent_output, agent_state = model(env_output, agent_state)

                timings.time('model')

                env_output = env.step(agent_output['action'])

                timings.time('step')

                for key in env_output:
                    buffers[key][index][t + 1, ...] = env_output[key]

                for key in agent_output:
                    buffers[key][index][t + 1, ...] = agent_output[key]

                # Update the episodic state counts
                episode_state_key = tuple(env_output['frame'].view(-1).tolist())
                if episode_state_key in episode_state_count_dict:
                    episode_state_count_dict[episode_state_key] += 1
                else:
                    episode_state_count_dict.update({episode_state_key: 1})
                buffers['episode_state_count'][index][t + 1, ...] = \
                    torch.tensor(1 / np.sqrt(episode_state_count_dict.get(episode_state_key)))

                # Reset the episode state counts when the episode is over
                if env_output['done'][0][0]:
                    episode_state_count_dict = dict()

                # Update the training state counts
                train_state_key = tuple(env_output['frame'].view(-1).tolist())
                if train_state_key in train_state_count_dict:
                    train_state_count_dict[train_state_key] += 1
                else:
                    train_state_count_dict.update({train_state_key: 1})
                buffers['train_state_count'][index][t + 1, ...] = \
                    torch.tensor(1 / np.sqrt(train_state_count_dict.get(train_state_key)))

                timings.time('write')
            full_queue.put(index)

        if i == 0:
            log.info('Actor %i: %s', i, timings.summary())

    except KeyboardInterrupt:
        pass
    except Exception as e:
        logging.error('Exception in worker process %i', i)
        traceback.print_exc()
        print()
        raise e
```

```python
def get_batch(free_queue: mp.SimpleQueue,
              full_queue: mp.SimpleQueue,
              buffers: Buffers,
              initial_agent_state_buffers,
              flags,
              timings,
              lock=threading.Lock()):
    with lock:
        timings.time('lock')
        indices = [full_queue.get() for _ in range(flags.batch_size)]
        timings.time('dequeue')
    batch = {
        key: torch.stack([buffers[key][m] for m in indices], dim=1)
        for key in buffers
    }
    initial_agent_state = (
        torch.cat(ts, dim=1)
        for ts in zip(*[initial_agent_state_buffers[m] for m in indices])
    )
    timings.time('batch')
    for m in indices:
        free_queue.put(m)
    timings.time('enqueue')
    batch = {
        k: t.to(device=flags.device, non_blocking=True)
        for k, t in batch.items()
    }
    initial_agent_state = tuple(t.to(device=flags.device, non_blocking=True)
                                for t in initial_agent_state)
    timings.time('device')
    return batch, initial_agent_state
```

Is there a way I can free those used buffers? Because, to my understanding, memory usage consistently increasing does not make sense, right?

PyTorch is not quite clear about this part. In fact, in this previously asked question (by me):

In some cases (my implementation 2), even when the tensors are dereferenced, their memory is still not freed immediately. For more low-level implementation details, you could consider asking the PyTorch developers @ptrblck and @albanD.
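To check whether Python-level references are actually being dropped, you could periodically count the live tensors that the garbage collector can still see; something like this rough sketch:

```python
# Rough debugging sketch: count the tensors that are still reachable.
# If this number keeps growing, something is still holding references;
# if it stays flat while RSS grows, the leak is below Python (allocator, C++ side).
import gc

import torch


def live_tensor_stats():
    count, total_bytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
                total_bytes += obj.element_size() * obj.nelement()
        except Exception:
            pass  # some objects raise on inspection; ignore them
    return count, total_bytes


count, total_bytes = live_tensor_stats()
print(f"live tensors: {count}, ~{total_bytes / 2**20:.1f} MiB")
```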

I have not tried the jemalloc solution yet, but it is easy enough to install by building it like this:

```
wget https://github.com/jemalloc/jemalloc/releases/download/5.0.1/jemalloc-5.0.1.tar.bz2
tar -jxvf jemalloc-5.0.1.tar.bz2
cd jemalloc-5.0.1
sudo apt-get install autogen autoconf

./autogen.sh
make -j2
sudo make install
sudo ldconfig
cd ../
rm -rf jemalloc-5.0.1 jemalloc-5.0.1.tar.bz2
```

Then set the LD_PRELOAD environment variable to point at the installed libjemalloc.so before running your program (with the steps above it should land under /usr/local/lib; the path below is just the example from the reference):

```
LD_PRELOAD=/home/mingfeim/packages/jemalloc-5.2.0/lib/libjemalloc.so ./your_script.sh
```

Give it a try?

Reference:

I really appreciate your suggestion. Actually, I’m using my university’s HPC cluster and submit a Slurm script to get the job run on a node, so I don’t know whether I’m able to use jemalloc in this case. But again, thank you for the idea!

I see. Then it would be very difficult to submit a precompiled library or compile remotely on the nodes; I guess you will need to tweak the framework code instead, or switch to a different implementation.

@ptrblck and @albanD It would be awesome if you could share some ideas about this.
One more update: I’m using the Procgen environments. On CoinRun I can only train for 1.1M frames, but when I trained the same model on the Maze environment, it reached 5M frames without error.

And could you please have a look at another problem I posted? It’s about the behaviour of multinomial with inf, nan and < 0 inputs in different torch versions: Torch 1.6.0 RuntimeError: probability tensor contains either inf, nan or element < 0, but fine with Torch 1.1.0.

I am not sure this is the same. The issue you have should never lead to OOM.
The memory is still available for this process and can be re-used. It is more an issue that it makes the memory not available for other processes and confuses reporting tools.
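If you want to see what your allocator does with freed CPU tensors, a toy check like this (numbers are illustrative, psutil assumed available) prints the resident set size while allocating, freeing and re-allocating a large tensor:

```python
# Toy check (illustrative numbers): watch RSS while allocating, freeing and
# re-allocating a large CPU tensor. Depending on the allocator, RSS may or may
# not drop after `del`, even though the memory stays reusable by this process.
import psutil
import torch

proc = psutil.Process()


def rss_mib():
    return proc.memory_info().rss / 2**20


print(f"start:        {rss_mib():.0f} MiB")
x = torch.randn(512, 1024, 1024)          # ~2 GiB of float32
print(f"allocated:    {rss_mib():.0f} MiB")
del x
print(f"after del:    {rss_mib():.0f} MiB")
y = torch.randn(512, 1024, 1024)          # same size again, can reuse memory
print(f"re-allocated: {rss_mib():.0f} MiB ({y.numel()} elements)")
```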

I am really not sure where these errors can come from, though. I am pretty sure this is not in our own code, so maybe CUDA-related stuff?
Do you actually run out of RAM while the program runs?

I’m using the HPC system, and I even crashed two nodes because of the OOM. So I’m pretty sure the memory is actually being exhausted.

If the whole node crashes, then the stack trace is most likely not very representative: it just points to whatever it was executing when the node ran out of memory.
I think you should try and reproduce this on your local machine, with a smaller dataset/model and see if you can reproduce the memory increase.