No space left on device error when using torch.save with multiprocessing in Docker

I’m attempting to save a large number of torch Data objects to disk in parallel with the following block:

import multiprocessing as mp

import torch
from tqdm import tqdm

def to_disk(r):
    # Build the graph for this row (create_graph and convert_nx_to_pyg are my own helpers)
    # and save it to disk under the row's ID.
    path = '/data/protein_data_dir/raw/' + r['ID'] + '.pt'
    g = convert_nx_to_pyg(create_graph(r))
    torch.save(g, path)
    return g

NUM_CORE = 35
with mp.Pool(NUM_CORE) as pool:
    out = list(tqdm(pool.imap_unordered(to_disk, rows), total=len(rows)))

This will work for some number of iterations and then fail with “MaybeEncodingError: Error sending result: ‘Data(x=[51, 163], edge_index=[2, 92], edge_attr=[92, 10], y=1.0)’. Reason: ‘RuntimeError(‘unable to write to file </torch_29804_423365969_110>: No space left on device (28)’)’”. If I then restart the loop it will save some more objects before hitting the same error again. Running this on a single process does not trigger the error, but it is slow. There are several terabytes of free space on the disk these objects are being saved to. Anyone know what’s going on?

Since you are using a multiprocessing pool, I assume you depend on your system’s shared memory? If so, you might be running out of shared memory, not actual disk space. Could you check if that’s the case?
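For example, one quick way to check the shared-memory filesystem from inside the container (assuming it is mounted at the usual /dev/shm location) would be something like:

import shutil

# Report how much of the shared-memory tmpfs is used vs. its total size.
total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm: {used / 2**20:.0f} MiB used of {total / 2**20:.0f} MiB total")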

Would that be the RAM? The machine has ~512GB of RAM and the usage never appears to go above ~50. However, I think the Docker container is limited to 64GB of “shared memory” (again, I’m not sure exactly what that means). Maybe I could try increasing that.

Yes, /dev/shm is usually a RAM-backed temporary filesystem (tmpfs) used for shared memory.
Docker limits it by quite a bit (the default should be 64MB, not GB), so you might want to start your container with --ipc=host (or increase the shared memory size directly via --shm-size).
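As a side note: judging from the traceback, the failure happens while the pool sends the Data result back to the parent process, and that transfer is what goes through shared memory. If you don’t actually need the graph objects in the parent, another option might be to return only the saved path from the worker; a rough sketch (reusing the imports and helpers from your snippet) could look like:

def to_disk(r):
    # Save the graph, but return only the output path so no tensors are
    # sent back through the pool's shared-memory tensor transport.
    path = '/data/protein_data_dir/raw/' + r['ID'] + '.pt'
    torch.save(convert_nx_to_pyg(create_graph(r)), path)
    return path

with mp.Pool(NUM_CORE) as pool:
    paths = list(tqdm(pool.imap_unordered(to_disk, rows), total=len(rows)))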

Ahh, 64MB makes more sense. It’s been running for many iterations now, so I think that was probably the issue. Thanks!