Valgrind flagging an issue

This may ultimately be a non-issue, but I've come to take Valgrind analyses quite seriously. I am running a batched optimization (CPU only) on Linux, using the C++ frontend of a PyTorch 1.5 build. Upon initialization I get

==6595== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==6595== at 0x7316989: syscall (in /lib64/libc-2.30.so)
==6595== by 0x1585EE11: __kmp_affinity_determine_capable (in /usr/lib64/libomp.so)
==6595== by 0x1581B61F: __kmp_env_initialize(char const*) (in /usr/lib64/libomp.so)
==6595== by 0x15807BAE: __kmp_do_serial_initialize() (in /usr/lib64/libomp.so)
==6595== by 0x15808374: __kmp_get_global_thread_id_reg (in /usr/lib64/libomp.so)
==6595== by 0x157F4DE6: __kmpc_global_thread_num (in /usr/lib64/libomp.so)
==6595== by 0xFA6AF05: at::native::randperm_out_cpu(at::Tensor&, long, at::Generator*) (in /usr/lib64/libtorch_cpu.so)
==6595== by 0xFD20856: at::CPUType::randperm_out_generator_out(at::Tensor&, long, at::Generator*) (in /usr/lib64/libtorch_cpu.so)
==6595== by 0xFA7AA40: at::Tensor& c10::KernelFunction::callUnboxed<at::Tensor&, at::Tensor&, long, at::Generator*>(c10::OperatorHandle const&, at::Tensor&, long, at::Generator*) const (in /usr/lib64/libtorch_cpu.so)
==6595== by 0xFA6A9DF: at::native::randperm(long, at::Generator*, c10::TensorOptions const&) (in /usr/lib64/libtorch_cpu.so)
==6595== by 0xFA6A89D: at::native::randperm(long, c10::TensorOptions const&) (in /usr/lib64/libtorch_cpu.so)
==6595== by 0xFE2B5A4: at::TypeDefault::randperm(long, c10::TensorOptions const&) (in /usr/lib64/libtorch_cpu.so)
==6595== Address 0x0 is not stack'd, malloc'd or (recently) free'd
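
For reference, the program is not doing anything exotic at that point. A minimal sketch along the following lines (not my actual code, just the smallest thing I can think of that reaches torch::randperm, the call at the bottom of the trace, and hence libomp's serial initialization) should be enough to get the same report when run under valgrind:

    // minimal_repro.cpp - hypothetical repro sketch, built against the same
    // libtorch 1.5 CPU-only build.
    #include <torch/torch.h>
    #include <iostream>

    int main() {
        // The first ATen op that uses OpenMP forces libomp's serial
        // initialization, which is where the sched_setaffinity probe happens.
        torch::Tensor perm = torch::randperm(10);
        std::cout << perm << std::endl;
        return 0;
    }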

When I step past the error, execution appears to continue without any issues. (At least nothing crashes and burns.) The complaint is triggered deep inside libomp.so, which may mean it's not really a PyTorch issue at all, in which case it would have to be taken upstream.
I would just ignore this, except that the man page for sched_setaffinity says nothing about what happens when the mask is a nullptr.
Does anyone have a sense whether such seemingly “benign” complaints (it is hard to regard null pointers being passed around as benign) are SOP under these conditions?

Thanks,
Eric

@ericrhenry
To me this looks like it is caused by libomp, which we use for parallel processing when available. The message above is a potential memory-leak report (it might not be a real problem); I have seen the same error in other forums and GitHub discussions. It should not affect your usage at this point.

Can you create an issue in the PyTorch repo? https://github.com/pytorch/pytorch/issues
We can triage it later.

This specific Valgrind message is not about a memory leak; rather, it flags inaccessible user-space data (in this case a null pointer). The kernel source for the routine suggests that the pointer should be dereferenced while copying the mask from user space. Not sure why it doesn't fault…
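
If it helps to see it in isolation, a standalone probe along these lines (not taken from my run; it goes through the glibc wrapper rather than the raw syscall() shown in the trace, and the pid-0/null-mask arguments are assumptions mirroring the report) should come back with -1 and errno set to EFAULT rather than faulting, since the kernel's copy_from_user rejects a bad user pointer instead of dereferencing it:

    // affinity_probe.cpp - hypothetical standalone check, separate from the
    // PyTorch run above; compile with g++ and run directly or under valgrind.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    int main() {
        // Deliberately pass a null mask, mirroring what the trace suggests
        // libomp does while probing the affinity mask size. The expectation
        // is -1 with errno == EFAULT, not a crash.
        errno = 0;
        int rc = sched_setaffinity(0, sizeof(cpu_set_t), nullptr);
        std::printf("rc=%d errno=%d (%s)\n", rc, errno, std::strerror(errno));
        return 0;
    }

Presumably valgrind would flag this little program with the same “points to unaddressable byte(s)” message, for the same reason.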

I did the run with leak checking enabled, and all of the leaks reported at the end of the run, some “definitely lost” and some “possibly lost”, were associated with __kmp_xxx calls in libomp. I don't know whether some or all of these are due to incomplete thread shutdowns, or whether these calls are simply not covered by Valgrind's extensive set of leak-check heuristics (which I set to “all”, i.e. --leak-check-heuristics=all).

You are correct, it appears not to affect usage. I am a bit fanatical about memory leaks, but if these are things that happen only once per execution, I can probably live with it.

I will look into creating an issue, but I am pretty sure this is not a PyTorch problem per se.

Thanks,
Eric