[GLOW] Memory leaking in opencl and build issues

marijnfs · November 30, 2018, 11:48am

I’m having memory issues running the mnist and cifar10 examples, and I had problems compiling which might be related.

The memory issue is that when I run e.g. the cifar10 training example with -opencl, the GPU memory usage grows at a rate of 200MB per second (looking at nvidia-smi) until the lack of memory crashes the program.

I tried updating my nvidia libraries and opencl libraries but the behaviour is the same.
Any ideas how to debug this problem? The mnist example actually makes it to the final predictions and they are right, so the program is running correctly in that respect.
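
One way to put a number on the growth described above is to poll nvidia-smi at a fixed interval and compute the rate. The following is only a minimal sketch: it assumes nvidia-smi is on the PATH, and the function names (parse_memory_used, growth_rate, monitor) are made up for illustration and are not part of Glow. It does not locate the leak; it only gives a repeatable measurement for comparing driver versions or builds.

```python
import subprocess
import time


def parse_memory_used(output):
    """Parse the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def growth_rate(samples):
    """Average growth in MiB/s between the first and last (seconds, MiB) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (m1 - m0) / (t1 - t0)


def read_memory_used():
    """Ask nvidia-smi for the memory currently used on each GPU (MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)


def monitor(duration_s=10.0, interval_s=1.0, gpu_index=0):
    """Sample one GPU's used memory until duration_s has elapsed."""
    samples = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        samples.append((elapsed, read_memory_used()[gpu_index]))
        if elapsed >= duration_s:
            break
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Start the training example in another terminal, then run this script.
    samples = monitor()
    print(f"average growth: {growth_rate(samples):.1f} MiB/s")
```

Running this while the example trains on the CPU backend and again with -opencl would show whether the growth is specific to the OpenCL path.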

The build issues I had might be related. First, for some reason cmake found llvm-link-6.0 while the rest of the build found llvm-7.0, so I had to manually set it to llvm-link-7 (I’m running Debian, btw).
Then I had to make sure that the code inside the FACEBOOK_INTERNAL && LLVM_VERSION_PATCH < 20181009 block is used instead of the regular llvm-7 code. It seems that the LLVM_VERSION_PATCH variable is not set, even though I apparently need it.

Any help would be appreciated! The glow library seems otherwise perfect to use!

-Marijn

albanD · November 30, 2018, 1:35pm

Hi,

What example are you talking about exactly? What is this -opencl option supposed to do?

marijnfs · November 30, 2018, 1:38pm

Hi, sorry, I should have been clearer. I’m talking about the Glow framework (which I guess is the backend of pytorch?). When you build that framework (from https://github.com/pytorch/glow), it builds several binaries that implement some simple training examples, but I get these memory leaks when I run them with ‘-opencl’, a flag that selects the OpenCL backend (as opposed to the regular CPU backend).

I’m running the latest nvidia driver and have tried different versions, since I believe OpenCL driver support is notorious for such issues, but I would really like to find a fix for this.

albanD · November 30, 2018, 1:53pm

It’s not actually the backend for pytorch; I think it’s more caffe2 related?
I’m not sure who is knowledgeable about this, @smth might know who to ask?

marijnfs · November 30, 2018, 2:01pm

I see, yeah, I’m not sure where it fits in the ecosystem, just that they point to this discussion forum to discuss issues. There is an interesting talk about it, BTW: https://www.youtube.com/watch?v=cTz7c5dn5Gc

Bert_Maher · November 30, 2018, 5:58pm

Thanks for reporting this – we can continue the discussion on the GitHub issue (https://github.com/pytorch/glow/issues/2104).

Just to give you an idea of where Glow fits in the stack, it’s an optional, experimental backend for Caffe2. It can also (sort of) be used as a standalone framework, but it’s really intended to sit underneath C2 and provide a backend for hardware accelerators.