Sharing tensors across CPU and GPU?

TL;DR: I seem to be confused about what’s required to make the data and the
model agree on where the (symbolic) tensors live, so that I can actually
make use of GPU speedups.

I’m trying to set up a common environment across OSX on M2 Apple Silicon
hardware and Linux on an i7 with an RTX 3050.

I’m using the
Text classification from scratch
example to drive my testing of what CPU vs GPU performance looks like.

I’m using Keras3, and specifying os.environ["KERAS_BACKEND"] = "torch". Package version specifics listed below.

I’m using DEVICE = torch.device("cpu") or DEVICE = torch.device("cuda")
to switch between the alternatives on the Linux box, and
DEVICE = torch.device("cpu") or DEVICE = torch.device("mps") on
OSX. So I have four experiments: CPU vs GPU on OSX, and CPU vs GPU
on Linux.

The results table:

  Host   Device  Result
  Linux  CPU     Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
  Linux  CUDA    Works!
  OSX    CPU     Works!
  OSX    MPS     Placeholder storage has not been allocated on MPS device!

“Works!” means the model.fit() method runs as expected. On both OSX and Linux,
the issue arises in torch.embedding()

  • On Linux using CPU:
  DEVICE=cpu type=<class 'torch.device'>
  ...
  Traceback (most recent call last):
  File ".../torch-kerasNLP.py", line 183, in <module>
  main()
  File ".../torch-kerasNLP.py", line 154, in main
  model.fit(train_ds, validation_data=val_ds, epochs=epochs)
  File ".../lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
  raise e.with_traceback(filtered_tb) from None
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/functional.py", line 2237, in embedding
  return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: Exception encountered when calling Embedding.call().

  Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

  Arguments received by Embedding.call():
  • inputs=torch.Tensor(shape=torch.Size([32, 500]), dtype=int64)

This is especially confusing: I am explicitly specifying CPU as the device, yet torch is still pulling a CUDA device into the picture?

  • On OSX using MPS:
  DEVICE=mps type=<class 'torch.device'>
  ...
  Traceback (most recent call last):
  File ".../torch-kerasNLP.py", line 178, in <module>
  main()
  File ".../torch-kerasNLP.py", line 149, in main
  model.fit(train_ds, validation_data=val_ds, epochs=epochs)
  File ".../lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
  raise e.with_traceback(filtered_tb) from None
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
  return forward_call(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
  return forward_call(*args, **kwargs)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
  return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: Exception encountered when calling Embedding.call().

  Placeholder storage has not been allocated on MPS device!

  Arguments received by Embedding.call():
  • inputs=torch.Tensor(shape=torch.Size([32, 500]), dtype=int64)

Are my issues with torch? with keras? thanks for any suggestions, Rik

  • Linux environment:
  >>> keras.__version__
  '3.0.5'
  >>> tensorflow_text.__version__
  '2.15.0'
  >>> keras_nlp.__version__
  '0.7.0'
  >>> torch.__version__
  '2.0.1'
  >>> torchtext.__version__
  '0.15.2'
  >>> tensorflow.__version__
  '2.15.0'
  • OSX environment
  >>> keras.__version__
  '3.0.5'
  >>> tensorflow_text.__version__
  '2.15.0'
  >>> keras_nlp.__version__
  '0.7.0'
  >>> torch.__version__
  '2.1.0.post100'
  >>> torchtext.__version__
  '0.16.1'
  >>> tensorflow.__version__
  '2.15.0'

Based on the error message for Linux on CPU, it seems the model was moved to the GPU while the inputs are still on the CPU. I don’t know whether Keras or another part of your code moves things to the GPU behind your back when one is available, so you might need to double-check this.

It would be useful if you could share your code, at least the part where you are loading your model.

If your code sends the model to the GPU at any point, all of the inputs are expected to be on the GPU as well. If you do not send the model to the GPU, you should not see the GPU being loaded with the model, and the memory on your GPU should not change when running on the CPU.

You can run a test while watching the GPU memory to identify when the model is being loaded onto the GPU, and start debugging from there.
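
For example, something along these lines (a rough sketch; model and DEVICE are the ones from your script) will show whether model.to(DEVICE) changes the allocated CUDA memory:

  import torch

  def report_cuda_memory(tag):
      # print the memory currently allocated on the default CUDA device
      if torch.cuda.is_available():
          mib = torch.cuda.memory_allocated() / 1024**2
          print(f"{tag}: {mib:.1f} MiB allocated on cuda")
      else:
          print(f"{tag}: no CUDA device visible")

  report_cuda_memory("before model.to(DEVICE)")
  model.to(DEVICE)  # if DEVICE is cuda, the allocated memory should jump here
  report_cuda_memory("after model.to(DEVICE)")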

I don’t suggest running two experiments at the same time if you only have one GPU (at least at the beginning of the tests). I would use a Singularity or Docker container to run multiple experiments, but only if I had multiple GPUs.

hi @ptrblck and @eaah, thanks for your help.

i knew i should have made the code accessible, but didn’t know where this forum likes examples to be pasted (didn’t see this FAQ). anyway, here it is: rikHak/torch-kerasNLP-pub.py at master · rbelew/rikHak · GitHub

@ptrblck, re: Linux+CPU, how is it you know it is the MODEL that’s on the GPU? i have an explicit model.to(DEVICE) before compiling it?

That is correct. The model.to(DEVICE) line sends the model to the GPU, meaning the model weights and computations now live on the graphics card. To perform mathematical operations with your inputs, they need to be on the GPU as well.

When you run model.to(DEVICE) you can see the memory in your GPU change as the model has now been loaded to the GPU.

In line 135 you have model.fit(train_ds, validation_data=val_ds, epochs=epochs), but train_ds is still on the CPU.

To run everything on the CPU, remove the model.to(DEVICE).
To run on the GPU, move the dataset train_ds to the GPU: try train_ds.to(DEVICE) or train_ds.cuda().
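
(A note on that last suggestion: since train_ds comes from keras.utils.text_dataset_from_directory() it is a tf.data pipeline, which has no .to() method itself, so moving it would mean converting each batch; a rough sketch, where train_ds and DEVICE are the ones from the script:)

  import torch

  # sketch only: convert each tf.data batch to torch tensors on the target device
  for x_batch, y_batch in train_ds:
      x = torch.as_tensor(x_batch.numpy(), device=DEVICE)
      y = torch.as_tensor(y_batch.numpy(), device=DEVICE)
      # ...feed x and y to the model from here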

The model.to(DEVICE) line sends the model to the GPU.

despite the fact that on line#24 i have DEVICE = torch.device("cpu") ?!

Oh, apologies!! I can see you are choosing the device based on the host. It should be choosing the ‘cpu’ device, but I would find the model.to(DEVICE) line redundant when operating on the CPU. Have you tried removing that line, just to test?

You should check which devices are available.

If torch.cuda.is_available() returns False you should be running on the CPU,
and torch.cuda.device_count() should be 0.

torch.cuda.current_device() tells you which CUDA device is currently selected, when one is available.

You can also check the .device attribute of all the parameters and see where they are allocated.
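
Something like this rough sketch prints both the CUDA status and where each parameter lives (model here is the Keras model from your script; with the torch backend it should expose its weights as torch parameters):

  import torch

  print("cuda available:", torch.cuda.is_available())
  print("cuda device count:", torch.cuda.device_count())
  if torch.cuda.is_available():
      print("current cuda device:", torch.cuda.current_device())

  # where does each model parameter actually live?
  for name, p in model.named_parameters():
      print(name, p.device)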

I reviewed this post:

https://medium.com/@btatmaja/trying-keras-core-with-pytorch-backend-4a643275911f

here they don’t seem to be using model.to(DEVICE) with keras_core

@eaah Hah! indeed, commenting out the model.to(DEVICE) line lets the CPU version proceed. so that seems like a bug (in keras?) to report.

now, down to just the OSX+MPS issue:).

thanks for your help.


re: OSX+MPS, it seems clear the issue is that the IMDB tensors are on the CPU, not the GPU/MPS.

The MPS notes show moving the model to MPS and creating raw tensors with device=,
but the keras.utils.text_dataset_from_directory() utility doesn’t accept a device= argument, and neither does the .map(vectorize_text) step?
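
i.e., the pattern the MPS notes describe is roughly this (a plain-PyTorch sketch on a toy model, not my Keras code; it needs an MPS-capable torch build):

  import torch

  mps_device = torch.device("mps")

  # pattern from the MPS notes: the model and raw tensors are placed
  # on the MPS device explicitly
  model = torch.nn.Linear(8, 2).to(mps_device)
  x = torch.randn(4, 8, device=mps_device)
  print(model(x).device)  # mps:0

  # ...but keras.utils.text_dataset_from_directory() and .map(vectorize_text)
  # return a tf.data pipeline, with no device= argument to do the same thing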