I am encountering some particularly strange behavior with very simple usage of nn.Linear when using CUDA. I wanted to post and see if others have encountered similar behavior. I've posted some information about my environment below the example.
Example:
I start with a trivial nn.Linear and a Tensor, both on CUDA.
import torch
import torch.nn as nn

# a 1 -> 1 linear layer with no bias, moved to the GPU
linear = nn.Linear(1, 1, bias=False).to('cuda:0')
# a (3, 1) batch of ones on the same GPU
x = torch.ones(3, 1, device='cuda:0')
There's only one parameter for linear:
>>> linear.state_dict()
OrderedDict([('weight', tensor([[0.3293]], device='cuda:0'))])
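For reference, since there's no bias, the layer should just compute x @ W^T, so every row of the output ought to come out as 0.3293. Here's a quick sketch of that expectation (done on CPU copies so the GPU math isn't involved):

# Sanity check of the math itself, on CPU copies: with no bias, linear(x)
# is just x @ W^T, so every row should equal the lone weight (~0.3293).
expected = x.cpu() @ linear.weight.detach().cpu().t()
print(expected)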
This default parameter is fine for the example. The crazy behavior appears when we use linear.forward:
>>> linear(x) # This call returns nonsense
tensor([[0.0000],
[1.8750],
[1.0000]], device='cuda:0', grad_fn=<MmBackward>)
(Side note: if I repeatedly call linear(x), the result can fluctuate dramatically.)
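A rough sketch of what I mean, in case anyone wants to reproduce the fluctuation (the printed values differ from call to call on my setup):

# Run the identical forward pass on the GPU several times; nothing is being
# updated, yet the values change between iterations on this instance.
with torch.no_grad():
    for i in range(5):
        print(i, linear(x).flatten().tolist())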
Just as a reality-check, when running on CPU, the result comes out correctly:
>>> linear.to('cpu')
>>> x = x.cpu()
>>> linear(x)
tensor([[0.3293],
[0.3293],
[0.3293]], grad_fn=<MmBackward>)
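For completeness, here's a minimal way to compare the two devices directly (picking up the CPU copies from the snippet above):

# On a healthy setup torch.allclose should return True here; on this
# instance the CUDA result disagrees with the CPU result.
with torch.no_grad():
    out_cpu = linear(x)                        # linear and x are on CPU at this point
    out_gpu = linear.to('cuda:0')(x.to('cuda:0'))
print(torch.allclose(out_cpu, out_gpu.cpu()))  # prints False on my machine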
Environment
I am using an EC2 instance of type p2.xlarge. I haven't made any changes to the environment; it's a newly spun-up instance and I'm using the default virtual environment with source activate pytorch_latest_p37. The GPU is a Tesla K80. Here is the output from nvidia-smi:
$ nvidia-smi
Fri Jul 2 16:04:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 38C P0 54W / 149W | 500MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12914 C ...rch_latest_p37/bin/python 497MiB |
+-----------------------------------------------------------------------------+
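In case it helps with diagnosis, here is how I'd pull the corresponding version info from inside the same environment (all standard torch attributes, nothing custom):

import torch

# What the PyTorch wheel and the visible GPU report, in case there is a
# mismatch between the build's CUDA support and the K80.
print(torch.__version__)                    # PyTorch version inside pytorch_latest_p37
print(torch.version.cuda)                   # CUDA version the wheel was built against
print(torch.cuda.get_device_name(0))        # should say 'Tesla K80'
print(torch.cuda.get_device_capability(0))  # the K80 is compute capability (3, 7)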
Thanks in advance for any suggestions about what could be going on here, and sorry if I’m making some silly obvious mistake.