Hello,
newbie here in dire need of help.
The problem
I just got a brand new RTX 2070 8 GB and while it’s certainly fast, I don’t seem to be able to utilize its entire memory capacity. I’ve noticed this while running models but realized I need a more objective way to test it.
What I’ve tried
I made a tiny Jupyter notebook in which I create tensors of a given size on the GPU. This way I can precisely check how much memory I can use. To make the example below clearer, let me just mention that I also have a GTX 1060 6 GB and I’m running Windows 10. I’ve replicated the problem on both the nightly and stable versions of PyTorch 1.0 CUDA 10.
The approach was to find the maximum tensor size that fits on each card and see whether I can fill the GPU memory this way. I’m printing some CUDA memory metrics, but since I’m fairly (make that extremely) new to this, I’m also monitoring GPU memory usage on both cards with GPU-Z. The 1060 is the primary card and usually has about 810 MB of memory occupied; the idle 2070 has only 4 MB occupied.
import torch
def device_info(device):
    # Just for showing info
    device_name = torch.cuda.get_device_name(device)
    print(f"Info on {device_name}")
    print(f"CUDA capability {torch.cuda.get_device_capability(device)}")
    print(f"Maximum GPU memory usage by tensors on {device_name}: {torch.cuda.max_memory_allocated(device)/1e9} GB")
    print(f"Current GPU memory usage by tensors on {device_name}: {torch.cuda.memory_allocated(device)/1e9} GB")
    print(f"Maximum GPU cached memory usage on {device_name}: {torch.cuda.max_memory_cached(device)/1e9} GB")
    print(f"Current GPU cached memory usage on {device_name}: {torch.cuda.memory_cached(device)/1e9} GB\n\n")

print(f"Using cuDNN version {torch.backends.cudnn.version()}")
# n_2070 = int(1.69605e9) # Does not work and produces amusing reason for not working
n_2070 = int(1.696e9) # Maximum size that works
n_1060 = int(1.275e9)
device_2070 = torch.device("cuda:0")
device_1060 = torch.device("cuda:1")
print(f"Estimated tensor size for 2070: {(n_2070 * 4 / 1e9):.3f} GB")
print(f"Estimated tensor size for 1060: {(n_1060 * 4 / 1e9):.3f} GB")
Output:
Using cuDNN version 7401
Estimated tensor size for 2070: 6.784 GB
Estimated tensor size for 1060: 5.100 GB
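For clarity, those size estimates come from simple arithmetic: a dense float32 tensor of n elements occupies n × 4 bytes.

```python
def tensor_size_gb(n_elements, bytes_per_element=4):
    """Estimated size in GB of a dense tensor (float32 = 4 bytes per element)."""
    return n_elements * bytes_per_element / 1e9

# The two element counts used above:
print(f"{tensor_size_gb(int(1.696e9)):.3f} GB")  # 6.784 GB (2070)
print(f"{tensor_size_gb(int(1.275e9)):.3f} GB")  # 5.100 GB (1060)
```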
torch.empty((n_1060, 1), device=device_1060)
torch.empty((n_2070, 1), device=device_2070)
device_info(device_2070)
device_info(device_1060)
Output:
Info on GeForce RTX 2070
CUDA capability (7, 5)
Maximum GPU memory usage by tensors on GeForce RTX 2070: 6.784156672 GB
Current GPU memory usage by tensors on GeForce RTX 2070: 6.784024576 GB
Maximum GPU cached memory usage on GeForce RTX 2070: 6.785073152 GB
Current GPU cached memory usage on GeForce RTX 2070: 6.785073152 GB
Info on GeForce GTX 1060 6GB
CUDA capability (6, 1)
Maximum GPU memory usage by tensors on GeForce GTX 1060 6GB: 5.10001152 GB
Current GPU memory usage by tensors on GeForce GTX 1060 6GB: 0.0 GB
Maximum GPU cached memory usage on GeForce GTX 1060 6GB: 5.10001152 GB
Current GPU cached memory usage on GeForce GTX 1060 6GB: 5.10001152 GB
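As an aside, the trial and error I did to find n_2070 could be automated with a simple binary search. Here’s a sketch with the CUDA allocation mocked out — the fake_alloc stand-in and its hidden limit are made up for illustration; on a real GPU you’d instead wrap torch.empty in a try/except around the OOM RuntimeError:

```python
def max_allocatable(try_alloc, lo=0, hi=2_000_000_000):
    """Binary-search the largest element count n for which try_alloc(n) succeeds."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if try_alloc(mid):
            lo = mid       # mid elements fit; try larger
        else:
            hi = mid - 1   # mid elements fail; try smaller
    return lo

# Stand-in for the real allocation attempt. On an actual GPU this would try
# torch.empty((n, 1), device="cuda:0") and return False on a CUDA OOM error.
# The hidden limit below just mimics my 2070's observed maximum.
LIMIT = 1_696_000_000  # hypothetical cap for illustration
def fake_alloc(n):
    return n <= LIMIT

print(max_allocatable(fake_alloc))  # 1696000000
```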
Further remarks
According to GPU-Z, the 2070 now has 6831 MB occupied and the 1060 has 5942 MB occupied. In other words, the 1060 used essentially all of its memory, while the 2070 still has more than 1 GB free.
If I uncomment the first line, n_2070 = int(1.69605e9), I get the following mind-boggling error:
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 8.00 GiB total capacity; 6.32 GiB already allocated; 1.85 MiB free; 0 bytes cached)
A few questions here:
- Why are there only 1.85 MiB free when the card has 8 GiB of total capacity and only 6.32 GiB are allocated?
- And if 1.85 MiB are free, why can’t a 1 MiB allocation succeed?
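To spell out why the first question bugs me, here is the arithmetic straight from the error message — roughly 1.68 GiB is reported as neither allocated nor free:

```python
# Numbers taken directly from the error message, converted to GiB:
total_gib = 8.00
allocated_gib = 6.32
free_gib = 1.85 / 1024  # 1.85 MiB expressed in GiB

unaccounted_gib = total_gib - allocated_gib - free_gib
print(f"Neither allocated nor free: {unaccounted_gib:.2f} GiB")  # ~1.68 GiB
```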
Other things I tried
- I want to emphasize that I tried my best to find an answer to this problem before posting but couldn’t find one.
- I tried removing the 1060 and using the 2070 as the only (primary) card. I removed the drivers and performed a fresh install, and I also deleted the entire environment and recreated it. No matter what I did, I could never get past 7 GB of memory usage on the 2070 according to GPU-Z, while I could easily fill the 1060’s memory.
- As mentioned in the beginning of the post, I tried both stable and nightly versions of PyTorch 1.0 CUDA 10.
Please advise. Is the card at fault? Should I return it? Is there anything to do? I’m getting quite desperate with this situation.
Best,
Mircea