Low GPU utilization and unexpectedly slow training on an RTX 4090

I’m currently doing some training with the ‘deepxde’ package in a new Linux environment with an RTX 4090. According to nvidia-smi, the GPU memory usage and the process list look normal, but the actual training speed is unexpectedly slow, with GPU utilization below 10%.

Here’s what confuses me:
I used to run the exact same code on Windows 11 with an RTX 4060, where it trained at roughly 6 iterations per second with GPU utilization near 100%. With the 4090 on Linux, it only manages 2 to 3 iterations per second, which is unexpectedly much slower than the 4060 and close to what I would expect without CUDA at all.

I’ve tried training a simple model without ‘deepxde’, and both the training speed and the GPU utilization went back to normal, so I believe the underperformance has something to do with this package, but I’m having trouble pinpointing the exact cause.
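A minimal sanity check of this kind (illustrative sketch only, not my exact script; the layer sizes and iteration count are placeholders) looks roughly like:

import time
import torch

# Minimal sanity check: train a small MLP on random data purely to see
# whether the GPU reaches high utilization outside of deepxde.
device = torch.device("cuda")
net = torch.nn.Sequential(
    torch.nn.Linear(128, 512), torch.nn.Tanh(),
    torch.nn.Linear(512, 512), torch.nn.Tanh(),
    torch.nn.Linear(512, 1),
).to(device)
opt = torch.optim.Adam(net.parameters(), lr=5e-4)
x = torch.randn(4096, 128, device=device)
y = torch.randn(4096, 1, device=device)

start = time.time()
for _ in range(1000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
print("iterations per second:", 1000 / (time.time() - start))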

Here’s the code I’m running:

"""Backend supported: tensorflow.compat.v1, tensorflow, pytorch, paddle"""
import deepxde as dde
import matplotlib.pyplot as plt
import numpy as np
from tools import GetMetrics, set_server_gpu
from ADR_solver import solve_ADR


# PDE
def pde(x, y, v):
    D = 0.01
    k = 0.01
    dy_t = dde.grad.jacobian(y, x, j=1)
    dy_xx = dde.grad.hessian(y, x, j=0)
    return dy_t - D * dy_xx + k * y ** 2 - v


# set_server_gpu()

geom = dde.geometry.Interval(0, 1)
timedomain = dde.geometry.TimeDomain(0, 1)
geomtime = dde.geometry.GeometryXTime(geom, timedomain)

bc = dde.icbc.DirichletBC(geomtime, lambda _: 0, lambda _, on_boundary: on_boundary)
ic = dde.icbc.IC(geomtime, lambda _: 0, lambda _, on_initial: on_initial)

pde = dde.data.TimePDE(
    geomtime,
    pde,
    [bc, ic],
    num_domain=200,
    num_boundary=40,
    num_initial=20,
    num_test=500,
)

# Function space
func_space = dde.data.GRF(length_scale=0.2)

# Data
eval_pts = np.linspace(0, 1, num=50)[:, None]
data = dde.data.PDEOperatorCartesianProd(
    pde, func_space, eval_pts, 1000, function_variables=[0], num_test=100, batch_size=50
)

# Net
net = dde.nn.DeepONetCartesianProd(
    [50, 128, 128, 128],
    [2, 128, 128, 128],
    "tanh",
    "Glorot normal",
)

model = dde.Model(data, net)
model.compile("adam", lr=0.0005)
losshistory, train_state = model.train(iterations=20)
dde.utils.plot_loss_history(losshistory)

func_feats = func_space.random(1)
xs = np.linspace(0, 1, num=100)[:, None]
v = func_space.eval_batch(func_feats, xs)[0]
x, t, u_true = solve_ADR(
    0,
    1,
    0,
    1,
    lambda x: 0.01 * np.ones_like(x),
    lambda x: np.zeros_like(x),
    lambda u: 0.01 * u ** 2,
    lambda u: 0.02 * u,
    lambda x, t: np.tile(v[:, None], (1, len(t))),
    lambda x: np.zeros_like(x),
    100,
    100,
)
u_true = u_true.T
plt.figure()
plt.imshow(u_true)
plt.colorbar()

v_branch = func_space.eval_batch(func_feats, np.linspace(0, 1, num=50)[:, None])
xv, tv = np.meshgrid(x, t)
x_trunk = np.vstack((np.ravel(xv), np.ravel(tv))).T
u_pred = model.predict((v_branch, x_trunk))
u_pred = u_pred.reshape((100, 100))
print(dde.metrics.l2_relative_error(u_true, u_pred))
plt.figure()
plt.imshow(u_pred)
plt.colorbar()
plt.show()

I’ve also run bottleneck tests (torch.utils.bottleneck) in both the 4060 and the 4090 environments:
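The invocation was roughly as follows, assuming the posted script is the diff_example.py that appears in the profiles:

python -m torch.utils.bottleneck diff_example.py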

On the 4090:
--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.2.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.9 and CUDA 11.8.89

`pip3 list` truncated output:
numpy==1.26.4
torch==2.2.1+cu118
torchaudio==2.2.1+cu118
torchvision==0.17.1+cu118
triton==2.2.0
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         4427611 function calls (4360305 primitive calls) in 13.592 seconds

   Ordered by: internal time
   List reduced from 8900 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2620    5.154    0.002    5.154    0.002 {method 'run_backward' of 'torch._C._EngineBase' objects}
       99    0.793    0.008    0.796    0.008 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/numpy/linalg/linalg.py:329(solve)
    57283    0.510    0.000    7.429    0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/torch/utils/_device.py:73(__torch_function__)
     2202    0.454    0.000    0.733    0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/scipy/interpolate/_bsplines.py:1138(make_interp_spline)
     9100    0.334    0.000    0.334    0.000 {method 'sub' of 'torch._C.TensorBase' objects}
       24    0.259    0.011    6.809    0.284 /home/huangrui/PyProjs/AE_DON/deepxde/data/pde_operator.py:234(_losses)
7944/3972    0.223    0.000    0.233    0.000 {built-in method torch.mean}
     3900    0.188    0.000    4.317    0.001 /home/huangrui/PyProjs/AE_DON/deepxde/gradients/gradients_reverse.py:36(__call__)
7800/3900    0.182    0.000    0.191    0.000 {built-in method torch.square}
7852/3926    0.181    0.000    0.191    0.000 {built-in method torch.zeros_like}
     2600    0.174    0.000    0.174    0.000 {method 'mul' of 'torch._C.TensorBase' objects}
5220/2620    0.163    0.000    0.169    0.000 {built-in method torch.ones_like}
     1502    0.147    0.000    0.147    0.000 {built-in method marshal.loads}
     1300    0.128    0.000    5.200    0.004 diff_example.py:10(pde)
     2202    0.128    0.000    0.129    0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/scipy/interpolate/_bsplines.py:511(_evaluate)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    Optimizer.step#Adam.step         6.85%     478.709ms         8.31%     581.015ms     581.015ms             1
    Optimizer.step#Adam.step         6.85%     479.209ms         8.27%     578.315ms     578.315ms             1
    Optimizer.step#Adam.step         6.82%     476.847ms         8.22%     574.519ms     574.519ms             1
    Optimizer.step#Adam.step         6.69%     467.805ms         8.17%     571.262ms     571.262ms             1
    Optimizer.step#Adam.step         6.75%     472.158ms         8.15%     570.242ms     570.242ms             1
    Optimizer.step#Adam.step         6.66%     465.557ms         8.06%     563.433ms     563.433ms             1
    Optimizer.step#Adam.step         6.64%     464.175ms         8.03%     561.308ms     561.308ms             1
    Optimizer.step#Adam.step         6.67%     466.127ms         8.02%     561.162ms     561.162ms             1
    Optimizer.step#Adam.step         6.62%     463.260ms         8.01%     560.369ms     560.369ms             1
    Optimizer.step#Adam.step         6.61%     462.558ms         7.99%     558.609ms     558.609ms             1
    Optimizer.step#Adam.step         6.61%     462.011ms         7.96%     556.407ms     556.407ms             1
    Optimizer.step#Adam.step         6.50%     454.416ms         7.95%     556.286ms     556.286ms             1
    Optimizer.step#Adam.step         6.61%     462.154ms         7.90%     552.577ms     552.577ms             1
    Optimizer.step#Adam.step         6.64%     464.063ms         7.78%     543.918ms     543.918ms             1
    Optimizer.step#Adam.step         6.49%     453.956ms         7.71%     538.991ms     538.991ms             1
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.993s

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    Optimizer.step#Adam.step         6.83%        1.019s         8.42%        1.256s        1.256s        1.015s         6.85%        1.256s        1.256s             1
    Optimizer.step#Adam.step         6.82%        1.018s         8.41%        1.254s        1.254s        1.003s         6.76%        1.254s        1.254s             1
    Optimizer.step#Adam.step         6.73%        1.004s         8.38%        1.249s        1.249s     997.483ms         6.73%        1.249s        1.249s             1
    Optimizer.step#Adam.step         6.77%        1.010s         8.36%        1.247s        1.247s        1.006s         6.78%        1.247s        1.247s             1
    Optimizer.step#Adam.step         6.81%        1.015s         8.36%        1.246s        1.246s        1.003s         6.76%        1.246s        1.246s             1
    Optimizer.step#Adam.step         6.77%        1.009s         8.31%        1.240s        1.240s        1.000s         6.75%        1.240s        1.240s             1
    Optimizer.step#Adam.step         6.74%        1.005s         8.28%        1.234s        1.234s        1.003s         6.77%        1.234s        1.234s             1
    Optimizer.step#Adam.step         6.69%     998.105ms         8.18%        1.220s        1.220s     989.922ms         6.68%        1.220s        1.220s             1
    Optimizer.step#Adam.step         6.62%     987.582ms         8.15%        1.216s        1.216s     983.782ms         6.64%        1.216s        1.216s             1
    Optimizer.step#Adam.step         6.63%     988.699ms         8.15%        1.216s        1.216s     987.912ms         6.66%        1.216s        1.216s             1
    Optimizer.step#Adam.step         6.57%     978.974ms         8.06%        1.201s        1.201s     973.332ms         6.57%        1.201s        1.201s             1
    Optimizer.step#Adam.step         6.66%     993.141ms         7.99%        1.191s        1.191s     988.879ms         6.67%        1.191s        1.191s             1
    Optimizer.step#Adam.step         6.50%     968.859ms         7.96%        1.187s        1.187s     968.450ms         6.53%        1.187s        1.187s             1
    Optimizer.step#Adam.step         6.50%     968.686ms         7.93%        1.182s        1.182s     960.467ms         6.48%        1.182s        1.182s             1
    Optimizer.step#Adam.step         6.35%     947.565ms         7.85%        1.171s        1.171s     945.277ms         6.38%        1.171s        1.171s             1
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 14.911s
Self CUDA time total: 14.825s
On the 4060:
--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.0.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.10 and 

`pip3 list` truncated output:
numpy==1.24.1
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         3686649 function calls (3614394 primitive calls) in 29.232 seconds

   Ordered by: internal time
   List reduced from 6861 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.774   10.774   11.364   11.364 {method 'mainloop' of '_tkinter.tkapp' objects}
       99   10.200    0.103   10.202    0.103 D:\Python310\lib\site-packages\numpy\linalg\linalg.py:306(solve)
     2620    3.403    0.001    3.403    0.001 {method 'run_backward' of 'torch._C._EngineBase' objects}
     3545    0.279    0.000    0.279    0.000 {built-in method nt.stat}
      153    0.242    0.002    0.243    0.002 {built-in method _imp.create_dynamic}
      827    0.240    0.000    0.240    0.000 {built-in method io.open_code}
     2202    0.168    0.000    0.400    0.000 D:\Python310\lib\site-packages\scipy\interpolate\_bsplines.py:1138(make_interp_spline)
     1300    0.148    0.000    1.800    0.001 .\diff_example.py:10(pde)
     1232    0.131    0.000    0.131    0.000 {method 'read' of '_io.BufferedReader' objects}
      150    0.129    0.001    0.129    0.001 {built-in method torch._C._nn.linear}
        3    0.126    0.042    0.126    0.042 {built-in method _tkinter.create}
      792    0.120    0.000    0.130    0.000 {method 'call' of '_tkinter.tkapp' objects}
      827    0.116    0.000    0.116    0.000 {built-in method marshal.loads}
       24    0.097    0.004    2.285    0.095 D:\PyProjs\AE_TON\.\deepxde\data\pde_operator.py:234(_losses)
     3972    0.092    0.000    0.092    0.000 {built-in method torch.mean}


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    Optimizer.step#Adam.step         7.28%     202.177ms         8.17%     226.776ms     226.776ms             1
    Optimizer.step#Adam.step         6.93%     192.547ms         7.95%     220.917ms     220.917ms             1
    Optimizer.step#Adam.step         6.84%     190.044ms         7.83%     217.360ms     217.360ms             1
    Optimizer.step#Adam.step         6.86%     190.622ms         7.76%     215.607ms     215.607ms             1
    Optimizer.step#Adam.step         6.76%     187.876ms         7.68%     213.243ms     213.243ms             1
    Optimizer.step#Adam.step         6.73%     186.849ms         7.62%     211.565ms     211.565ms             1
    Optimizer.step#Adam.step         6.72%     186.576ms         7.56%     209.966ms     209.966ms             1
    Optimizer.step#Adam.step         6.65%     184.816ms         7.54%     209.471ms     209.471ms             1
    Optimizer.step#Adam.step         6.52%     181.089ms         7.44%     206.591ms     206.591ms             1
    Optimizer.step#Adam.step         6.58%     182.669ms         7.43%     206.222ms     206.222ms             1
    Optimizer.step#Adam.step         6.57%     182.529ms         7.39%     205.190ms     205.190ms             1
    Optimizer.step#Adam.step         6.48%     179.918ms         7.33%     203.684ms     203.684ms             1
    Optimizer.step#Adam.step         6.35%     176.386ms         7.32%     203.275ms     203.275ms             1
    Optimizer.step#Adam.step         6.37%     176.830ms         7.31%     203.081ms     203.081ms             1
    Optimizer.step#Adam.step         6.35%     176.334ms         7.16%     198.871ms     198.871ms             1
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.777s

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.

----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    Optimizer.step#Adam.step         7.21%     853.039ms         7.88%     931.945ms     931.945ms     838.745ms         7.24%     931.973ms     931.973ms             1
    Optimizer.step#Adam.step         6.94%     820.409ms         7.63%     903.036ms     903.036ms     805.176ms         6.95%     903.063ms     903.063ms             1
    Optimizer.step#Adam.step         6.86%     811.037ms         7.49%     886.203ms     886.203ms     795.693ms         6.87%     886.234ms     886.234ms             1
    Optimizer.step#Adam.step         6.80%     803.784ms         7.48%     884.410ms     884.410ms     784.082ms         6.77%     884.437ms     884.437ms             1
    Optimizer.step#Adam.step         6.75%     798.318ms         7.37%     872.318ms     872.318ms     783.933ms         6.76%     872.340ms     872.340ms             1
    Optimizer.step#Adam.step         6.64%     785.027ms         7.29%     862.242ms     862.242ms     769.438ms         6.64%     862.277ms     862.277ms             1
    Optimizer.step#Adam.step         6.64%     785.070ms         7.27%     859.652ms     859.652ms     770.147ms         6.65%     859.675ms     859.675ms             1
    Optimizer.step#Adam.step         6.57%     777.579ms         7.25%     857.623ms     857.623ms     764.028ms         6.59%     857.648ms     857.648ms             1
    Optimizer.step#Adam.step         6.55%     774.979ms         7.22%     853.951ms     853.951ms     761.479ms         6.57%     853.951ms     853.951ms             1
    Optimizer.step#Adam.step         6.57%     776.879ms         7.20%     852.138ms     852.138ms     762.417ms         6.58%     852.034ms     852.034ms             1
    Optimizer.step#Adam.step         6.50%     768.554ms         7.18%     849.427ms     849.427ms     749.704ms         6.47%     849.449ms     849.449ms             1
    Optimizer.step#Adam.step         6.55%     774.228ms         7.17%     848.422ms     848.422ms     759.713ms         6.56%     848.422ms     848.422ms             1
    Optimizer.step#Adam.step         6.54%     773.416ms         7.15%     845.252ms     845.252ms     756.634ms         6.53%     845.276ms     845.276ms             1
    Optimizer.step#Adam.step         6.47%     764.719ms         7.12%     841.811ms     841.811ms     742.835ms         6.41%     841.830ms     841.830ms             1
    Optimizer.step#Adam.step         6.44%     761.551ms         7.06%     834.725ms     834.725ms     744.437ms         6.42%     834.748ms     834.748ms             1
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 11.829s
Self CUDA time total: 11.588s

The output of the native profiler points to expensive CPU calls, e.g. in numpy and scipy.
It might be worth creating a visual timeline, e.g. with Nsight Systems, to verify that these host-side ops are the bottleneck and are starving your GPU.
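A quick way to get such a timeline without extra tooling is PyTorch’s built-in profiler; a minimal sketch (the short model.train call is just a placeholder for whatever run you want to capture):

from torch.profiler import profile, ProfilerActivity

# Record CPU and CUDA activity for a short training run, then open the
# exported trace in chrome://tracing or Perfetto to see where the GPU sits idle.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.train(iterations=20)  # placeholder: a few iterations under the profiler
prof.export_chrome_trace("trace.json")

# A text summary also helps spot host-bound hotspots:
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))

With Nsight Systems installed, something like nsys profile -o timeline python diff_example.py gives the same kind of picture at the driver level.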
