Currently, I’m training models with the ‘deepxde’ package in a new Linux environment with an RTX 4090. According to nvidia-smi, the GPU memory usage and the process status look normal, but the actual training speed is unexpectedly slow, with GPU utilization below 10%.
Here’s what confuses me:
I used to run the exact same code on Windows 11 with an RTX 4060, and it trained at approximately 6 iterations per second with GPU utilization near 100%. With the 4090 on Linux, it only manages 2 to 3 iterations per second, which is much slower than the 4060 and close to what I’d expect without CUDA at all.
I’ve also tried training a simple model without ‘deepxde’, and both the training speed and the GPU utilization were back to normal, so I believe the underperformance has something to do with this package, but I’m having trouble finding the exact cause.
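The sanity check was essentially a plain PyTorch training loop along these lines (a simplified sketch, not my exact script; the model itself doesn’t matter much):

import time
import torch

device = torch.device("cuda")
net = torch.nn.Sequential(
    torch.nn.Linear(2, 128), torch.nn.Tanh(),
    torch.nn.Linear(128, 128), torch.nn.Tanh(),
    torch.nn.Linear(128, 1),
).to(device)
opt = torch.optim.Adam(net.parameters(), lr=5e-4)

# Random data is enough to measure throughput
x = torch.rand(5000, 2, device=device)
y = torch.rand(5000, 1, device=device)

t0 = time.time()
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((net(x) - y) ** 2)
    loss.backward()
    opt.step()
torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
print(f"{200 / (time.time() - t0):.1f} it/s")

With this, the 4090 shows the expected throughput and near-full utilization.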
Here’s the code I’m running:
"""Backend supported: tensorflow.compat.v1, tensorflow, pytorch, paddle"""
import deepxde as dde
import matplotlib.pyplot as plt
import numpy as np
from tools import GetMetrics, set_server_gpu
from ADR_solver import solve_ADR
# PDE
def pde(x, y, v):
    # x = (x, t): column 0 is the spatial coordinate, column 1 is time
    D = 0.01
    k = 0.01
    dy_t = dde.grad.jacobian(y, x, j=1)  # dy/dt
    dy_xx = dde.grad.hessian(y, x, j=0)  # d2y/dx2
    return dy_t - D * dy_xx + k * y ** 2 - v
# set_server_gpu()
geom = dde.geometry.Interval(0, 1)
timedomain = dde.geometry.TimeDomain(0, 1)
geomtime = dde.geometry.GeometryXTime(geom, timedomain)
bc = dde.icbc.DirichletBC(geomtime, lambda _: 0, lambda _, on_boundary: on_boundary)
ic = dde.icbc.IC(geomtime, lambda _: 0, lambda _, on_initial: on_initial)
pde = dde.data.TimePDE(
    geomtime,
    pde,
    [bc, ic],
    num_domain=200,
    num_boundary=40,
    num_initial=20,
    num_test=500,
)
# Function space
func_space = dde.data.GRF(length_scale=0.2)
# Data
eval_pts = np.linspace(0, 1, num=50)[:, None]
data = dde.data.PDEOperatorCartesianProd(
    pde, func_space, eval_pts, 1000, function_variables=[0], num_test=100, batch_size=50
)
# Net
net = dde.nn.DeepONetCartesianProd(
    [50, 128, 128, 128],
    [2, 128, 128, 128],
    "tanh",
    "Glorot normal",
)
model = dde.Model(data, net)
model.compile("adam", lr=0.0005)
losshistory, train_state = model.train(iterations=20)
dde.utils.plot_loss_history(losshistory)
func_feats = func_space.random(1)
xs = np.linspace(0, 1, num=100)[:, None]
v = func_space.eval_batch(func_feats, xs)[0]
x, t, u_true = solve_ADR(
    0,
    1,
    0,
    1,
    lambda x: 0.01 * np.ones_like(x),
    lambda x: np.zeros_like(x),
    lambda u: 0.01 * u ** 2,
    lambda u: 0.02 * u,
    lambda x, t: np.tile(v[:, None], (1, len(t))),
    lambda x: np.zeros_like(x),
    100,
    100,
)
u_true = u_true.T
plt.figure()
plt.imshow(u_true)
plt.colorbar()
v_branch = func_space.eval_batch(func_feats, np.linspace(0, 1, num=50)[:, None])
xv, tv = np.meshgrid(x, t)
x_trunk = np.vstack((np.ravel(xv), np.ravel(tv))).T
u_pred = model.predict((v_branch, x_trunk))
u_pred = u_pred.reshape((100, 100))
print(dde.metrics.l2_relative_error(u_true, u_pred))
plt.figure()
plt.imshow(u_pred)
plt.colorbar()
plt.show()
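For completeness: deepxde is running on its PyTorch backend here (the torch entries in the profiles below confirm this), selected via the DDE_BACKEND environment variable:

DDE_BACKEND=pytorch python diff_example.py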
I’ve also run bottleneck tests in both the 4060 and 4090 environments.
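Both reports come from PyTorch’s built-in torch.utils.bottleneck tool, invoked roughly like this:

python -m torch.utils.bottleneck diff_example.py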
On the 4090:
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.2.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.9 and CUDA 11.8.89
`pip3 list` truncated output:
numpy==1.26.4
torch==2.2.1+cu118
torchaudio==2.2.1+cu118
torchvision==0.17.1+cu118
triton==2.2.0
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
4427611 function calls (4360305 primitive calls) in 13.592 seconds
Ordered by: internal time
List reduced from 8900 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
2620 5.154 0.002 5.154 0.002 {method 'run_backward' of 'torch._C._EngineBase' objects}
99 0.793 0.008 0.796 0.008 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/numpy/linalg/linalg.py:329(solve)
57283 0.510 0.000 7.429 0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/torch/utils/_device.py:73(__torch_function__)
2202 0.454 0.000 0.733 0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/scipy/interpolate/_bsplines.py:1138(make_interp_spline)
9100 0.334 0.000 0.334 0.000 {method 'sub' of 'torch._C.TensorBase' objects}
24 0.259 0.011 6.809 0.284 /home/huangrui/PyProjs/AE_DON/deepxde/data/pde_operator.py:234(_losses)
7944/3972 0.223 0.000 0.233 0.000 {built-in method torch.mean}
3900 0.188 0.000 4.317 0.001 /home/huangrui/PyProjs/AE_DON/deepxde/gradients/gradients_reverse.py:36(__call__)
7800/3900 0.182 0.000 0.191 0.000 {built-in method torch.square}
7852/3926 0.181 0.000 0.191 0.000 {built-in method torch.zeros_like}
2600 0.174 0.000 0.174 0.000 {method 'mul' of 'torch._C.TensorBase' objects}
5220/2620 0.163 0.000 0.169 0.000 {built-in method torch.ones_like}
1502 0.147 0.000 0.147 0.000 {built-in method marshal.loads}
1300 0.128 0.000 5.200 0.004 diff_example.py:10(pde)
2202 0.128 0.000 0.129 0.000 /home/huangrui/.pyenv/versions/3.9.9/lib/python3.9/site-packages/scipy/interpolate/_bsplines.py:511(_evaluate)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Optimizer.step#Adam.step 6.85% 478.709ms 8.31% 581.015ms 581.015ms 1
Optimizer.step#Adam.step 6.85% 479.209ms 8.27% 578.315ms 578.315ms 1
Optimizer.step#Adam.step 6.82% 476.847ms 8.22% 574.519ms 574.519ms 1
Optimizer.step#Adam.step 6.69% 467.805ms 8.17% 571.262ms 571.262ms 1
Optimizer.step#Adam.step 6.75% 472.158ms 8.15% 570.242ms 570.242ms 1
Optimizer.step#Adam.step 6.66% 465.557ms 8.06% 563.433ms 563.433ms 1
Optimizer.step#Adam.step 6.64% 464.175ms 8.03% 561.308ms 561.308ms 1
Optimizer.step#Adam.step 6.67% 466.127ms 8.02% 561.162ms 561.162ms 1
Optimizer.step#Adam.step 6.62% 463.260ms 8.01% 560.369ms 560.369ms 1
Optimizer.step#Adam.step 6.61% 462.558ms 7.99% 558.609ms 558.609ms 1
Optimizer.step#Adam.step 6.61% 462.011ms 7.96% 556.407ms 556.407ms 1
Optimizer.step#Adam.step 6.50% 454.416ms 7.95% 556.286ms 556.286ms 1
Optimizer.step#Adam.step 6.61% 462.154ms 7.90% 552.577ms 552.577ms 1
Optimizer.step#Adam.step 6.64% 464.063ms 7.78% 543.918ms 543.918ms 1
Optimizer.step#Adam.step 6.49% 453.956ms 7.71% 538.991ms 538.991ms 1
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 6.993s
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Optimizer.step#Adam.step 6.83% 1.019s 8.42% 1.256s 1.256s 1.015s 6.85% 1.256s 1.256s 1
Optimizer.step#Adam.step 6.82% 1.018s 8.41% 1.254s 1.254s 1.003s 6.76% 1.254s 1.254s 1
Optimizer.step#Adam.step 6.73% 1.004s 8.38% 1.249s 1.249s 997.483ms 6.73% 1.249s 1.249s 1
Optimizer.step#Adam.step 6.77% 1.010s 8.36% 1.247s 1.247s 1.006s 6.78% 1.247s 1.247s 1
Optimizer.step#Adam.step 6.81% 1.015s 8.36% 1.246s 1.246s 1.003s 6.76% 1.246s 1.246s 1
Optimizer.step#Adam.step 6.77% 1.009s 8.31% 1.240s 1.240s 1.000s 6.75% 1.240s 1.240s 1
Optimizer.step#Adam.step 6.74% 1.005s 8.28% 1.234s 1.234s 1.003s 6.77% 1.234s 1.234s 1
Optimizer.step#Adam.step 6.69% 998.105ms 8.18% 1.220s 1.220s 989.922ms 6.68% 1.220s 1.220s 1
Optimizer.step#Adam.step 6.62% 987.582ms 8.15% 1.216s 1.216s 983.782ms 6.64% 1.216s 1.216s 1
Optimizer.step#Adam.step 6.63% 988.699ms 8.15% 1.216s 1.216s 987.912ms 6.66% 1.216s 1.216s 1
Optimizer.step#Adam.step 6.57% 978.974ms 8.06% 1.201s 1.201s 973.332ms 6.57% 1.201s 1.201s 1
Optimizer.step#Adam.step 6.66% 993.141ms 7.99% 1.191s 1.191s 988.879ms 6.67% 1.191s 1.191s 1
Optimizer.step#Adam.step 6.50% 968.859ms 7.96% 1.187s 1.187s 968.450ms 6.53% 1.187s 1.187s 1
Optimizer.step#Adam.step 6.50% 968.686ms 7.93% 1.182s 1.182s 960.467ms 6.48% 1.182s 1.182s 1
Optimizer.step#Adam.step 6.35% 947.565ms 7.85% 1.171s 1.171s 945.277ms 6.38% 1.171s 1.171s 1
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 14.911s
Self CUDA time total: 14.825s
On the 4060:
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.0.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.10 and
`pip3 list` truncated output:
numpy==1.24.1
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3686649 function calls (3614394 primitive calls) in 29.232 seconds
Ordered by: internal time
List reduced from 6861 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 10.774 10.774 11.364 11.364 {method 'mainloop' of '_tkinter.tkapp' objects}
99 10.200 0.103 10.202 0.103 D:\Python310\lib\site-packages\numpy\linalg\linalg.py:306(solve)
2620 3.403 0.001 3.403 0.001 {method 'run_backward' of 'torch._C._EngineBase' objects}
3545 0.279 0.000 0.279 0.000 {built-in method nt.stat}
153 0.242 0.002 0.243 0.002 {built-in method _imp.create_dynamic}
827 0.240 0.000 0.240 0.000 {built-in method io.open_code}
2202 0.168 0.000 0.400 0.000 D:\Python310\lib\site-packages\scipy\interpolate\_bsplines.py:1138(make_interp_spline)
1300 0.148 0.000 1.800 0.001 .\diff_example.py:10(pde)
1232 0.131 0.000 0.131 0.000 {method 'read' of '_io.BufferedReader' objects}
150 0.129 0.001 0.129 0.001 {built-in method torch._C._nn.linear}
3 0.126 0.042 0.126 0.042 {built-in method _tkinter.create}
792 0.120 0.000 0.130 0.000 {method 'call' of '_tkinter.tkapp' objects}
827 0.116 0.000 0.116 0.000 {built-in method marshal.loads}
24 0.097 0.004 2.285 0.095 D:\PyProjs\AE_TON\.\deepxde\data\pde_operator.py:234(_losses)
3972 0.092 0.000 0.092 0.000 {built-in method torch.mean}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Optimizer.step#Adam.step 7.28% 202.177ms 8.17% 226.776ms 226.776ms 1
Optimizer.step#Adam.step 6.93% 192.547ms 7.95% 220.917ms 220.917ms 1
Optimizer.step#Adam.step 6.84% 190.044ms 7.83% 217.360ms 217.360ms 1
Optimizer.step#Adam.step 6.86% 190.622ms 7.76% 215.607ms 215.607ms 1
Optimizer.step#Adam.step 6.76% 187.876ms 7.68% 213.243ms 213.243ms 1
Optimizer.step#Adam.step 6.73% 186.849ms 7.62% 211.565ms 211.565ms 1
Optimizer.step#Adam.step 6.72% 186.576ms 7.56% 209.966ms 209.966ms 1
Optimizer.step#Adam.step 6.65% 184.816ms 7.54% 209.471ms 209.471ms 1
Optimizer.step#Adam.step 6.52% 181.089ms 7.44% 206.591ms 206.591ms 1
Optimizer.step#Adam.step 6.58% 182.669ms 7.43% 206.222ms 206.222ms 1
Optimizer.step#Adam.step 6.57% 182.529ms 7.39% 205.190ms 205.190ms 1
Optimizer.step#Adam.step 6.48% 179.918ms 7.33% 203.684ms 203.684ms 1
Optimizer.step#Adam.step 6.35% 176.386ms 7.32% 203.275ms 203.275ms 1
Optimizer.step#Adam.step 6.37% 176.830ms 7.31% 203.081ms 203.081ms 1
Optimizer.step#Adam.step 6.35% 176.334ms 7.16% 198.871ms 198.871ms 1
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 2.777s
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Optimizer.step#Adam.step 7.21% 853.039ms 7.88% 931.945ms 931.945ms 838.745ms 7.24% 931.973ms 931.973ms 1
Optimizer.step#Adam.step 6.94% 820.409ms 7.63% 903.036ms 903.036ms 805.176ms 6.95% 903.063ms 903.063ms 1
Optimizer.step#Adam.step 6.86% 811.037ms 7.49% 886.203ms 886.203ms 795.693ms 6.87% 886.234ms 886.234ms 1
Optimizer.step#Adam.step 6.80% 803.784ms 7.48% 884.410ms 884.410ms 784.082ms 6.77% 884.437ms 884.437ms 1
Optimizer.step#Adam.step 6.75% 798.318ms 7.37% 872.318ms 872.318ms 783.933ms 6.76% 872.340ms 872.340ms 1
Optimizer.step#Adam.step 6.64% 785.027ms 7.29% 862.242ms 862.242ms 769.438ms 6.64% 862.277ms 862.277ms 1
Optimizer.step#Adam.step 6.64% 785.070ms 7.27% 859.652ms 859.652ms 770.147ms 6.65% 859.675ms 859.675ms 1
Optimizer.step#Adam.step 6.57% 777.579ms 7.25% 857.623ms 857.623ms 764.028ms 6.59% 857.648ms 857.648ms 1
Optimizer.step#Adam.step 6.55% 774.979ms 7.22% 853.951ms 853.951ms 761.479ms 6.57% 853.951ms 853.951ms 1
Optimizer.step#Adam.step 6.57% 776.879ms 7.20% 852.138ms 852.138ms 762.417ms 6.58% 852.034ms 852.034ms 1
Optimizer.step#Adam.step 6.50% 768.554ms 7.18% 849.427ms 849.427ms 749.704ms 6.47% 849.449ms 849.449ms 1
Optimizer.step#Adam.step 6.55% 774.228ms 7.17% 848.422ms 848.422ms 759.713ms 6.56% 848.422ms 848.422ms 1
Optimizer.step#Adam.step 6.54% 773.416ms 7.15% 845.252ms 845.252ms 756.634ms 6.53% 845.276ms 845.276ms 1
Optimizer.step#Adam.step 6.47% 764.719ms 7.12% 841.811ms 841.811ms 742.835ms 6.41% 841.830ms 841.830ms 1
Optimizer.step#Adam.step 6.44% 761.551ms 7.06% 834.725ms 834.725ms 744.437ms 6.42% 834.748ms 834.748ms 1
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 11.829s
Self CUDA time total: 11.588s
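Two things stand out to me in these reports: every Optimizer.step#Adam.step is considerably slower on the 4090 machine (roughly 1.2s vs 0.85s CPU total per step in CUDA mode), and the 4090’s cProfile shows over 57,000 calls to torch/utils/_device.py:__torch_function__ with 7.4s of cumulative time, which don’t appear in the 4060 profile at all. That looks like CPU-side per-op overhead rather than a GPU problem, but I can’t tell where it comes from. Any ideas what could cause deepxde specifically to underutilize the GPU on this machine?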