I have been training two neural networks on a computer with two graphics cards. Each PyTorch/Python session runs in its own instance of Spyder, and each runs on its own graphics card.
Training time per epoch on each card is consistently 14.xx seconds when the machine is NOT connected to the internet. However, as soon as it is connected to the internet, the time per epoch begins to alternate from one epoch to the next: 14.xx, then 16.xx, then 14.xx seconds. This occurs in both instances.
When I unplug the WAN connection to the internet (leaving the local network alone), training stabilizes again at 14.xx seconds per epoch for each card.
The system is a server running Ubuntu 22.04, and nothing else is running on it. The only change that affects performance is whether the local network is connected to the internet or not.
I suspect I am missing something very basic. Can anyone provide any insight into what is causing this?
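For anyone trying to reproduce the measurement, here is a minimal sketch of how the per-epoch times could be logged and compared. The names `time_epochs`, `summarize`, `run_epoch` and `sync` are hypothetical and not from my actual training script; `sync` is assumed to be something like `torch.cuda.synchronize`, so that queued GPU work is included in the timing.

```python
import statistics
import time


def time_epochs(run_epoch, num_epochs, sync=None):
    """Return the wall-clock seconds taken by each call to run_epoch()."""
    durations = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        run_epoch()
        if sync is not None:
            # Assumption: sync blocks until the GPU is idle,
            # e.g. torch.cuda.synchronize.
            sync()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Compare every other epoch to expose an alternating pattern.

    Needs at least two durations.
    """
    even = durations[0::2]  # epochs 0, 2, 4, ...
    odd = durations[1::2]   # epochs 1, 3, 5, ...
    mean_even = statistics.mean(even)
    mean_odd = statistics.mean(odd)
    return {
        "mean_even": mean_even,
        "mean_odd": mean_odd,
        "gap": abs(mean_even - mean_odd),
    }
```

A gap near zero means the epochs are steady; a gap of about two seconds would match the 14.xx / 16.xx alternation described above.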
Could you check if any processes are spawned once you are connected which could then slow down the CPU? A full profile with a timeline (e.g. via Nsight Systems) should also show it in case the CPU has trouble keeping up.
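One way to check for newly spawned processes is to snapshot the process list before and after plugging in the WAN link and compare the two. The following is only a sketch, assuming a Linux system with `/proc` mounted; the helper names `snapshot_processes` and `new_processes` are made up for illustration.

```python
import os


def snapshot_processes(proc_root="/proc"):
    """Map PID -> command name for every process currently visible in /proc."""
    procs = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                procs[int(entry)] = f.read().strip()
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return procs


def new_processes(before, after):
    """Return the processes present in `after` but not in `before`."""
    return {pid: name for pid, name in after.items() if pid not in before}
```

Take one snapshot while offline, connect, take a second one, and print `new_processes(before, after)`. Very short-lived processes can still slip between the two snapshots, so an empty result does not rule them out.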
Keep in mind that the server is on a local 10GbE hub, and that hub is connected to a 1GbE switch that runs DHCP. It is the switch’s connection to the internet that I am connecting and disconnecting, and that alone causes the issue; I am not touching anything else.
There are roughly 1322 processes and 1 user as reported by GKrellM, with the combined processor load running between 24% and 40%. I am not seeing any process other than the python processes moving on the CPU utilization list, and their usage varies with the epoch processing, as shown below.
I will have to install Nsight Systems.
For reference, the system is running a Xeon w5-3435X, an X13SWA-TF server board, and two RTX 3090 Ti GPUs.
This appears to be an issue with how systemd-timesyncd works on Ubuntu 22.04.
I couldn’t find any processes that were magically starting when I connected the switch to the internet.
But I did notice that there were NTP calls going out, which could not go out when the internet was disconnected. Creating an NTP server on the local network and pointing the server to it solved the problem. Epoch times are now consistent with and without the switch connected to the internet. SOLVED.
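For anyone who wants to try the same fix, here is a rough sketch of the client side, assuming systemd-timesyncd is the NTP client being redirected. The helper `render_timesyncd_dropin` and the address `192.168.1.10` are placeholders, not my actual setup; `NTP=` and `FallbackNTP=` are the timesyncd.conf options in the `[Time]` section.

```python
def render_timesyncd_dropin(ntp_server, fallback=None):
    """Build the text of a systemd-timesyncd drop-in that pins the NTP server.

    ntp_server and fallback are host names or IP addresses (placeholders here).
    """
    lines = ["[Time]", f"NTP={ntp_server}"]
    if fallback:
        lines.append(f"FallbackNTP={fallback}")
    return "\n".join(lines) + "\n"


# Hypothetical local NTP server address; replace with your own.
EXAMPLE_DROPIN = render_timesyncd_dropin("192.168.1.10")
```

The resulting text would go in a file under /etc/systemd/timesyncd.conf.d/ (any name ending in .conf), followed by `systemctl restart systemd-timesyncd`. `timedatectl timesync-status` should then show which server is in use.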