I will preface this by saying that I am new to PyTorch and parallel computing. I have a PyTorch training script that uses Optuna for parallel hyperparameter search. I run it on a Mac Studio with an M1 Ultra, which has 20 CPUs, with n_jobs set to 20. All CPUs show utilization above 95% once every job starts training.
I also run the exact same code on a Linux virtual machine with 20 CPUs, which sits on a corporate compute grid. There, running the script with n_jobs set to 20 is 2-3x slower than the Mac run. I checked CPU usage with htop and it is generally low, around 30-40%. How can I debug this to find the bottleneck?
The discrepancy between the two runs is strange; I am using the same Python and package versions on both machines.
Your Linux VM is running on a corporate compute grid, which likely means:
The VM’s 20 “CPUs” might be virtual CPUs (vCPUs) rather than dedicated physical cores
These vCPUs could be spread across multiple physical hosts or NUMA nodes
The VM might be competing with other VMs for physical resources
There could be additional virtualization overhead
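To test the third point (whether the VM is actually getting the physical CPU time it asks for), you can sample the "steal" time, which is the share of time the hypervisor gave your vCPUs to other guests. This is a minimal sketch using psutil; the steal and iowait fields only exist on Linux, so they are read defensively:

import psutil

# Sample system-wide CPU time percentages over a short window.
# Consistently high "steal" means the VM is competing with other
# guests for physical cores; it should be ~0 on bare metal.
times = psutil.cpu_times_percent(interval=5)
print(f"user:   {times.user:.1f}%")
print(f"system: {times.system:.1f}%")
print(f"iowait: {getattr(times, 'iowait', 0.0):.1f}%")   # Linux only
print(f"steal:  {getattr(times, 'steal', 0.0):.1f}%")    # Linux only

Run it on the VM while the Optuna study is busy; a steal value of more than a few percent points at contention on the host rather than at your code.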
Beyond that, you can try any of the following.
Verify CPU Architecture and Configuration
import os
import platform
import subprocess

import psutil


def diagnose_cpu_configuration():
    """Comprehensive CPU configuration diagnosis."""
    print("=== System Information ===")
    print(f"Platform: {platform.platform()}")
    print(f"Processor: {platform.processor()}")
    print(f"Python version: {platform.python_version()}")

    print("\n=== CPU Configuration ===")
    print(f"Physical cores: {psutil.cpu_count(logical=False)}")
    print(f"Logical cores: {psutil.cpu_count(logical=True)}")

    # Check CPU affinity - which cores the process is actually allowed to use
    try:
        affinity = os.sched_getaffinity(0)
        print(f"CPU affinity (cores available to this process): {sorted(affinity)}")
        print(f"Number of cores available: {len(affinity)}")
    except AttributeError:
        print("CPU affinity check not available on this system")

    # Check for CPU frequency scaling
    print("\n=== CPU Frequency ===")
    freq = psutil.cpu_freq()
    if freq:
        print(f"Current: {freq.current:.2f} MHz")
        print(f"Min: {freq.min:.2f} MHz")
        print(f"Max: {freq.max:.2f} MHz")

    # Check for NUMA configuration on Linux
    if platform.system() == "Linux":
        try:
            numa_info = subprocess.check_output("numactl --hardware", shell=True, text=True)
            print("\n=== NUMA Configuration ===")
            print(numa_info)
        except subprocess.CalledProcessError:
            print("NUMA information not available")

        # Check for CPU throttling or power management
        try:
            governor = subprocess.check_output(
                "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                shell=True, text=True
            ).strip()
            print(f"\nCPU Governor: {governor}")
        except subprocess.CalledProcessError:
            pass


diagnose_cpu_configuration()
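Another thing worth ruling out on the Linux box is thread oversubscription. With Optuna's n_jobs=20, each trial may also spawn its own pool of intra-op PyTorch/OpenMP threads (often one per core by default), so 20 trials times 20 threads can thrash a 20-vCPU VM and show up exactly as low per-core utilization. This is a minimal sketch to inspect and cap the per-trial thread count; whether capping it to 1 is right for your workload is an assumption to verify:

import os
import torch

# How many intra-op threads each PyTorch process will use by default.
# If this is ~20 and Optuna runs 20 trials in parallel, you are asking
# for ~400 busy threads on 20 vCPUs.
print("torch intra-op threads:", torch.get_num_threads())
print("torch inter-op threads:", torch.get_num_interop_threads())
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
print("MKL_NUM_THREADS:", os.environ.get("MKL_NUM_THREADS"))

# One common mitigation: give each parallel trial a single thread so that
# n_jobs trials together use roughly n_jobs cores.
torch.set_num_threads(1)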
Or you can monitor resource utilization during training. Most importantly, check the CUDA, PyTorch, and NumPy configurations on both machines.
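For that comparison, a quick approach is to dump the build and backend details of PyTorch and NumPy on both machines and diff the output; a different BLAS backend (for example Accelerate on the Mac versus a generic OpenBLAS or MKL build on the VM) can by itself explain a large CPU speed gap. A minimal sketch:

import numpy as np
import torch

# Dump build/backend details so the Mac and Linux outputs can be diffed.
print(torch.__version__)
print(torch.__config__.show())          # compiler, BLAS/LAPACK, OpenMP, MKL-DNN, etc.
print("MKL available:", torch.backends.mkl.is_available())
print("MKL-DNN available:", torch.backends.mkldnn.is_available())
np.show_config()                        # which BLAS/LAPACK NumPy was linked against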