I will preface this by saying I am new to PyTorch and parallel computing. I have a PyTorch training script that uses Optuna for parallel hyperparameter search. I run it on a Mac Studio with an M1 Ultra, which has 20 CPUs, with n_jobs set to 20. I see that all CPUs have utilization above 95% once every job starts training.
Similarly, I run the exact same code on a Linux virtual machine with 20 CPUs on a corporate compute grid. With n_jobs set to 20, the run is 2-3x slower than on the Mac. I ran htop to check CPU usage and it is generally low, around 30-40%. How can I debug this to find the bottleneck?
The discrepancy between the two runs is strange, since I am using the same Python and package versions on both machines.
Happy to share more info as needed.
Thank you for your help in advance.
Your Linux VM is running on a corporate compute grid, which likely means:
- The VM’s 20 “CPUs” might be virtual CPUs (vCPUs) rather than dedicated physical cores
- These vCPUs could be spread across multiple physical hosts or NUMA nodes
- The VM might be competing with other VMs for physical resources (see the steal-time check just after this list)
- There could be additional virtualization overhead
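One quick way to gauge that third point is to look at CPU "steal" time, i.e. the share of time the hypervisor hands your vCPUs to other guests. Here is a minimal sketch using psutil (the function name check_steal_time and the 5% threshold are my own; steal is only reported on Linux, so it is read defensively):

import psutil

def check_steal_time(interval=5):
    """Sample system-wide CPU times and report steal (time given to other guests)."""
    times = psutil.cpu_times_percent(interval=interval)
    steal = getattr(times, "steal", None)  # the 'steal' field only exists on Linux
    if steal is None:
        print("Steal time not reported on this platform")
    else:
        print(f"user={times.user:.1f}%  system={times.system:.1f}%  "
              f"idle={times.idle:.1f}%  steal={steal:.1f}%")
        # Rule of thumb: persistent steal above a few percent suggests contention
        if steal > 5:
            print("High steal time: the VM is likely competing with other guests for CPU")

check_steal_time()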
Beyond that, here are some things you can try.
Verify the CPU architecture and configuration:
import os
import psutil
import platform
import subprocess

def diagnose_cpu_configuration():
    """Comprehensive CPU configuration diagnosis."""
    print("=== System Information ===")
    print(f"Platform: {platform.platform()}")
    print(f"Processor: {platform.processor()}")
    print(f"Python version: {platform.python_version()}")

    print("\n=== CPU Configuration ===")
    print(f"Physical cores: {psutil.cpu_count(logical=False)}")
    print(f"Logical cores: {psutil.cpu_count(logical=True)}")

    # Check CPU affinity - which cores the process is actually allowed to use
    try:
        affinity = os.sched_getaffinity(0)
        print(f"CPU affinity (cores available to this process): {sorted(affinity)}")
        print(f"Number of cores available: {len(affinity)}")
    except AttributeError:
        print("CPU affinity check not available on this system")

    # Check CPU frequency scaling
    print("\n=== CPU Frequency ===")
    freq = psutil.cpu_freq()
    if freq:
        print(f"Current: {freq.current:.2f} MHz")
        print(f"Min: {freq.min:.2f} MHz")
        print(f"Max: {freq.max:.2f} MHz")

    # Check NUMA configuration on Linux
    if platform.system() == "Linux":
        try:
            numa_info = subprocess.check_output("numactl --hardware", shell=True, text=True)
            print("\n=== NUMA Configuration ===")
            print(numa_info)
        except (subprocess.CalledProcessError, OSError):
            print("NUMA information not available")

        # Check the CPU frequency governor (power management / throttling)
        try:
            governor = subprocess.check_output(
                "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                shell=True, text=True
            ).strip()
            print(f"\nCPU Governor: {governor}")
        except (subprocess.CalledProcessError, OSError):
            pass

diagnose_cpu_configuration()
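Run this on both machines and compare the output. In particular, if the number of cores reported by os.sched_getaffinity(0) is smaller than the logical core count, the grid scheduler has probably pinned your job to a subset of cores (via cgroups/cpusets), which alone would explain both the low htop utilization and the slowdown; a "powersave" governor or a low current frequency would point to throttling instead.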
Alternatively, you can monitor resource utilization while the trials are actually training. Most importantly, check the CUDA and NumPy/BLAS configurations on both machines.
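For that comparison, a minimal sketch along these lines (dump_compute_config is my own name; the environment variables are the usual OpenMP/MKL/OpenBLAS knobs, adjust to whatever your grid actually sets) prints the thread-, BLAS-, and CUDA-related settings so you can diff the output from both machines:

import os
import numpy as np
import torch

def dump_compute_config():
    """Print thread/BLAS/CUDA settings so the two machines can be compared."""
    print("=== PyTorch ===")
    print(f"torch version:    {torch.__version__}")
    print(f"intra-op threads: {torch.get_num_threads()}")
    print(f"inter-op threads: {torch.get_num_interop_threads()}")
    print(f"MKL available:    {torch.backends.mkl.is_available()}")
    print(f"CUDA available:   {torch.cuda.is_available()}")

    print("\n=== NumPy / BLAS ===")
    np.show_config()  # shows which BLAS/LAPACK NumPy was built against

    print("\n=== Threading environment variables ===")
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
        print(f"{var}={os.environ.get(var, '<unset>')}")

dump_compute_config()

If each of the 20 parallel trials spawns many BLAS/OpenMP threads of its own, the workers oversubscribe the cores and spend their time context-switching; limiting each trial to one thread (for example OMP_NUM_THREADS=1 or torch.set_num_threads(1) inside the objective) is a common fix worth trying.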