CPU training optimisation

Hi all! I have an Intel i9-9980HK and am running PyTorch on macOS.

I am running a training script that uses a custom Dataset and DataLoader with num_workers=0.
Before I updated the environment, my training script was utilizing 100% CPU, with all cores in use efficiently. After I reinstalled PyTorch and some libraries, the utilization dropped; now it's about 1/8 of what it used to be.

`torch.__config__.parallel_info()` gives:

```
ATen/Parallel:
	at::get_num_threads() : 8
	at::get_num_interop_threads() : 8
OpenMP not found
Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 1
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
std::thread::hardware_concurrency() : 16
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: native thread pool
```
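
Since the report shows both thread-count environment variables as `[not set]`, one thing I tried was pinning them explicitly before launching the script. A minimal sketch (the script name `train.py` is just a placeholder for your own script):

```shell
# Tell OpenMP- and MKL-backed kernels how many threads to use.
# 8 matches the physical core count of the i9-9980HK.
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8

# Then launch the training script in the same shell, e.g.:
# python train.py
echo "OMP=$OMP_NUM_THREADS MKL=$MKL_NUM_THREADS"
```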

Can anyone make sense of this? Why does `mkl_get_max_threads()` return 1? Is it possible to debug this and bring performance back? How does PyTorch adapt to CPU and RAM utilization? Should I alter the training script to use all cores?
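For what it's worth, the intra-op thread count can also be inspected and overridden from Python, which makes it easy to check whether the setting actually takes effect. A minimal sketch, assuming only that `torch` is importable:

```python
import torch

# Intra-op parallelism: threads used inside a single op (e.g. a matmul).
torch.set_num_threads(8)

# Confirm the setting took effect and re-dump the full parallel report.
print(torch.get_num_threads())
print(torch.__config__.parallel_info())
```

There is also `torch.set_num_interop_threads()` for inter-op parallelism, but it must be called before any inter-op work starts, so it belongs at the very top of the script.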