Summary
I downloaded the pre-trained LLM deepseek-R1-1.5B and ran inference separately on the OneDNN and MKL BLAS backends. The OneDNN run is roughly 10x slower. Here is the relevant part of my script:
import torch
from torch.utils import mkldnn as mkldnn_utils

# load_model_for_profiling, args, torch_dtype, device and performance_data
# are defined elsewhere in the script.
model, tokenizer = load_model_for_profiling(
    args.model,
    torch_dtype,
    random_weights=args.random_weights,
)

# Apply MKLDNN/OneDNN optimization
if args.use_mkldnn:
    model = mkldnn_utils.to_mkldnn(model)

# Prepare batch inputs
if args.batch_prompts:
    # Use the provided list of prompts
    prompts = args.batch_prompts
    if len(prompts) < args.batch_size:
        # If there are not enough prompts, repeat the last prompt
        prompts.extend([prompts[-1]] * (args.batch_size - len(prompts)))
    prompts = prompts[:args.batch_size]
else:
    # Repeat the default prompt
    prompts = [args.prompt] * args.batch_size

print(f"Preparing batch of {args.batch_size} prompts...")
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Record the number of prompt tokens
prompt_tokens = inputs["input_ids"].shape[1]
performance_data["prompt_tokens"] = prompt_tokens
inputs = {key: val.to(device) for key, val in inputs.items()}

def inference_fn():
    with torch.no_grad():
        for _ in range(args.iterations):
            if args.use_mkldnn and args.mkldnn_verbose:
                with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
                    outputs = model(**inputs)
            else:
                outputs = model(**inputs)
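The reorder also shows up in isolation. Here is a minimal standalone sketch (my assumption: a single layer with the same 1536 -> 8960 shape that appears in the logs below, converted the same way; whether oneDNN inserts a per-call weight reorder may depend on the PyTorch/oneDNN build):

import torch
from torch.utils import mkldnn as mkldnn_utils

# One linear layer with the same shape as the model's MLP projection (1536 -> 8960)
layer = torch.nn.Linear(1536, 8960).eval()
layer = mkldnn_utils.to_mkldnn(layer)
x = torch.randn(1, 1536)

# Print oneDNN primitive timings for a single forward call
with torch.no_grad():
    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        y = layer(x)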
I checked the onednn_verbose log and found that the cost of aten::linear is dominated by reorders rather than by the matmul itself: each inner_product takes ~0.6-0.8 ms, while the weight reorder from plain ab into the blocked AB16b64a layout takes ~4.2-4.3 ms per call:
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core,forward_training,src_f32:a:blocked:ab::f0 wei_f32:a:blocked:AB16b64a::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:ab::f0,attr-scratchpad:user,,mb1ic1536oc8960,0.596924
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,attr-scratchpad:user,,1x1x8960,4.28809
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,attr-scratchpad:user,,1x1x1536,0.0109863
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab::f0 dst_f32::blocked:AB16b64a::f0,attr-scratchpad:user,,8960x1536,4.32202
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core,forward_training,src_f32:a:blocked:ab::f0 wei_f32:a:blocked:AB16b64a::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:ab::f0,attr-scratchpad:user,,mb1ic1536oc8960,0.611816
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,attr-scratchpad:user,,1x1x8960,0.059082
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,attr-scratchpad:user,,1x1x8960,0.013916
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab::f0 dst_f32::blocked:AB16b64a::f0,attr-scratchpad:user,,1536x8960,4.17603
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core,forward_training,src_f32:a:blocked:ab::f0 wei_f32:a:blocked:AB16b64a::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:ab::f0,attr-scratchpad:user,,mb1ic8960oc1536,0.780029
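To quantify this, I totaled the verbose timings by primitive type with a throwaway helper (it assumes the exact field layout shown above, with the primitive kind in the fifth comma-separated field and the execution time in milliseconds last; other oneDNN versions or verbose flags may add fields):

from collections import defaultdict

def sum_onednn_times(log_path):
    totals = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            if not line.startswith("onednn_verbose,primitive,exec"):
                continue
            fields = line.strip().split(",")
            totals[fields[4]] += float(fields[-1])  # primitive kind -> total ms
    return dict(totals)

print(sum_onednn_times("onednn_verbose.log"))

On the nine log lines above this prints roughly {'inner_product': 1.99, 'reorder': 12.87}, i.e. the reorders cost about 6x the actual matmul time.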
My question is: is there a way to convert a pretrained PyTorch model's weights into the OneDNN blocked format ahead of time, so that subsequent linear operations don't have to reorder the weights on every call?
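Conceptually, what I am hoping exists is a one-time offline pack. The onednn_pack helper below is hypothetical; it is exactly the API I am looking for:

# Hypothetical sketch, not real code:
#
#   packed_state = {}
#   for name, w in model.state_dict().items():
#       packed_state[name] = onednn_pack(w, layout="AB16b64a")  # reorder once, offline
#   torch.save(packed_state, "deepseek_r1_onednn.pt")           # reuse the blocked layout later
#
# I know Tensor.to_mkldnn() exists, but as far as I can tell it only yields an
# opaque MKLDNN tensor in a plain layout, and the ab -> AB16b64a reorder still
# happens on every forward call.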