Performance and AMX utilization questions with optimum-intel 1.17 and 1.20 for LLM Inference on SPR CPU #946

zsym-sjtu commented Oct 12, 2024

I am using AMX on an SPR CPU to test LLM inference performance, and I have some questions about the performance of the different optimum.intel interfaces.

I am using optimum-intel==1.17.0 and optimum-intel==1.20.0, together with `intel_extension_for_pytorch==2.4.0`, `torch==2.4.1`, and `transformers==4.41.2`.

I am testing llama-2-7b inference with BF16 precision.

As recommended in #942, I switched from `from optimum.intel import inference_mode` to `from optimum.intel.pipelines import pipeline`. I then tested the performance of both under the environment above. The results (average inference time) are below; `base` means without optimum-intel and `opt` means with it.

| optimum-intel==1.17 | `inference_mode` (to be removed) | `optimum.intel.pipelines` (recommended) |
| --- | --- | --- |
| base | 17.52 | 17.58 |
| opt | 17.31 | 1.00 |

| optimum-intel==1.20 | `inference_mode` (removed) | `optimum.intel.pipelines` (recommended) |
| --- | --- | --- |
| base | N/A | 17.58 |
| opt | N/A | 0.49 |

Using optimum-intel 1.17, I can observe the oneDNN primitives with DNNL_VERBOSE and control the ISA (i.e. whether AMX is used) with DNNL_MAX_CPU_ISA, in both the base and opt setups.
Using optimum-intel 1.20, I can still do so in base (native transformers), but not in opt (optimum.intel.pipelines). This indicates that optimum-intel 1.17 goes through oneDNN while optimum-intel 1.20 does not.
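For reference, this is roughly how I set the oneDNN knobs (a minimal sketch; in my actual runs I exported the variables in the shell before launching the script, which should be equivalent, and the exact ISA cap value here is just an example):

```python
import os

# oneDNN reads these at initialization, so set them before importing torch.
os.environ["DNNL_VERBOSE"] = "1"                      # print executed oneDNN primitives
os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_BF16"   # cap the ISA below AMX for comparison

import torch  # noqa: E402
```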

I learned that optimum-intel is partly based on intel-extension-for-pytorch (IPEX), and that IPEX relies on libxsmm for AMX (see intel/intel-extension-for-pytorch#517 and intel/intel-extension-for-pytorch#720).

My questions are:

1. In optimum-intel 1.17, do `inference_mode` and `optimum.intel.pipelines` apply different optimizations? What is the difference?
2. It seems that native transformers 4.41.2 also uses AMX (through oneDNN). How does `optimum.intel.pipelines` in optimum-intel 1.17 utilize AMX better?
3. It seems that optimum-intel 1.20 does not use AMX through oneDNN. Does it still use AMX? If so, through which library? libxsmm?
4. How can I verify that AMX is actually used by optimum-intel 1.20 and by the corresponding native transformers run? (See the check I tried after this list.)
5. Since oneDNN is not used, how can I control whether AMX is used and observe the primitives of whichever library is used instead? Is there a substitute for DNNL_MAX_CPU_ISA?
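For question 4, the only checks I have so far confirm hardware/framework capability, not actual AMX usage, which is why I am asking. A minimal sketch of what I tried (assuming `torch.backends.cpu.get_cpu_capability()` is the right API here; it is available in recent torch releases):

```python
import torch

# Reports the highest ISA level ATen detected at startup ("AMX" on SPR),
# but this does not prove that AMX instructions are actually executed.
print(torch.backends.cpu.get_cpu_capability())

# The CPU flags (amx_tile / amx_bf16 / amx_int8) only confirm hardware support.
with open("/proc/cpuinfo") as f:
    print("amx_tile" in f.read())
```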

Many many thanks!

Code as follows:

```python
# inference_mode (to be removed)
import sys

import torch
from transformers import pipeline
from optimum.intel import inference_mode

# model_id and benchmark() are defined elsewhere in the script;
# benchmark() returns the average inference time of the pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
)

if sys.argv[1] == 'base':
    result = benchmark(pipe)
elif sys.argv[1] == 'opt':
    with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
        result = benchmark(opt_pipe)
```

```python
# optimum.intel.pipelines (recommended)
import sys

import torch
from transformers.pipelines import pipeline as transformers_pipeline
from optimum.intel.pipelines import pipeline as ipex_pipeline

if sys.argv[1] == 'base':
    pipe = transformers_pipeline("text-generation", model_id, torch_dtype=torch.bfloat16)
elif sys.argv[1] == 'opt':
    pipe = ipex_pipeline("text-generation", model_id, accelerator="ipex", torch_dtype=torch.bfloat16)

with torch.inference_mode():
    result = benchmark(pipe)
```
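For completeness, the `benchmark()` helper used above is roughly as follows (a minimal sketch; the prompt, generation length, and repetition counts here are placeholders, not the exact values I used):

```python
import time

def benchmark(pipe, prompt="Hello, my name is", n_warmup=2, n_runs=10, max_new_tokens=32):
    # Warm-up iterations are excluded from timing.
    for _ in range(n_warmup):
        pipe(prompt, max_new_tokens=max_new_tokens)
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / n_runs  # average inference time in seconds
```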