Performance and AMX utilization questions with optimum-intel 1.17 and 1.20 for LLM Inference on SPR CPU #946

zsym-sjtu commented Oct 12, 2024

I am using AMX on an SPR CPU to test LLM inference performance, and I have some questions about the performance of the different optimum.intel interfaces.

I am using optimum-intel==1.17.0 and optimum-intel==1.20.0, together with `intel_extension_for_pytorch==2.4.0`, `torch==2.4.1`, and `transformers==4.41.2`.

I am testing llama-2-7b inference with BF16 precision.

As recommended in #942, I switched from `from optimum.intel import inference_mode` to `from optimum.intel.pipelines import pipeline`. I then tested the performance of both under the environment above. The results (average inference time) are below; `base` means without optimum-intel and `opt` means with it.

| optimum-intel==1.17 | `inference_mode` (to be removed) | `optimum.intel.pipelines` (recommended) |
| --- | --- | --- |
| base | 17.52 | 17.58 |
| opt | 17.31 | 1.00 |

| optimum-intel==1.20 | `inference_mode` (removed) | `optimum.intel.pipelines` (recommended) |
| --- | --- | --- |
| base | N/A | 17.58 |
| opt | N/A | 0.49 |

Using optimum-intel 1.17, I can observe the oneDNN primitives with DNNL_VERBOSE and control the ISA (i.e. whether AMX is used) with DNNL_MAX_CPU_ISA, in both the base and opt setups.
Using optimum-intel 1.20, I can still do so in base (native transformers), but not in opt (optimum.intel.pipelines). This indicates that optimum-intel 1.17 goes through oneDNN while optimum-intel 1.20 does not.
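For reference, this is roughly how I set the oneDNN knobs (a minimal sketch; in my actual runs I exported the variables in the shell before launching the script, which should be equivalent, and the exact ISA cap value here is just an example):

```python
import os

# oneDNN reads these at initialization, so set them before importing torch.
os.environ["DNNL_VERBOSE"] = "1"                      # print executed oneDNN primitives
os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_BF16"   # cap the ISA below AMX for comparison

import torch  # noqa: E402
```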

I learned that optimum-intel is partly based on intel-extension-for-pytorch (IPEX), and that IPEX relies on libxsmm for AMX (see intel/intel-extension-for-pytorch#517 and intel/intel-extension-for-pytorch#720).

My questions are:

1. In optimum-intel 1.17, do `inference_mode` and `optimum.intel.pipelines` apply different optimizations? What is the difference?
2. It seems that native transformers 4.41.2 also uses AMX (through oneDNN). How does `optimum.intel.pipelines` in optimum-intel 1.17 utilize AMX better?
3. It seems that optimum-intel 1.20 does not use AMX through oneDNN. Does it still use AMX? If so, through which library? libxsmm?
4. How can I verify that AMX is actually used by optimum-intel 1.20 and by the corresponding native transformers run? (See the check I tried after this list.)
5. Since oneDNN is not used, how can I control whether AMX is used and observe the primitives of whichever library is used instead? Is there a substitute for DNNL_MAX_CPU_ISA?
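For question 4, the only checks I have so far confirm hardware/framework capability, not actual AMX usage, which is why I am asking. A minimal sketch of what I tried (assuming `torch.backends.cpu.get_cpu_capability()` is the right API here; it is available in recent torch releases):

```python
import torch

# Reports the highest ISA level ATen detected at startup ("AMX" on SPR),
# but this does not prove that AMX instructions are actually executed.
print(torch.backends.cpu.get_cpu_capability())

# The CPU flags (amx_tile / amx_bf16 / amx_int8) only confirm hardware support.
with open("/proc/cpuinfo") as f:
    print("amx_tile" in f.read())
```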

Many many thanks!

Code as follows:

```python
# inference_mode (to be removed)
import sys

import torch
from transformers import pipeline
from optimum.intel import inference_mode

# model_id and benchmark() are defined elsewhere in the script;
# benchmark() returns the average inference time of the pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
)

if sys.argv[1] == 'base':
    result = benchmark(pipe)
elif sys.argv[1] == 'opt':
    with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
        result = benchmark(opt_pipe)
```

```python
# optimum.intel.pipelines (recommended)
import sys

import torch
from transformers.pipelines import pipeline as transformers_pipeline
from optimum.intel.pipelines import pipeline as ipex_pipeline

if sys.argv[1] == 'base':
    pipe = transformers_pipeline("text-generation", model_id, torch_dtype=torch.bfloat16)
elif sys.argv[1] == 'opt':
    pipe = ipex_pipeline("text-generation", model_id, accelerator="ipex", torch_dtype=torch.bfloat16)

with torch.inference_mode():
    result = benchmark(pipe)
```
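For completeness, the `benchmark()` helper used above is roughly as follows (a minimal sketch; the prompt, generation length, and repetition counts here are placeholders, not the exact values I used):

```python
import time

def benchmark(pipe, prompt="Hello, my name is", n_warmup=2, n_runs=10, max_new_tokens=32):
    # Warm-up iterations are excluded from timing.
    for _ in range(n_warmup):
        pipe(prompt, max_new_tokens=max_new_tokens)
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / n_runs  # average inference time in seconds
```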