
GPU Not Utilized When Using llm-rs with CUDA Version #27

Open
andri-jpg opened this issue Jul 19, 2023 · 2 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@andri-jpg (Contributor)

I have installed the llm-rs library with the CUDA version. However, even though I have set use_gpu=True in the SessionConfig, the GPU is not utilized when running the code; instead, CPU usage stays at 100% during execution.

Additional Information:
I am using the "RedPajama Chat 3B" model from Rustformers. The model can be found at the following link: RedPajama Chat 3B Model.

Terminal output:

PS C:\Users\andri\Downloads\chatwaifu> python main.py
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1

Code:

import json
from llm_rs.langchain import RustformersLLM
from llm_rs import SessionConfig, GenerationConfig, ContainerType, QuantizationType, Precision
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from pathlib import Path

class ChainingModel:
    def __init__(self, model, name, assistant_name):
        with open('config.json') as self.configuration:
            self.user_config = json.load(self.configuration)
        with open('template.json') as self.prompt_template:
            self.user_template = json.load(self.prompt_template)
        model = f"{model}.bin"
        self.model = model

        self.name = name
        self.assistant_name = assistant_name
        self.names = f"<{name}>"
        self.assistant_names = f"<{assistant_name}>"
        
        self.stop_word = ['\n<human>:', '<human>', '<bot>', '\n<bot>:']
        self.stop_words = self.change_stop_words(self.stop_word, self.name, self.assistant_name)
        session_config = SessionConfig(
            threads=self.user_config['threads'],
            context_length=self.user_config['context_length'],
            prefer_mmap=False,
            use_gpu=True
        )

        generation_config = GenerationConfig(
            top_p=self.user_config['top_p'],
            top_k=self.user_config['top_k'],
            temperature=self.user_config['temperature'],
            max_new_tokens=self.user_config['max_new_tokens'],
            repetition_penalty=self.user_config['repetition_penalty'],
            stop_words=self.stop_words
        )

        template = self.user_template['template']

        self.template = self.change_names(template, self.assistant_name, self.name)
        self.prompt = PromptTemplate(
            input_variables=["chat_history", "instruction"],
            template=self.template
        )
        self.memory = ConversationBufferMemory(memory_key="chat_history")

        self.llm = RustformersLLM(
            model_path_or_repo_id=self.model,
            session_config=session_config,
            generation_config=generation_config,
            callbacks=[StreamingStdOutCallbackHandler()]
        )

        self.chain = LLMChain(llm=self.llm, prompt=self.prompt, memory=self.memory)
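
For context, a minimal sketch of how this class might be driven (the config.json and template.json files plus the change_stop_words/change_names helpers are assumed to be defined elsewhere in the project; the model filename and participant names below are hypothetical):

chat = ChainingModel("RedPajama-INCITE-Chat-3B-q5_1-ggjt", "human", "bot")  # hypothetical model file, ".bin" is appended by the class
chat.chain.predict(instruction="Hello, how are you?")                       # streams tokens via StreamingStdOutCallbackHandler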
@LLukas22 (Owner)

Currently, only llama-based models are accelerated by Metal/CUDA/OpenCL. If you use another architecture such as gpt-neox, it will fall back to CPU-only inference. What you are seeing in your stdout is your GPU being initialized, but the model is then not offloaded to the GPU because we haven't implemented acceleration for this architecture in rustformers/llm yet.

I will probably create some sort of table in the rustformers/llm repo that shows which architectures are accelerated on which platforms, and then link to it to avoid further confusion.

We are planning to bring CUDA acceleration to gpt-neox, gpt2, etc., but it will take some time, as all internal operations of these models need to be implemented as CUDA kernels in the ggml repo. Currently, only llama and falcon can be completely offloaded onto the GPU and get full acceleration.
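
As a minimal sketch of what currently does get accelerated, the same configuration from the snippet above can be pointed at a llama-based GGML model; the model filename and config values here are assumptions, not taken from this issue:

from llm_rs import SessionConfig, GenerationConfig
from llm_rs.langchain import RustformersLLM
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

session_config = SessionConfig(
    threads=8,              # assumed value, tune for your machine
    context_length=2048,    # assumed value
    prefer_mmap=False,
    use_gpu=True,           # only takes effect for CUDA-accelerated architectures (llama, falcon)
)

llm = RustformersLLM(
    model_path_or_repo_id="open_llama_3b-q4_0.bin",  # hypothetical llama-based model file
    session_config=session_config,
    generation_config=GenerationConfig(max_new_tokens=64),
    callbacks=[StreamingStdOutCallbackHandler()],
)

llm("Hello!")  # with a llama-based model, layers are offloaded to the GPU instead of pinning the CPU at 100%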

@LLukas22 LLukas22 added documentation Improvements or additions to documentation question Further information is requested labels Jul 19, 2023
@LLukas22 LLukas22 pinned this issue Jul 19, 2023
@andri-jpg (Contributor, Author)

I appreciate the plan to create a table in the rustformers/llm repository showing which architectures support acceleration on each platform. That will definitely help avoid confusion in the future. Thanks again for the explanation.
