Suggested architecture for Kedro & LLMs #3979

astrojuanlu · 2024-07-02T17:33:31Z

astrojuanlu
Jul 2, 2024
Maintainer

I've been toying with GenAI & Kedro for a while, first with Hugging Face (thanks to our datasets) and now through Ollama, which gives an OpenAI-compatible HTTP API:

$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt":"Why is the sky blue?"
}'

I'm trying to wrap my head around what would be the "blessed" way to implement this in Kedro.

1. Frozen inputs

One way would be to let the dataset include the prompt as part of the configuration, as follows:

llama3_ds:
  type: OllamaDataset
  model: llama3
  prompt: "Why is the sky blue?"

so catalog.load("llama3_ds") gives immediately an output.

Pro: The Kedro dataset performs all of the I/O needed. It's probably the "most pure" implementation. Could be modeled by using the already existing APIDataset.
Con: What if my prompt is not known at definition time?

2. Wrapper object

Another way of doing it would be returning a wrapper object to the node:

llama3_ds:
  type: OllamaDataset
  model: llama3

Which would then allow me to do things like

node(
  func=get_response,
  inputs=["llama3_ds", "param:prompt1"],
)

...

def get_response(llm, prompt: str):
  return llm.call(prompt)

Pro: It allows dynamic prompts, which might be necessary for some use cases.
Con 1: The node is performing I/O, which means that we're "breaking" the Kedro notion of what the node and the dataset should do.
Con 2: It breaks reproducibility, given that the prompt is not static (in this case it was a param, but could as well be the output of another node)

At the same time, we're already "breaking" that I/O separation in several places:

Any thoughts?

lordsoffallen · 2024-07-03T10:11:24Z

lordsoffallen
Jul 3, 2024

Hi @astrojuanlu,

Coming from our PR to share my thoughts on how kedro should work with LLMs.

My usual workflow is LLM is like this: I need a open source model so that requires defining some params for huggingface model. With LLMs in mind, there is lots of configs around managing models which I believe are suited to kedro catalog but I don't personally like considering them as datasets. So some separation like data catalog and model catalog would be highly appreciated. I am also not a fan of most of the claimed to be LLM tools our there, that includes langchain, llamaindex etc. They really aim for high level users with little to no experience in building ML models. I consider kedro as a tool for data scientists hence I believe they should be avoided early on. Could be optional dependency for certain cases but I don't see the need.

For huggingface I can imagine wrappers around getting model and tokenizer or pipeline depending on the use case is a good start. Also ability to save to local disk and/or to huggingface. This could be a dataset or a model as well. For instance, creating embeddings from a dataset and I would like to save them to huggingface. Adding more vector store capabilities would be nice abstraction like both open source tools as well as some apis. (qdrant etc)

Now for the API endpoint, there were some nice packages ( i forgot the name) which makes a unified entrypoint for the all other different apis like OpenAI, Claude etc. This would def help however a this makes a call to an endpoint so here, useful features such as adding pricing or logging. Ability to test this with a smaller local model would be super nice (not sure how for now).

Lastly, there is some work around prompts. Managing prompts, strings etc. Maybe there is something to do there.

Overall since kedro aims to manage data IO, simplifying getting the open source models (even lazy loading big models), making it seamless to work with APIs is a step in the right direction imo.

Let me know if there are some points that weren't clear. Happy to explain it in detail.

5 replies

astrojuanlu Jul 3, 2024
Maintainer Author

Thanks a lot @lordsoffallen for adding your thoughts. There's a lot to unpack so I'll go piece by piece, just picking one for now:

With LLMs in mind, there is lots of configs around managing models which I believe are suited to kedro catalog but I don't personally like considering them as datasets.

Now that you mention it, it was one of the elements surfaced in @AlpAribal & @datajoely research on MLOps platforms https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms#differentiating-between-data-model-and-reporting-artifacts

In the end we call them datasets for historical reasons, even though many of them aren't really datasets.

lordsoffallen Jul 3, 2024

I mean for the standard data science workflow there weren't a need to do much as they were just pickle dataset in the end. So you can do regular imports and use whatever model you like and gets serialized in the same way. LLMs on the hand, can be just bunch of configs. Example of what I mean here:

Standard Flow: -> Works for linear regression, XGboost, LightGBM etc.

model:
    type: PickleDataset
    filepath: ...

LLM Flow

llama3:
    type: HFModel
    checkpoint: meta/llama-3

gemma-2:
   type: HFModel
   checkpoint: google/gemma2

As you can see if I wanna put down couple of models and eval them and pick one, multiple configs suffice but mixing with training data grows quickly. I could convert this into a parameter and try to use a unifed function that gets the right model but that also goes out of the kedro principle of doing data IO but me doing that manually instead. Hope the example extends on my point a bit more.

datajoely Jul 3, 2024
Collaborator

I would love to make this metadata a first class citizen:

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
   metadata:
     viz:
      layer: raw
+     kind: data # or (model/artifact)

We could take inspiration from Kubeflow's artifact types

astrojuanlu Jul 3, 2024
Maintainer Author

I wanna put down couple of models and eval them and pick one

Indeed, Kedro is a bit too obtuse for parallel experimentation, there's a lot of context about that in #1606

https://getindata.com/blog/kedro-dynamic-pipelines/ discusses one possible approach for tackling that, we know lots of users get there but we don't have enough context on whether it's enough or not.

I could convert this into a parameter and try to use a unifed function that gets the right model but that also goes out of the kedro principle of doing data IO

👀 indeed, parametrising the catalog itself in this way is not always easy.

You could have a look at runtime parameters like these: https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-override-configuration-with-runtime-parameters-with-the-omegaconfigloader

lordsoffallen Jul 5, 2024

I'm sure i can find a way to hack it but it's kind of a headache to hack things constantly as it suggests it's not designed for those workflows. Again, for standard ML required hacks are minimum so it's fine there but LLM workflows weren't a great fit so far which made to write extra connectors/workarounds and i ended up questioning the value i gained from kedro by using it.

datajoely · 2024-07-03T11:27:47Z

datajoely
Jul 3, 2024
Collaborator

Kedro design decisions

Kedro was built against scar tissue gained building 'traditional' ML models, let's say sklearn style for the sake of it. It's principles speak about simplicity, modularity and reproducibility. That last point is uncharted territory with non-deterministic systems like LLMs.

One of the tensions in supporting advanced user of Kedro has been the balance between supporting dynamic pipelines. With dynamism one quickly gets questions about conditional logic and our party line has always been some version of -

"we do not believe the combinatorial complexity conditional logic introduced is worth the maintenance burden trade off".

Kedro 🤝 LLM scepticism

Orchestrators have work at a higher level of abstraction than Kedro. This is to say where 1 node is often 1 K8s pod, and have opted to introduce this type of operator where the DAG complexity introduced is lower. Examples (1), (2), (3).

Personally I currently do not believe our current principles or design decisions around Kedro lend themselves well to the requirements of working with LLMs. Whether we change / relax these is another question.... If you look at the effort that's gone into Prefect Control Flow there is a great deal of work that's gone into handling IO of various function calls within the task DAG. This pattern is very different to Kedro's decoupled dataset centric way of doing IO outside of nodes.

Proposition: Let's make Kedro a great fit for deterministic 'function calling'

Our biggest most important issue stems back to #143 and the fact that the Kedro session isn't fit for injecting data at runtime #2169. I believe Kedro could be a great ergonomic solution for LLM systems to perform complex, custom "Function calls" as part of their wider task chains/graphs. If we solve these underlying issues... we'll solve a bunch of problems in #3094 and this LLM space.

4 replies

lordsoffallen Jul 3, 2024

How about keeping standard ML and LLM space separate as a package? kedro does what it does now with focus on standard flow and kedro-llm focuses on improving LLM workflow. I believe trying to optimize for both will not make more people happy but having 2 tools with different designs could potentially work. Not to mention to advantage of not dealing with breaking changes and countless brainstorming about how to integrate LLM workflow into standard kedro flow

datajoely Jul 3, 2024
Collaborator

Absolutely open to this- but I think we need to really nail down what kedro-llm does and why it deserves to exist to go in this direction.

lordsoffallen Jul 5, 2024

💯. A good approach would be to start understading workflows of LLM case from data scientists - this is important as I would like kedro-llm not become yet another tool made for business and software engineers who has no idea about ML workflow in general. -

astrojuanlu Jul 5, 2024
Maintainer Author

I just sent an email (if you didn't receive it, then I might have sent it to the wrong place 😅)

ElenaKhaustova · 2024-07-05T15:51:36Z

ElenaKhaustova
Jul 5, 2024
Collaborator

I see the value in providing a way to use LLMs with Kedro. As a minimum requirement, I would consider providing the mechanism to configure the model, switch between the LLMs, manage prompts, generate output, and get embeddings.

"They really aim for high level users with little to no experience in building ML models."

Here, the question is if "low-lever" users, let's say those who implement models/serve them in production, are going to use Kedro for that. IMO, we can start from the most obvious cases when people just create their pipelines using ready-to-use LLMs/ external APIs and need some mechanism to incorporate them and configure them, as this covers most of the cases.

I would also not avoid using langchain as it provides many things out of the box. Instead, I would try to provide an API on top of it and huggingface so that users could easily switch between different LLMs implementations.

When we worked with the langchain we found it convenient to work with chains - that combine llm and promt and provide a standardised interface to call the model with run-time parameters (prompt placeholders). So one can use different llms with the same interface. Example using latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/

We also came to dynamic model initialisation, in our case it can help users to switch between different models without need to add extra datasets (OpenAI, Cohere, Azure, etc).

If combine those ideas the catalog.yaml can look like that:

gpt_3_5_turbo:
   type: langchain.LLMDataset
   model_type: langchain_openai.ChatOpenAI
   kwargs:
     model: "gpt-3.5-turbo"
     temperature: 0.0
   credentials: openai
   prompt: "prompt" # we can even use parameters to provide them as it's usually more convenient to keep them in a separate files as they might be too long

         
bert:
   type: huggingface.LLMDataset
   model_type: transformers.TFBertModel
   kwargs:
     model: "bert-base-uncased"
     tokenizer: "bert-base-uncased"

As for dynamic promts - it can be partly solved via placeholders in the prompts that are filled at the runtime as it's done in langchain. So we can inject any content or the entire prompt during execution.

"Kedro was built against scar tissue gained building 'traditional' ML models, let's say sklearn style for the sake of it. It's principles speak about simplicity, modularity and reproducibility. That last point is uncharted territory with non-deterministic systems like LLMs."

I think there's a difference between adding some feature to the framework, which leads to non-reproducible behaviour, and using Kedro with non-deterministic models (via dataset). We cannot control the last as users can make it anyway but should we?

0 replies

lordsoffallen · 2025-02-13T16:31:44Z

lordsoffallen
Feb 13, 2025

So coming back to this topic after a long time. I have been working with kedro + Large models (text, image, audio) for quite sometime now and wanted to share my workflow/tooling:

Model Catalog

Using the kedro data catalog but customized to read models.yml file which looks like this:

claude-sonnet: &claude-sonnet
  type: &LLM projx.models.LLM
  backend: anthropic
  model: claude-3-5-sonnet-20241022
  credentials: anthropic-api-key
  model_params: &model-params
    temperature: !!float 0.6

claude-sonnet-cached: &claude-sonnet-cache
  type: *LLM
  backend: anthropic
  model: claude-3-5-sonnet-20241022
  credentials: anthropic-api-key
  model_params:
    <<: *model-params
    extra_headers:
      anthropic-beta: prompt-caching-2024-07-31

gpt-4o:
  type: *LLM
  backend: openai
  model: gpt-4o-2024-11-20
  credentials: openai-api-key
  model_params: *model-params

LLM type

As you noticed, I am using a custom LLM class here, which is wrapper around my library here. It basically unifies API provider class, similar to litellm but with very minimal dependencies. (Any other API library could be used as a replacement)

This just provides a unified call interface which then my custom LLM class use. Now, for the usage of this class, I have also created utility in the kedro which I called KedroMixin

class KedroMixin(AbstractDataset):
    """ A generic base class that enables classes to be defined in the catalog as is """

    def load(self) -> _DO:
        return self

    def save(self, data: _DI) -> None:
        raise NotImplementedError("Save is not supported by default. Add if needed")

    def _describe(self) -> dict[str, Any]:
        return self.__dict__

A class that I use to convert any class so I can use them in the catalog directly. Then very basic version of the LLM class looks like this:

class LLM(KedroMixin):
    def __init__(
        self,
        backend: str,
        model: str,
        credentials: str = None,
        model_params: dict = None,
        backend_kwargs: dict = None,
    ):
        self.backend = backend
        self.model = model
        self.credentials = credentials
        self.model_params = model_params if model_params is not None else {}
        self.backend_kwargs = backend_kwargs if backend_kwargs is not None else {}

        endpoint = get_backend(self.backend)
        self.llm = endpoint(api_key=self.credentials, **self.backend_kwargs)

        log_dir = PROJECT_PATH.joinpath("logs")
        log_dir.mkdir(parents=True, exist_ok=True)
        self.log_dir = log_dir
        self._logging = True    # True by default
        
   def __call__(prompt related params...):
        # here we make the call with prompt params

Prompts

Now we have the LLM models that we can just let kedro load but we need prompt management system. I have tried couple of different ways but landed on using the prompts.yml which is customized to be a kedro parameter.
It looks like this:

prompts:
  create_summaries:
    system: &default-system |-
      Only output what user has defined as output format and nothing else unless user explicitly asked for step by step analysis.
    user: |-
      You are a helpful AI assistant tasked with creating a detailed chapter summary. 

      You are given this chapter of a book:

      {{ chapter }}

      Now, your summary should follow this structure:
      ## Chapter Story
      Use bullet points for summaries, keeping each point brief and action-focused.
      
      ## Characters
      - Comma separated character names

      ## Locations
      - Define main locations the story take place and what happens there (briefly)

Now, we can either provide a template and fill the values during the runtime or no template but the add them to prompt itself. Since we pass this to the function it's up to user to do what they want. Quite flexible.

The important factor is that I used the exact functions names here. For instance, create_summaries is the name of the function. Obviously not a must but there if one benefit which I will mention later.

Putting it all together

Now that we have those in place, this is a sample view how kedro node now looks like:

def create_summaries(
    book: pd.DataFrame, llm: LLM, prompt: Prompts | dict
) -> str:
    # logic to make a call
    response = llm(prompt)
    ...

and how this node is referred:

node(
    func=create_summaries,
    name="create_summaries",
    inputs=["book", "gpt-4o", "params:prompts:create_summaries"],
    outputs="summaries#df",
)

Next Level

Now at this point everything works but one thing started to got to me and it was that my nodes are getting bigger because with every LLM call, i wanted to store outputs so I don't have to rerun that logic unless it was wrong. I ended up having 15-20 and it was getting problematic to manage this structure. I decided to improve this and ended up with the following improvements:

Node kwargs

I have realized that having input/output definitions as decorator would be easier to maintain and wanted to avoid using prompt param as long form string so I came up with the code below:

class PromptParam:
    prefix = "params:prompts"


def node_kwargs(**kwargs):
    def decorator(func):
        if "name" not in kwargs:
            kwargs["name"] = func.__name__
        ins = kwargs["inputs"]
        if isinstance(ins, list):
            # Args pattern
            for idx, i in enumerate(ins):
                if isinstance(i, PromptParam):
                    ins[idx] = f"{i.prefix}.{func.__name__}"
        elif isinstance(ins, dict):
            # kwargs pattern
            for k, v in ins.items():
                if isinstance(v, PromptParam):
                    ins[k] = f"{v.prefix}.{func.__name__}"
        kwargs["inputs"] = ins
        kwargs["func"] = func
        func.__node_kwargs__ = kwargs
        return func
    return decorator

what node_kwargs does is that it stores kedro node function specific kwargs in the function itself. Every function now would have __node_kwargs__ which holds kedro node arguments.

And the PromptParam class is just a small utility to generate params:prompts.FUNC NAME

Example

@node_kwargs(
    inputs=[SAMPLE_BOOK, "deepseek-v3", PromptParam()], outputs=CHAPTER_SUMMARIES
)
def create_summaries(
    book: pd.DataFrame, llm: LLM, prompt: Prompts | dict
) -> str:
    ....

Now, with that code, i now define a pipeline creation as follows:

def create_pipeline():
    return pipeline([node(**f.__node_kwargs__) for f in [create_summaries])

As you can see, I just refer the custom __node_kwargs__ here to automatically create the node.

Going Agentic

Now this works really well but I had cases where I needed to use multiple different models in a single node. I could definitely pass them one by one and apply my own logic but i had a repetitive case where I needed a big smarter model and smaller model. Something like a master LLM and worker LLM.

Now, our code works with LLM but making it agentic is tricky because i want to change models as i wish and not really update catalog every time i want to try a new model. So I create the AgenticLLM class:

class AgentLLM:
    def __init__(self, master: LLM, worker: LLM):
        self.master = master
        self.worker = worker

Now, I want to let kedro load multiple models and create this class for me ( it was annoying to do this in multiple nodes manually). I created the following custom logic:

class DataCatalog(io.DataCatalog):
    def __contains__(self, dataset_name: str) -> bool:
        """Check if an item is in the catalog as a materialised dataset or pattern"""
        if "::" in dataset_name:
            # This condition is added for runner checks to verify existing datasets
            ds_checks = [n.split(":")[1] for n in dataset_name.split("::")]
            return all([d in self for d in ds_checks])
        else:
            return super().__contains__(dataset_name)

    def _get_dataset(
        self,
        dataset_name: str,
        version: Version | None = None,
        suggest: bool = True,
    ) -> AbstractDataset:
        if "::" in dataset_name:
            # This condition is added for runner checks, return type does not matter
            # as it is ignored.
            datasets = [n.split(":")[1] for n in dataset_name.split("::")]
            return [
                self._get_dataset(dataset_name=ds, version=version, suggest=suggest)
                for ds in datasets
            ]
        else:
            return super()._get_dataset(
                dataset_name=dataset_name, version=version, suggest=suggest
            )

    def load(self, name: str, version: str | None = None) -> Any:
        """Loads a registered dataset.

        Args:
            name: A dataset to be loaded.
            version: Optional argument for concrete data version to be loaded.
                Works only with versioned datasets.

        Returns:
            The loaded data as configured.

        Raises:
            DatasetNotFoundError: When a dataset with the given name
                has not yet been registered.
        """

        if "::" in name:    # We have agent-llm design
            params = {}

            for n in name.split("::"):
                param, ds = n.split(":")
                params[param] = super().load(name=ds, version=version)

            return AgentLLM(**params)

        else:
            return super().load(name=name, version=version)

What this does is just enabling support for this syntax master:claude-sonnet::worker:mistral-large, this means following node will use agentic LLM now:

@node_kwargs(
    inputs=[SAMPLE_BOOK, "master:claude-sonnet::worker:deepseek-v3", PromptParam()], outputs=CHAPTER_SUMMARIES
)
def create_summaries(
    book: pd.DataFrame, llm: AgentLLM, prompt: Prompts | dict
) -> str:
    ....

That's what's been working for me for quite sometime. I wanted to compile into a kedro package and open source it but I am super busy and not quite sure when I will have the time to do so but I wanted to share my tips/tricks so perhaps it helps others who needs inspirations.

@astrojuanlu @datajoely Let me know what you guys think 🙌

2 replies

astrojuanlu Feb 13, 2025
Maintainer Author

Thanks for the super detailed writeup @lordsoffallen !! This is almost a blog post 😄 I'll chew over all this and get back with some thoughts in a few days

lordsoffallen Feb 13, 2025

Awesome. There was some other details but I wanted to keep it at a minimum 😆

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggested architecture for Kedro & LLMs #3979

{{title}}

Replies: 4 comments 11 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Suggested architecture for Kedro & LLMs #3979

astrojuanlu Jul 2, 2024 Maintainer

1. Frozen inputs

2. Wrapper object

Replies: 4 comments · 11 replies

lordsoffallen Jul 3, 2024

astrojuanlu Jul 3, 2024 Maintainer Author

lordsoffallen Jul 3, 2024

datajoely Jul 3, 2024 Collaborator

astrojuanlu Jul 3, 2024 Maintainer Author

lordsoffallen Jul 5, 2024

datajoely Jul 3, 2024 Collaborator

Kedro design decisions

Kedro 🤝 LLM scepticism

Proposition: Let's make Kedro a great fit for deterministic 'function calling'

lordsoffallen Jul 3, 2024

datajoely Jul 3, 2024 Collaborator

lordsoffallen Jul 5, 2024

astrojuanlu Jul 5, 2024 Maintainer Author

ElenaKhaustova Jul 5, 2024 Collaborator

lordsoffallen Feb 13, 2025

Model Catalog

LLM type

Prompts

Putting it all together

Next Level

Node kwargs

Going Agentic

astrojuanlu Feb 13, 2025 Maintainer Author

lordsoffallen Feb 13, 2025

astrojuanlu
Jul 2, 2024
Maintainer

Replies: 4 comments 11 replies

lordsoffallen
Jul 3, 2024

astrojuanlu Jul 3, 2024
Maintainer Author

datajoely Jul 3, 2024
Collaborator

astrojuanlu Jul 3, 2024
Maintainer Author

datajoely
Jul 3, 2024
Collaborator

datajoely Jul 3, 2024
Collaborator

astrojuanlu Jul 5, 2024
Maintainer Author

ElenaKhaustova
Jul 5, 2024
Collaborator

lordsoffallen
Feb 13, 2025

astrojuanlu Feb 13, 2025
Maintainer Author