Suggested architecture for Kedro & LLMs #3979
-
Hi @astrojuanlu, coming from our PR to share my thoughts on how Kedro should work with LLMs. My usual LLM workflow is like this: I need an open-source model, and that requires defining some params for a Hugging Face model. With LLMs in mind, there are lots of configs around managing models which I believe are suited to the Kedro catalog, but I don't personally like considering them as datasets, so some separation between the two would help.

For Hugging Face, I can imagine wrappers around getting a model and tokenizer, or a pipeline, depending on the use case, as a good start. Also the ability to save to local disk and/or to the Hugging Face Hub. This could apply to a dataset or a model as well; for instance, creating embeddings from a dataset and saving them to Hugging Face. Adding more vector store capabilities would be a nice abstraction, covering both open-source tools and some APIs (Qdrant etc.).

Now for the API endpoints: there were some nice packages (I forgot the name) which provide a unified entry point for all the different APIs like OpenAI, Claude, etc. That would definitely help. However, since this makes a call to an endpoint, useful features here would be things like cost tracking or logging. The ability to test this with a smaller local model would be super nice (not sure how, for now).

Lastly, there is some work around prompts: managing prompts, strings, etc. Maybe there is something to do there.

Overall, since Kedro aims to manage data I/O, simplifying getting the open-source models (even lazy loading big models) and making it seamless to work with APIs is a step in the right direction, IMO. Let me know if there are some points that weren't clear. Happy to explain in detail.
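To make the wrapper idea concrete, here is a minimal sketch of a catalog-friendly Hugging Face wrapper; the class name and save path are hypothetical, and it assumes the `transformers` pipeline API plus a Kedro version where `AbstractDataset.load`/`save` can be overridden directly (as in the snippets later in this thread):

```python
from typing import Any

from kedro.io import AbstractDataset
from transformers import pipeline


class HFPipelineDataset(AbstractDataset):  # hypothetical name, not an existing kedro-datasets class
    """Catalog entry that lazily builds a Hugging Face pipeline and can save it locally."""

    def __init__(self, task: str, model_name: str, pipeline_kwargs: dict | None = None):
        self._task = task
        self._model_name = model_name
        self._pipeline_kwargs = pipeline_kwargs or {}

    def load(self) -> Any:
        # Lazy: the (potentially large) model is only downloaded/instantiated when a node asks for it.
        return pipeline(task=self._task, model=self._model_name, **self._pipeline_kwargs)

    def save(self, data: Any) -> None:
        # `data` is expected to expose save_pretrained (pipelines, models and tokenizers do).
        data.save_pretrained(f"data/06_models/{self._model_name}")

    def _describe(self) -> dict[str, Any]:
        return {"task": self._task, "model_name": self._model_name}
```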
-
### Kedro design decisions

Kedro was built against scar tissue gained building 'traditional' ML models, let's say. One of the tensions in supporting advanced users of Kedro has been the balance around supporting dynamic pipelines. With dynamism one quickly gets questions about conditional logic, and our party line has always been some version of -

### Kedro 🤝 LLM scepticism

Orchestrators work at a higher level of abstraction than Kedro (this is to say, 1 node is often 1 K8s pod) and have opted to introduce this type of operator where the DAG complexity introduced is lower. Examples: (1), (2), (3).

Personally, I currently do not believe our current principles or design decisions around Kedro lend themselves well to the requirements of working with LLMs. Whether we change / relax these is another question...

If you look at the effort that's gone into Prefect Control Flow, a great deal of work has gone into handling I/O of the various function calls within the task DAG. This pattern is very different to Kedro's decoupled, dataset-centric way of doing I/O outside of nodes.

### Proposition: Let's make Kedro a great fit for deterministic 'function calling'

Our biggest, most important issue stems back to #143 and the fact that the Kedro session isn't fit for injecting data at runtime (#2169). I believe Kedro could be a great ergonomic solution for LLM systems to perform complex, custom "function calls" as part of their wider task chains/graphs. If we solve these underlying issues, we'll solve a bunch of problems in #3094 and in this LLM space.
-
I see the value in providing a way to use LLMs with Kedro. As a minimum requirement, I would consider providing the mechanism to configure the model, switch between the LLMs, manage prompts, generate output, and get embeddings.
Here, the question is whether "low-level" users, let's say those who implement models and serve them in production, are going to use Kedro for that. IMO, we can start from the most obvious cases, when people just create their pipelines using ready-to-use LLMs / external APIs and need some mechanism to incorporate and configure them, as this covers most of the cases. I would also not avoid using … When we worked with the …, we also came to dynamic model initialisation: in our case it can help users switch between different models (OpenAI, Cohere, Azure, etc.) without the need to add extra datasets. If we combine those ideas, the …
As for dynamic prompts, this can be partly solved via placeholders in the prompts that are filled at runtime, as it's done in …
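As a quick illustration of that pattern (a minimal sketch, not tied to any particular library): the template lives in configuration and the placeholders are filled at runtime inside a node.

```python
# Prompt template stored in configuration; only the named placeholders are
# filled at runtime from node inputs.
PROMPT_TEMPLATE = "Summarise the following chapter in {n_bullets} bullet points:\n{chapter}"


def build_prompt(chapter: str, n_bullets: int = 5) -> str:
    return PROMPT_TEMPLATE.format(chapter=chapter, n_bullets=n_bullets)
```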
I think there's a difference between adding some feature to the framework, which leads to non-reproducible behaviour, and using Kedro with non-deterministic models (via a dataset). We cannot control the latter, as users can do it anyway, but should we?
-
So, coming back to this topic after a long time. I have been working with Kedro + large models (text, image, audio) for quite some time now and wanted to share my workflow/tooling.

### Model Catalog

Using the Kedro data catalog, but customized to read:

```yaml
claude-sonnet: &claude-sonnet
  type: &LLM projx.models.LLM
  backend: anthropic
  model: claude-3-5-sonnet-20241022
  credentials: anthropic-api-key
  model_params: &model-params
    temperature: !!float 0.6

claude-sonnet-cached: &claude-sonnet-cache
  type: *LLM
  backend: anthropic
  model: claude-3-5-sonnet-20241022
  credentials: anthropic-api-key
  model_params:
    <<: *model-params
    extra_headers:
      anthropic-beta: prompt-caching-2024-07-31

gpt-4o:
  type: *LLM
  backend: openai
  model: gpt-4o-2024-11-20
  credentials: openai-api-key
  model_params: *model-params
```

### LLM type

As you noticed, I am using a custom `LLM` class here, which is a wrapper around my library. It basically unifies the API provider classes, similar to other unified-API libraries, and just provides a single call interface, which my custom class then builds on:

```python
class KedroMixin(AbstractDataset):
    """A generic base class that enables classes to be defined in the catalog as is."""

    def load(self) -> _DO:
        return self

    def save(self, data: _DI) -> None:
        raise NotImplementedError("Save is not supported by default. Add if needed")

    def _describe(self) -> dict[str, Any]:
        return self.__dict__
```

A class that I use to convert any class so I can use it in the catalog directly.
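One note, as an assumption on my part rather than something stated above: the `credentials: anthropic-api-key` / `openai-api-key` keys in the catalog would resolve against a standard Kedro `conf/local/credentials.yml` before landing in the `api_key=` argument below, e.g.:

```yaml
# conf/local/credentials.yml (illustrative placeholder values only)
anthropic-api-key: sk-ant-xxxx
openai-api-key: sk-xxxx
```

That way the API keys stay out of the shareable catalog file.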
Then a very basic version of the `LLM` class:

```python
class LLM(KedroMixin):
    def __init__(
        self,
        backend: str,
        model: str,
        credentials: str = None,
        model_params: dict = None,
        backend_kwargs: dict = None,
    ):
        self.backend = backend
        self.model = model
        self.credentials = credentials
        self.model_params = model_params if model_params is not None else {}
        self.backend_kwargs = backend_kwargs if backend_kwargs is not None else {}

        endpoint = get_backend(self.backend)
        self.llm = endpoint(api_key=self.credentials, **self.backend_kwargs)

        log_dir = PROJECT_PATH.joinpath("logs")
        log_dir.mkdir(parents=True, exist_ok=True)
        self.log_dir = log_dir
        self._logging = True  # True by default

    def __call__(self, prompt, **prompt_params):
        # here we make the call with the prompt params
        ...
```
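Because `load()` just returns `self`, pulling one of these entries out of the catalog hands back the fully configured object; a rough sketch of interactive use (e.g. in a `kedro ipython` session):

```python
# Rough sketch: the catalog returns the configured LLM object itself,
# so it can be called directly or passed into nodes as an input.
llm = catalog.load("claude-sonnet")
response = llm("Give me a one-sentence summary of this paragraph: ...")
```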
### Prompts

Now we have the LLM models that we can just let Kedro load, but we need a prompt management system. I have tried a couple of different ways but landed on keeping the prompts in the parameters:

```yaml
prompts:
  create_summaries:
    system: &default-system |-
      Only output what user has defined as output format and nothing else unless user explicitly asked for step by step analysis.
    user: |-
      You are a helpful AI assistant tasked with creating a detailed chapter summary.
      You are given this chapter of a book:
      {{ chapter }}
      Now, your summary should follow this structure:
      ## Chapter Story
      Use bullet points for summaries, keeping each point brief and action-focused.
      ## Characters
      - Comma separated character names
      ## Locations
      - Define main locations the story take place and what happens there (briefly)
```

Now, we can either provide a template and fill the values during runtime, or skip the template and add the values to the prompt itself. Since we pass this to the function, it's up to the user to do what they want. Quite flexible. The important factor is that I used the exact function names here; for instance, the `create_summaries` key matches the node function shown below.
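Since `{{ chapter }}` above is a Jinja-style placeholder, the runtime filling inside the node could look roughly like this (my assumption; the post doesn't show the rendering code):

```python
# Sketch of rendering the stored template at runtime; assumes Jinja2 and that
# `prompt` is the dict loaded from params:prompts.create_summaries.
from jinja2 import Template


def render_user_prompt(prompt: dict, chapter: str) -> str:
    return Template(prompt["user"]).render(chapter=chapter)
```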
### Putting it all together

Now that we have those in place, this is a sample view of how a Kedro node now looks:

```python
def create_summaries(
    book: pd.DataFrame, llm: LLM, prompt: Prompts | dict
) -> str:
    # logic to make a call
    response = llm(prompt)
    ...
```

... and how this node is referred to:
```python
node(
    func=create_summaries,
    name="create_summaries",
    inputs=["book", "gpt-4o", "params:prompts.create_summaries"],
    outputs="summaries#df",
)
```
### Next Level

Now at this point everything works, but one thing started to get to me: my nodes were getting bigger, because for every LLM call I wanted to store the outputs so I wouldn't have to rerun that logic unless it was wrong. I ended up having 15-20 of them, and it was getting problematic to manage this structure. I decided to improve this and ended up with the following improvements.

### Node kwargs

I realized that having the input/output definitions as a decorator would be easier to maintain, and I wanted to avoid passing the prompt param as a long-form string, so I came up with the code below:

```python
class PromptParam:
    prefix = "params:prompts"


def node_kwargs(**kwargs):
    def decorator(func):
        if "name" not in kwargs:
            kwargs["name"] = func.__name__

        ins = kwargs["inputs"]
        if isinstance(ins, list):
            # Args pattern
            for idx, i in enumerate(ins):
                if isinstance(i, PromptParam):
                    ins[idx] = f"{i.prefix}.{func.__name__}"
        elif isinstance(ins, dict):
            # kwargs pattern
            for k, v in ins.items():
                if isinstance(v, PromptParam):
                    ins[k] = f"{v.prefix}.{func.__name__}"

        kwargs["inputs"] = ins
        kwargs["func"] = func
        func.__node_kwargs__ = kwargs
        return func

    return decorator
```

What `node_kwargs` does is attach the full node definition to the function itself (as `__node_kwargs__`), and any `PromptParam` placeholder in the inputs gets expanded into the parameter path derived from the function name.

### Example
```python
@node_kwargs(
    inputs=[SAMPLE_BOOK, "deepseek-v3", PromptParam()], outputs=CHAPTER_SUMMARIES
)
def create_summaries(
    book: pd.DataFrame, llm: LLM, prompt: Prompts | dict
) -> str:
    ...
```
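To make the decorator's effect concrete, here is an illustration (mine, not from the original post) of what ends up attached to the function:

```python
# Illustration: the decorator stores the node definition on the function itself,
# with PromptParam() already expanded using the function name.
create_summaries.__node_kwargs__
# {
#     "inputs": [SAMPLE_BOOK, "deepseek-v3", "params:prompts.create_summaries"],
#     "outputs": CHAPTER_SUMMARIES,
#     "name": "create_summaries",
#     "func": create_summaries,
# }
```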
Now, with that code, I define the pipeline creation as follows:

```python
def create_pipeline():
    return pipeline([node(**f.__node_kwargs__) for f in [create_summaries]])
```

As you can see, I just refer to the custom `__node_kwargs__` attribute attached by the decorator.

### Going Agentic

Now this works really well, but I had cases where I needed to use multiple different models in a single node. I could definitely pass them one by one and apply my own logic, but I had a repetitive case where I needed a big, smarter model and a smaller model: something like a master LLM and a worker LLM. Now, our code works with the following:
```python
class AgentLLM:
    def __init__(self, master: LLM, worker: LLM):
        self.master = master
        self.worker = worker
```

Now, I want to let Kedro load multiple models and create this class for me (it was annoying to do this in multiple nodes manually). I created the following custom logic:
```python
class DataCatalog(io.DataCatalog):
    def __contains__(self, dataset_name: str) -> bool:
        """Check if an item is in the catalog as a materialised dataset or pattern."""
        if "::" in dataset_name:
            # This condition is added for runner checks to verify existing datasets
            ds_checks = [n.split(":")[1] for n in dataset_name.split("::")]
            return all(d in self for d in ds_checks)
        else:
            return super().__contains__(dataset_name)

    def _get_dataset(
        self,
        dataset_name: str,
        version: Version | None = None,
        suggest: bool = True,
    ) -> AbstractDataset:
        if "::" in dataset_name:
            # This condition is added for runner checks; the return type does not
            # matter as it is ignored.
            datasets = [n.split(":")[1] for n in dataset_name.split("::")]
            return [
                self._get_dataset(dataset_name=ds, version=version, suggest=suggest)
                for ds in datasets
            ]
        else:
            return super()._get_dataset(
                dataset_name=dataset_name, version=version, suggest=suggest
            )

    def load(self, name: str, version: str | None = None) -> Any:
        """Loads a registered dataset.

        Args:
            name: A dataset to be loaded.
            version: Optional argument for concrete data version to be loaded.
                Works only with versioned datasets.

        Returns:
            The loaded data as configured.

        Raises:
            DatasetNotFoundError: When a dataset with the given name
                has not yet been registered.
        """
        if "::" in name:  # We have the agent-llm design
            params = {}
            for n in name.split("::"):
                param, ds = n.split(":")
                params[param] = super().load(name=ds, version=version)
            return AgentLLM(**params)
        else:
            return super().load(name=name, version=version)
```
What this does is just enable support for this syntax:

```python
@node_kwargs(
    inputs=[SAMPLE_BOOK, "master:claude-sonnet::worker:deepseek-v3", PromptParam()],
    outputs=CHAPTER_SUMMARIES,
)
def create_summaries(
    book: pd.DataFrame, llm: AgentLLM, prompt: Prompts | dict
) -> str:
    ...
```
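For clarity, this is roughly what the composite input name resolves to via the overridden `load` above (my illustration):

```python
# Illustration: the "param:dataset" pairs are split on "::" and ":",
# each model is loaded from the catalog, and an AgentLLM is assembled.
agent = catalog.load("master:claude-sonnet::worker:deepseek-v3")
# -> AgentLLM(master=<LLM claude-sonnet>, worker=<LLM deepseek-v3>)
agent.master  # the bigger, smarter model
agent.worker  # the smaller model
```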
That's what's been working for me for quite some time. I wanted to compile it into a Kedro package and open-source it, but I am super busy and not quite sure when I will have the time to do so. Still, I wanted to share my tips/tricks, so perhaps it helps others who need inspiration. @astrojuanlu @datajoely Let me know what you guys think 🙌
-
I've been toying with GenAI & Kedro for a while, first with Hugging Face (thanks to our datasets) and now through Ollama, which gives an OpenAI-compatible HTTP API.

I'm trying to wrap my head around what would be the "blessed" way to implement this in Kedro.

### 1. Frozen inputs

One way would be to let the dataset include the prompt as part of the configuration, as follows:
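A minimal sketch of such a dataset (my reconstruction rather than the original snippet; it assumes Ollama's OpenAI-compatible `/v1/chat/completions` endpoint and the `requests` library, and imagines it registered in the catalog as `llama3_ds`):

```python
from typing import Any

import requests
from kedro.io import AbstractDataset


class OllamaPromptDataset(AbstractDataset):  # hypothetical name
    """Dataset whose configuration already contains the prompt; load() returns the completion."""

    def __init__(self, model: str, prompt: str, base_url: str = "http://localhost:11434/v1"):
        self._model = model
        self._prompt = prompt
        self._base_url = base_url

    def load(self) -> str:
        response = requests.post(
            f"{self._base_url}/chat/completions",
            json={"model": self._model, "messages": [{"role": "user", "content": self._prompt}]},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    def save(self, data: Any) -> None:
        raise NotImplementedError("Read-only dataset")

    def _describe(self) -> dict[str, Any]:
        return {"model": self._model, "prompt": self._prompt}
```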
so that `catalog.load("llama3_ds")` gives an output immediately.

Pro: The Kedro dataset performs all of the I/O needed. It's probably the "most pure" implementation. Could be modeled by using the already existing `APIDataset`.

Con: What if my prompt is not known at definition time?
### 2. Wrapper object

Another way of doing it would be returning a wrapper object to the node, which would then allow me to do things like calling the model with a prompt built at runtime.
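For instance (a sketch with hypothetical names, since the original snippets aren't reproduced here; again assuming Ollama's OpenAI-compatible endpoint):

```python
from dataclasses import dataclass
from typing import Any

import requests
from kedro.io import AbstractDataset


@dataclass
class OllamaClient:  # hypothetical wrapper returned to the node
    model: str
    base_url: str = "http://localhost:11434/v1"

    def chat(self, prompt: str) -> str:
        response = requests.post(
            f"{self.base_url}/chat/completions",
            json={"model": self.model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


class OllamaClientDataset(AbstractDataset):  # hypothetical name
    def __init__(self, model: str, base_url: str = "http://localhost:11434/v1"):
        self._model = model
        self._base_url = base_url

    def load(self) -> OllamaClient:
        # No I/O happens here; the node decides when and with which prompt to call.
        return OllamaClient(model=self._model, base_url=self._base_url)

    def save(self, data: Any) -> None:
        raise NotImplementedError

    def _describe(self) -> dict[str, Any]:
        return {"model": self._model}


def summarise(text: str, llm: OllamaClient) -> str:
    # Example node: the prompt depends on upstream data, i.e. it is dynamic.
    return llm.chat(f"Summarise this text:\n{text}")
```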
Pro: It allows dynamic prompts, which might be necessary for some use cases.
Con 1: The node is performing I/O, which means that we're "breaking" the Kedro notion of what the node and the dataset should do.
Con 2: It breaks reproducibility, given that the prompt is not static (in this case it was a `param`, but it could as well be the output of another node).

At the same time, we're already "breaking" that I/O separation in several places: `PartitionedDataset` https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load

Any thoughts?