chore(llmobs): consolidate internally to one experiment class #16568
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 643e608e76
ddtrace/llmobs/_experiment.py (Outdated)

```python
raise RuntimeError(
    "Cannot use SyncExperiment.run() from an async context. "
    "Use 'await exp.run()' with LLMObs.async_experiment() instead."
```
Allow sync experiment run when an event loop is active
SyncExperiment.run() now unconditionally raises in any thread with a running event loop, which breaks previously working sync call paths in environments like Jupyter notebooks and async services that invoke shared synchronous helpers. Before this refactor, LLMObs.experiment(...).run() used the synchronous implementation and did not hard-fail in these contexts, so this introduces a user-visible regression rather than a pure internal refactor.
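The failure mode and a possible fallback can be sketched as follows; `_run_async` and `run_sync` here are hypothetical stand-ins for the internal async implementation and the sync wrapper, not the actual dd-trace-py code:

```python
import asyncio
import threading


async def _run_async():
    # Stand-in for the internal async Experiment.run() (hypothetical).
    await asyncio.sleep(0)
    return "done"


def run_sync():
    """Run the async implementation from sync code, even under a live loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain scripts): asyncio.run() is safe here.
        return asyncio.run(_run_async())
    # A loop is already running (e.g. Jupyter): asyncio.run() would raise
    # RuntimeError, so drive the coroutine on a fresh loop in a worker
    # thread instead of hard-failing.
    result = {}

    def _worker():
        result["value"] = asyncio.run(_run_async())

    t = threading.Thread(target=_worker)
    t.start()
    t.join()
    return result["value"]


print(run_sync())
```

This is the general shape of the Jupyter-compatible fallback discussed later in this thread: detect the running loop and fall back to a worker thread rather than raising unconditionally.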
```diff
     summary_evaluators: Optional[Sequence[SummaryEvaluatorType]] = None,
     runs: Optional[int] = 1,
-) -> Experiment:
+) -> SyncExperiment:
```
is this a breaking change for existing users? i guess you're saying in the release notes it's not, but why :)
it has a run method that's the same as before
i guess customers might import the experiment type? not sure
I was operating under the assumption the Experiment class itself is internal and people are meant to only get to it through the factory method (since it's that way in the docs). We have some validation in the factory that you don't get if you go direct.
ddtrace/llmobs/_experiment.py (Outdated)

```python
        "Cannot use SyncExperiment.run() from an async context. "
        "Use 'await exp.run()' with LLMObs.async_experiment() instead."
    )
return asyncio.run(self._experiment.run(jobs=jobs, raise_errors=raise_errors, sample_size=sample_size))
```
this lets you run sync experiments in parallel?
no, it is just to re-use the async code for the sync case, to reduce duplication
but i guess my question is: does parallel sync execution still work? i think it did before, but maybe just for evals?
Yes, the concurrency still works as it did before. There is concurrency both for running the task on records and for running evaluations (there was before and there still is with this change).
in _run_task, all the _process_record() coroutines run concurrently in the gather(). Within each coroutine, if it's a sync task/evaluator, it gets run on the default thread pool executor, which by default has a fairly large number of max threads (I think it's like 5 * the number of CPUs). But on the async side, the asyncio.Semaphore controls how many coroutines are active, so we rely on that mostly to control how much concurrency there is.
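The concurrency pattern described here (a `gather()` bounded by a semaphore, with sync work offloaded to the default thread pool) can be sketched roughly like this; `process_record`, `slow_sync_task`, and `run_all` are illustrative names, not the actual internals:

```python
import asyncio
import time


def slow_sync_task(record):
    # Stand-in for a blocking sync task or evaluator.
    time.sleep(0.01)
    return record * 2


async def process_record(record, semaphore):
    # The semaphore caps how many coroutines are active at once (jobs).
    async with semaphore:
        # Sync work is offloaded to the default thread pool executor
        # so it does not block the event loop.
        return await asyncio.to_thread(slow_sync_task, record)


async def run_all(records, jobs=10):
    semaphore = asyncio.Semaphore(jobs)
    return await asyncio.gather(*(process_record(r, semaphore) for r in records))


results = asyncio.run(run_all(range(20)))
```

With `jobs=1` the records are effectively processed serially; raising `jobs` lets up to that many records overlap, which is the speedup the test script below measures.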
gsvigruha
left a comment
Looks like a relatively straightforward refactoring
@gsvigruha @Kyle-Verhoog I just created a more thorough test script to test out these changes locally: https://github.com/DataDog/llm-obs/blob/28ab7632b4bc6c40a1b44ade3022b6ebbc54d90f/demos/experiment_async/experiment_speed_test.py It demonstrates sync and async experiments (with an async task + a fast sync evaluator) and verifies there is a speedup in both cases when jobs is increased from 1 to 10. It also exercises the multi-run feature that was added in DnE recently, so we can see that working. Experiment run in the UI. Next I'm going to check how this change interacts with Jupyter notebooks, because I think there may be a risk there (they already have a running event loop).
@gsvigruha @Kyle-Verhoog Pushed one more change to resolve an issue with running a sync experiment from Jupyter. I used this test notebook to verify both sync and async experiments work (the async test is at the bottom): https://github.com/DataDog/llm-obs/blob/b641a7facba3f69303ba4b2258b9154c3681a6ca/demos/experiment_async/async_notebook_test.ipynb
```
- |
  LLM Observability: Consolidates the internal sync ``Experiment`` and ``AsyncExperiment`` classes into a single
  async-native ``Experiment`` class with a thin ``SyncExperiment`` wrapper for synchronous callers. This is an
  internal refactoring with no changes to the public API.
```
> internal refactoring with no changes to the public API
do we need a release note then?
What does any of this mean for a customer?
I wasn't sure if a release note is always required. Won't CI fail if I leave it out?
add the changelog/no-changelog label, and it'll skip the release note requirement check


Description
Following up on #16442, this is an internal refactor with no user-facing change.
Rename AsyncExperiment to Experiment (replacing the previous sync Experiment). Add a thin SyncExperiment wrapper class that holds an Experiment as an attribute and provides a sync run(). Now, even if a user runs an experiment synchronously, with a sync task and evaluators, internally we run the experiment using async. This way the experiment always runs async internally, which lets us have just one Experiment class and no base class, reducing duplication and simplifying maintenance.
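A minimal sketch of the wrapper shape described above, assuming the class names from this description; the method bodies are illustrative only, not the actual dd-trace-py implementation:

```python
import asyncio


class Experiment:
    # Async-native implementation: the single place the run logic lives
    # (illustrative body, not the real one).
    async def run(self, jobs=1):
        await asyncio.sleep(0)
        return {"jobs": jobs}


class SyncExperiment:
    """Thin wrapper holding an Experiment and exposing a sync run()."""

    def __init__(self, experiment):
        self._experiment = experiment

    def run(self, jobs=1):
        # Drive the async implementation to completion on a private loop.
        return asyncio.run(self._experiment.run(jobs=jobs))


result = SyncExperiment(Experiment()).run(jobs=4)
```

The design choice is that sync callers pay only a thin delegation cost, while all run/concurrency logic is maintained once in the async class.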
Doc with more background on the changes
To make it easier to review and easier for reviewers to feel confident, the existing sync tests are pretty much left as-is with minimal changes.
Left for future work
Testing
Will update here once I run the test script.
Risks
Potentially there could be issues for users running a sync experiment from a Jupyter notebook. Need to test this.
Additional Notes