Workflow Update and Signal handlers concurrency sample #123
Conversation
README should be updated referencing this sample
We consider it bad practice to put non-workflow code in the same file as workflow code, so we don't do it in samples except for the hello ones (which we may change, see #49 and #67). Users have done bad things combining code, since entire workflow files run in a sandbox, including all non-workflow code/imports. Can we break this out into separate files like the other non-hello samples?
Yes, changing the hello samples would be great, as that's what I pattern-matched off of.
👍 Agreed (though I think there is some resistance to doing so), but yeah in the meantime I think matching the other whole-directory samples will work best here.
```python
from temporalio.client import Client, WorkflowHandle
from temporalio.worker import Worker

# This samples shows off the key concurrent programming primitives for Workflows, especially
```
Suggested change:
```diff
- # This samples shows off the key concurrent programming primitives for Workflows, especially
+ # This sample shows off the key concurrent programming primitives for Workflows, especially
```
```python
# - Running start_workflow with an initializer signal that you want to run before anything else.
#
@activity.defn
async def allocate_nodes_to_job(nodes: List[int], task_name: str) -> List[int]:
```
Return doesn't match type hint (here and elsewhere)
```python
# - Running start_workflow with an initializer signal that you want to run before anything else.
#
@activity.defn
async def allocate_nodes_to_job(nodes: List[int], task_name: str) -> List[int]:
```
We usually discourage multi-param activities/workflows in favor of single dataclass instances with multiple fields
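For reference, a minimal sketch of the single-dataclass pattern (the input class name mirrors what the sample later adopts; the fields are illustrative):

```python
from dataclasses import dataclass
from typing import List

from temporalio import activity


@dataclass
class AllocateNodesToJobInput:
    nodes: List[str]
    task_name: str


@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput) -> None:
    # A single dataclass parameter keeps the signature backward-compatible:
    # new fields can later be added with defaults without breaking callers.
    ...
```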
```python
self.nodes_lock.release()

@workflow.run
async def run(self):
```
Would encourage explicit return type hints on workflow functions
```python
if self.cluster_shutdown:
    break
```
Could just put `break` after the `wait_condition` line inside the `try`.
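A sketch of that restructuring, assuming the main loop waits on `cluster_shutdown` with a timeout the way the sample does:

```python
while True:
    try:
        await workflow.wait_condition(
            lambda: self.cluster_shutdown,
            timeout=timedelta(seconds=self.sleep_interval_seconds),
        )
        # The condition became true: shutdown was requested, so exit the
        # loop here instead of re-checking the flag after the try block.
        break
    except asyncio.TimeoutError:
        # Timed out waiting; fall through to periodic housekeeping.
        pass
```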
```python
async with Worker(
    client,
    task_queue="tq",
```
To prevent clashing, in samples we try to name the task queue after the sample
```python
async def test_atomic_message_handlers(client: Client):
    async with Worker(
        client,
        task_queue="tq",
```
Would suggest unique task queues in tests
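A sketch of that isolation pattern (worker arguments abbreviated to match the sample):

```python
import uuid

from temporalio.client import Client
from temporalio.worker import Worker


async def test_atomic_message_handlers(client: Client):
    # A unique task queue per test run keeps tests isolated from each
    # other and from any stray workers polling a shared queue name.
    task_queue = f"tq-{uuid.uuid4()}"
    async with Worker(
        client,
        task_queue=task_queue,
        workflows=[ClusterManagerWorkflow],
        activities=[allocate_nodes_to_job, deallocate_nodes_for_job, find_bad_nodes],
    ):
        ...
```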
```python
ClusterManager.run,
id=f"ClusterManager-{uuid.uuid4()}",
task_queue="tq",
id_reuse_policy=common.WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
```
Should not be necessary in tests (tests should be isolated and shouldn't have to worry about other things that could be running).
Mostly LGTM, only minor things
I think this can just be at a top-level directory of `atomic_message_handlers`, no need to nest an extra directory deep.
🤔 I wanted people to see updates and signals for discoverability, and we're planning at least one more updates sample.
We haven't usually grouped by those top-level features before, but more by what the sample does. So we don't have `interceptors/context_propagation` and `interceptors/sentry`, just two top-level separate samples that use the same Temporal features. We just need to determine whether we want this type of grouping now and maybe apply it generally. I know our other samples repositories have also tried to avoid nesting most samples.
The primary README at the root of this repo should be updated to reference this sample
```python
from temporalio import activity


@dataclass(kw_only=True)
```
Suggested change:
```diff
- @dataclass(kw_only=True)
+ @dataclass
```
Probably not needed, but no big deal.
I prefer named arguments in general for 2+ parameters. Cuts down on callsite bugs and makes them clearer.
You can still use named arguments. We use them in lots of places, but since we're the only users of them we don't need to set this setting to force us to use them. Also, we have a CI check for our samples in 3.8 and I don't think this came about until 3.10 (we can look into relaxing our CI version constraints though).
Ah, too bad. I'd rather people who pattern-match off of this sample be directed toward best practices. Will remove for now. I wonder if we have stats on python versions people actually use in the wild?
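To illustrate the distinction being discussed: `kw_only=True` (Python 3.10+) only *forces* keyword arguments; a plain `@dataclass` still accepts them, it just doesn't reject positional calls:

```python
from dataclasses import dataclass


@dataclass
class ClusterManagerInput:
    test_continue_as_new: bool = False


# Works on Python 3.8: keyword arguments are available without kw_only=True,
# they simply aren't enforced by the dataclass itself.
input = ClusterManagerInput(test_continue_as_new=True)
```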
```python
@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput) -> List[str]:
```
This return type hint seems invalid (same with some other functions)
```python
id_reuse_policy=common.WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
start_signal="start_cluster",
```
While I understand this is demonstrating handlers, arguably for users there is not much value in combining these two options. If you know you always want to do something at the start of the workflow, you could call it at the start of the workflow (e.g. when there is no state). No problem with it being here, though; it may just be a bit confusing.
```python
for i in range(6):
    allocation_updates.append(
        wf.execute_update(
            ClusterManagerWorkflow.allocate_n_nodes_to_job, args=[f"task-{i}", 2]
```
I think we want to discourage multiple arguments to things (workflows, activities, signals, queries, updates, etc)
Ah, I missed the updates, sorry.
```python
) -> List[str]:
    await workflow.wait_condition(lambda: self.state.cluster_started)
    if self.state.cluster_shutdown:
        raise RuntimeError(
```
This (and the `ValueError` below) are task failures. You may want to use `ApplicationError`.
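A sketch of the suggested change; by default an uncaught `RuntimeError` fails the workflow task (which retries indefinitely) rather than failing the update:

```python
from temporalio.exceptions import ApplicationError

# Inside the update handler:
if self.state.cluster_shutdown:
    # ApplicationError fails the update itself and surfaces to the caller,
    # instead of failing (and endlessly retrying) the workflow task.
    raise ApplicationError("Cannot allocate nodes: cluster is already shut down")
```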
```python
cluster_shutdown: bool = False
nodes: Optional[Dict[str, Optional[str]]] = None
max_assigned_nodes: int = 0
num_assigned_nodes: int = 0
```
Porting this to .NET and not sure there is value storing this num-node field on "state" (and it's built outside the lock, so it's a bit confusing).
```python
nodes_to_free = [k for k, v in self.state.nodes.items() if v == task_name]
# This await would be dangerous without nodes_lock because it yields control and allows interleaving.
await self._deallocate_nodes_for_job(nodes_to_free, task_name)
return "Done"
```
Probably don't need to return a value from this update
```python
@workflow.update
async def delete_job(self, task_name: str) -> str:
    await workflow.wait_condition(lambda: self.state.cluster_started)
    assert not self.state.cluster_shutdown
```
This should probably match the error from allocate (see comment there; this will fail the task by default, so you may prefer `ApplicationError`).
```python
)
await asyncio.gather(*deletion_updates)

await wf.signal(ClusterManagerWorkflow.shutdown_cluster)
```
Arguably shutdown could be an update that returns what the workflow returns instead of making it a two-step process (but this is fine too)
Cool idea, and it would show off the power of update. Ran out of time this A.M., though.
```python
@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
```
I think it would be nice to have docstrings on the signals and updates, e.g. explaining what the update returns. I'm thinking that this would help users understand why it's an update and how updates are useful.
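For example, an illustrative docstring along these lines:

```python
@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
    """Attempt to assign input.num_nodes nodes to the given job.

    Blocks until the cluster is started. Returns the names of the nodes
    assigned to the job -- which is why this is an update rather than a
    signal: the caller needs the result before sending work to the nodes.
    """
    ...
```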
```python
self.state.nodes[node] = task_name

@workflow.update
async def delete_job(self, input: ClusterManagerDeleteJobInput):
```
This doesn't return anything, so I think readers will be wondering why it's an update rather than a signal.
Updates that don't return anything are totally reasonable (they can still raise errors, for instance, and just waiting on their completion means you know it completed, both of which are improvements over signals). However, I would strongly recommend a `-> None` type hint here.
```python
    self.max_history_length
    and workflow.info().get_current_history_length() > self.max_history_length
):
    return True
```
This makes it more confusing from a pedagogical point of view. Might be nice to switch to e.g. using `mock.patch` in the test to control the CAN limit. (Non-blocking comment)
Note that it's not just a pytest affordance, it's also for the sample (there's a `--test-continue-as-new` argument).
```python
        return True
    return False

# max_history_size - to more conveniently test continue-as-new, not to be used in production.
```
Should this comment be here?
f"Cannot allocate {input.num_nodes} nodes; have only {len(unassigned_nodes)} available" | ||
) | ||
assigned_nodes = unassigned_nodes[: input.num_nodes] | ||
# This await would be dangerous without nodes_lock because it yields control and allows interleaving. |
Would help users understand the locking even more if this comment said what it is that shouldn't be interleaved.
```python
@dataclass(kw_only=True)
class ClusterManagerAllocateNNodesToJobInput:
    num_nodes: int
    task_name: str
```
Would be nice to use "job" xor "task" in names.
README.md (outdated):
```diff
@@ -52,6 +52,7 @@ Some examples require extra dependencies. See each sample's directory for specif
 * [hello_signal](hello/hello_signal.py) - Send signals to a workflow.
 <!-- Keep this list in alphabetical order -->
 * [activity_worker](activity_worker) - Use Python activities from a workflow in another language.
+* [atomic_message_handlers](updates_and_signals/atomic_message_handlers/) - Safely handling updates and signals.
```
I think the name of the sample should be changed to something like `safe_message_handling`. It's not about atomicity -- the sample doesn't demonstrate rolling back of incomplete side effects. Rather it's about maintaining strict isolation between handler executions, via serialization of handler executions. In any case, we don't want users to think this is showing a specialized form of message handling that they can ignore; we want them to consider whether they need this for any workflow with message handlers.
Good idea. "Safe" feels much more like something I'm supposed to read.
Not the biggest fan of "safe" vs "atomic" since the latter is more discoverable/descriptive when looking at the list of samples, but I don't have a strong opinion here.
@cretz we could choose a word other than "safe", but I argued above that "atomic" isn't the right word.
I don't think "atomic" relates to rollback at all. Atomic just means one at a time or uninterruptible, as opposed to "transactional". But many can also see it as meaning "quick" or "all or none", but I don't see it that way when I see it used. I think atomic is an ok word, but again I don't have a strong opinion. Also "safe" has a lot of meanings for Temporal workflow code. Many users will be ok w/ their handlers running concurrently and will still be "safe". Maybe "serial" or something, unsure.
Yes, to be honest I'm not in love with "safe" and implying that any other usage style is not safe.
Hm, an atomic operation is one that either completes in its entirety or behaves as if it never started, and can't be seen in an intermediate state. So, if the operation has multiple stages with side effects, that would require some notion of rollback. It's usually synonymous with "transactional". I agree it's closely related to the idea of serializing executions so that they occur one at a time, since that's one way of ensuring that one execution can't see in-progress state of another, but using "atomic" would imply that message handling that does multiple writes can rollback incomplete changes. I think here we're talking about "serialized message processing" or "preventing corruption of shared state by message handlers".
There's more here than just concurrency, such as dangling handlers. Sticking with safe.
I think I'm going to touch on idempotency as well in my next push, though we probably should also add a more focused idempotency sample.
> Yes, to be honest I'm not in love with "safe" and implying that any other usage style is not safe.
I don't think it implies that. "Robust" is an alternate word.
Update: added idempotency. I didn't use the built-in update ID, since it wasn't necessary here. Maybe that can be our separate idempotency sample.
Some minor things. Also there seems to be a test failure in CI.
README.md (outdated):
```diff
@@ -52,6 +52,7 @@ Some examples require extra dependencies. See each sample's directory for specif
 * [hello_signal](hello/hello_signal.py) - Send signals to a workflow.
 <!-- Keep this list in alphabetical order -->
 * [activity_worker](activity_worker) - Use Python activities from a workflow in another language.
+* [safe_message_handlers](updates_and_signals/safe_message_handlers/) - Safely handling updates and signals.
```
Should keep this list in alphabetical order (see comment a couple of lines above). Also not a fan of nesting these non-hello samples beneath a directory unnecessarily (you'll note we don't do this much in other samples here or in many samples repos). If you must inconsistently nest this sample, you may want nested bullets here.
We prefer tests to be in the same directory under `tests` as they are at the top level. So `/custom_converter/` tests are in `/tests/custom_converter/`, and therefore `/updates_and_signals/safe_message_handlers/` tests should be under `/tests/updates_and_signals/safe_message_handlers/` (granted, as mentioned in comments before, I don't think we should nest sample dirs).
```
To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the sample:
```
Sometimes people get confused that they can't just run these two commands in the same terminal because the first blocks. In our sample READMEs we usually make clear that the starter needs to be in a separate terminal.
I copy-pasted this. Looks like I got unlucky in which one I chose!
👍 Yeah we are admittedly not consistent, this is not a blocker or anything.
```python
@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput):
```
Should provide type hints for every activity return, even if `-> None`; it helps callers.
```python
cluster_manager_handle = await client.start_workflow(
    ClusterManagerWorkflow.run,
    ClusterManagerInput(test_continue_as_new=should_test_continue_as_new),
    id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
```
In other samples we have used fixed workflow IDs. We don't technically have to here, but it makes the `id_reuse_policy` have no value, since this ID is always unique.
```python
ClusterManagerWorkflow.run,
ClusterManagerInput(test_continue_as_new=should_test_continue_as_new),
id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
task_queue="atomic-message-handlers-task-queue",
```
Task queue was not changed to match the sample name
```python
    activities=[allocate_nodes_to_job, deallocate_nodes_for_job, find_bad_nodes],
):
    # Wait until interrupted
    logging.info("ClusterManagerWorkflow worker started, ctrl+c to exit")
```
Inconsistent use of logging vs print
```python
from temporalio.client import Client, WorkflowHandle
from temporalio.common import RetryPolicy
from temporalio.exceptions import ApplicationError
from temporalio.worker import Worker
```
Several unused imports above. (We're switching to ruff for Python repos -- using the ruff VSCode extension will make sense and highlights these.)
```python
self.sleep_interval_seconds: int = 600

@workflow.signal
async def start_cluster(self) -> None:
```
Doesn't need to be `async`, and I think there's a pedagogical argument for making it not async.
Actually, let me take that back. Somewhere out in the real world, there's a compute cluster that this workflow represents. For this example to be realistic, `start_cluster` would need to use an activity to make a network call in order to start that cluster (thus ensuring that workflow state is in sync with real cluster state).
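A sketch of that more realistic shape; `start_cluster_activity` is hypothetical and not part of the sample:

```python
from datetime import timedelta

from temporalio import workflow


@workflow.signal
async def start_cluster(self) -> None:
    # Hypothetical: start the real cluster via an activity so workflow
    # state stays in sync with the outside world. Needing this await is
    # what would justify keeping the handler async.
    await workflow.execute_activity(
        start_cluster_activity,  # hypothetical activity
        start_to_close_timeout=timedelta(seconds=30),
    )
    self.state.cluster_started = True
```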
```python
return self.get_assigned_nodes(job_name=input.job_name)

async def _allocate_nodes_to_job(
    self, assigned_nodes: List[str], job_name: str
```
I think we can choose either "allocate" or "assign" as the term used throughout this sample; "assign to" is probably better than "allocate to". The argument here could be named `nodes_to_assign`.
```python
cluster_started: bool = False
cluster_shutdown: bool = False
nodes: Dict[str, Optional[str]] = dataclasses.field(default_factory=dict)
jobs_added: Set[str] = dataclasses.field(default_factory=set)
```
`jobs_added` isn't a great name :) Blindly naming according to semantics/function would yield something like `jobs_with_nodes_assigned_already` -- so, some more streamlined name that captures those semantics reasonably well? I think there also needs to be a note somewhere explaining that our idempotency rule is that you cannot assign nodes twice to the same job; instead that will return a response indicating the already-assigned nodes.
"Cannot allocate nodes to a job: Cluster is already shut down" | ||
) | ||
|
||
async with self.nodes_lock: |
Let's demonstrate timeout here (use `asyncio.wait_for`).
Going to punt on this for now to try to get this PR closed down.
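For the record, a sketch of what demonstrating the timeout could look like (the 30-second budget is an arbitrary illustration):

```python
import asyncio

from temporalio.exceptions import ApplicationError

# Inside the update handler:
try:
    # Bound how long the handler waits for the lock so a stuck operation
    # elsewhere cannot block this update indefinitely.
    await asyncio.wait_for(self.nodes_lock.acquire(), timeout=30)
except asyncio.TimeoutError:
    raise ApplicationError("Timed out waiting for the nodes lock")
try:
    ...  # read and mutate self.state.nodes under the lock
finally:
    self.nodes_lock.release()
```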
```python
# will cause the workflow to keep retrying and get it stuck.
raise ApplicationError(
    f"Cannot allocate {input.num_nodes} nodes; have only {len(unassigned_nodes)} available"
)
```
Let's create a validator for this update. Minimally, it should reject if `input.num_nodes` is negative. If it also rejects when `input.num_nodes == 0`, then I believe we could get rid of the `jobsWithNodesAssignedAlready` data structure and replace the logic with a dynamic computation: `input.job_name in nodes.values()` (i.e. is the requested job in the set of job names that have at least one assigned node).

Regarding the dynamic, run-time-dependent `len(unassigned_nodes) >= input.num_nodes` check, I think it should not go in the validator, since (a) that ensures we do the work if-and-only-if it's possible at run-time, and (b) there's probably an argument that statically valid requests that happened not to have available resources at run-time should leave a record in history.

If any/all of this makes sense, perhaps it's worth adding the reasoning to comments in the code to really get people to understand the nuances.
This feels like valid (ha!) feedback, but for a separate PR.
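A minimal sketch of the suggested validator, using the Python SDK's `.validator` decorator (the zero/negative check is the statically checkable part; availability checks stay in the handler):

```python
@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
    ...

@allocate_n_nodes_to_job.validator
def validate_allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> None:
    # Validators run before the update is accepted into history, so a
    # rejected update leaves no trace. Only statically checkable rules
    # belong here; run-time-dependent checks (node availability) do not.
    if input.num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
```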
```python
# before sending work to those nodes.
# Returns the list of node names that were allocated to the job.
@workflow.update
async def allocate_n_nodes_to_job(
```
Do you definitely want the name to be `assign_n_nodes` rather than just `assign_nodes`? I vote the latter.
```python
async with self.nodes_lock:
    # Idempotency guard.
    if input.job_name in self.state.jobs_added:
```
See below, this could be `if input.job_name in self.state.nodes.values()` rather than maintaining a separate materialized data structure.
```python
)
nodes_to_assign = unassigned_nodes[: input.num_nodes]
# This await would be dangerous without nodes_lock because it yields control and allows interleaving
# with delete_job and perform_health_checks, which both touch self.state.nodes.
```
The checks above would also be dangerous without the node lock held, so let's either move this comment up to where we acquire the lock, or add additional comments up there. Probably the former: "We need to acquire the lock here in order to perform some checks that depend on the contents of `self.state.nodes` and then to..." (I can have a stab at drafting that comment; it's a little bit involved to correctly document all the reasons why the lock must be held -- atomicity of checks with mutation and prevention of interleaving with other coroutines are related but distinct -- but it might be pedagogically valuable.)
This is the part that makes the lock necessary as there are no blocking calls above.
raise ApplicationError("Cannot delete a job: Cluster is already shut down") | ||
|
||
async with self.nodes_lock: | ||
nodes_to_free = [ |
"free" is a third synonym (we already have unassign, deallocate). As I mentioned above, my suggestion is to pick a single verb and use it everywhere. (So suggest nodes_to_unassign
here.)
```python
find_bad_nodes,
FindBadNodesInput(nodes_to_check=assigned_nodes),
start_to_close_timeout=timedelta(seconds=10),
# This health check is optional, and our lock would block the whole workflow if we let it retry forever.
```
Seeing as this is going in front of the public we probably want to gesture to what good practices would be here. Currently, if the health check fails, this throws an unhandled exception in the main wf loop, right?
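One possible shape, sketched under the assumption that the check is best-effort: cap the retries and contain the failure so it doesn't escape the main loop (`ActivityError` is the wrapper the SDK raises when an activity ultimately fails):

```python
from temporalio.exceptions import ActivityError

try:
    bad_nodes = await workflow.execute_activity(
        find_bad_nodes,
        FindBadNodesInput(nodes_to_check=assigned_nodes),
        start_to_close_timeout=timedelta(seconds=10),
        # Cap retries: this runs while holding nodes_lock, and unlimited
        # retries would block every handler that needs the lock.
        retry_policy=RetryPolicy(maximum_attempts=3),
    )
except ActivityError:
    # Best-effort health check: log and try again on the next interval
    # rather than letting the exception fail the whole workflow.
    workflow.logger.warning("Health check failed; will retry next interval")
    bad_nodes = []
```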
This is great. Feel free to leave my comments for follow-on PRs (and flag those you disagree with on any timescale).
CI needs to pass, then we can merge.
* list -> set to fix a test
* Return a struct rather than a raw value from the list for better hygiene
* Remove test dependency on race conditions between health check and adding nodes.
This is blocked on the Java Test Service being fixed for updates (or switching those tests away from using that time-skipping service).
Skip update tests under Java test server
What was changed
Added a ClusterManager sample that shows off `workflow.wait_condition` in handlers, as well as the use of a mutex to guarantee atomicity.

Why?
As part of our effort to teach users about interleaving of blocking signal and update handlers, as well as about a workflow's reentrancy model in general, we are producing samples.
Checklist
Closes
How was this tested:
`poetry run pytest tests/updates_and_signals/atomic_message_handlers_test.py`