Workflow Update and Signal handlers concurrency sample #123

Merged

Conversation

drewhoskins-temporal (Contributor) commented Jun 19, 2024:

What was changed

Added a ClusterManager sample that shows off workflow.wait_condition in handlers as well as the use of a mutex to guarantee atomicity.
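A minimal sketch of the two primitives the sample demonstrates (names here are illustrative, not the sample's exact code):

```python
import asyncio

from temporalio import workflow


@workflow.defn
class ClusterManagerWorkflow:
    def __init__(self) -> None:
        self.cluster_started = False
        self.nodes_lock = asyncio.Lock()

    @workflow.update
    async def allocate_nodes(self, job_name: str) -> None:
        # wait_condition blocks this handler until the predicate is true.
        # It yields control, so other handlers may interleave here.
        await workflow.wait_condition(lambda: self.cluster_started)
        # The mutex makes the read-modify-write of shared state atomic with
        # respect to other handlers that also await inside their bodies.
        async with self.nodes_lock:
            ...  # safely mutate shared node state
```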

Why?

As part of our effort to teach users about interleaving of blocking signal and update handlers, as well as about a workflow's reentrancy model in general, we are producing samples.

Checklist

1. Closes

2. How was this tested:

   poetry run pytest tests/updates_and_signals/atomic_message_handlers_test.py

3. Any docs updates needed?

Member:

README should be updated referencing this sample

cretz (Member) commented Jun 20, 2024:

We consider it bad practice to put non-workflow code in the same file as workflow code, so we don't do it in samples except for the hello ones (which we may change; see #49 and #67). Users have done bad things when combining code, since entire workflow files run in a sandbox, including all non-workflow code/imports. Can we break this out into separate files like the other non-hello samples?
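For reference, a sketch of the kind of split being asked for (file names follow the convention of other samples and are assumptions, not this PR's final layout):

```python
# workflow.py: keep only workflow code here. The sandbox re-imports this
# whole file, so activity and client code live in their own modules.
from temporalio import workflow

# Non-deterministic code the workflow references is passed through the
# sandbox explicitly rather than re-imported on every workflow run.
with workflow.unsafe.imports_passed_through():
    from activities import allocate_nodes_to_job
```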

drewhoskins-temporal (Author):

Yes, changing the hello samples would be great, as that's what I pattern-matched off of.

cretz (Member) commented Jun 24, 2024:

👍 Agreed (though I think there is some resistance to doing so), but yeah in the meantime I think matching the other whole-directory samples will work best here.

from temporalio.client import Client, WorkflowHandle
from temporalio.worker import Worker

# This samples shows off the key concurrent programming primitives for Workflows, especially
Member:

Suggested change:
- # This samples shows off the key concurrent programming primitives for Workflows, especially
+ # This sample shows off the key concurrent programming primitives for Workflows, especially

# - Running start_workflow with an initializer signal that you want to run before anything else.
#
@activity.defn
async def allocate_nodes_to_job(nodes: List[int], task_name: str) -> List[int]:
Member:

Return doesn't match type hint (here and elsewhere)

# - Running start_workflow with an initializer signal that you want to run before anything else.
#
@activity.defn
async def allocate_nodes_to_job(nodes: List[int], task_name: str) -> List[int]:
Member:

We usually discourage multi-param activities/workflows in favor of single dataclass instances with multiple fields
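A sketch of the single-dataclass pattern being suggested (the input class name matches what the sample adopts later in this review; the exact fields are assumptions):

```python
from dataclasses import dataclass
from typing import List

from temporalio import activity


@dataclass
class AllocateNodesToJobInput:
    nodes: List[str]
    job_name: str


@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput) -> None:
    # A single dataclass parameter keeps the signature backward-compatible:
    # fields can be added later without breaking existing callers.
    print(f"Assigning nodes {input.nodes} to job {input.job_name}")
```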

self.nodes_lock.release()

@workflow.run
async def run(self):
Member:

Would encourage explicit return type hints on workflow functions
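For instance (sketch; the str return here is purely illustrative, since the quoted run returns nothing yet):

```python
@workflow.run
async def run(self) -> str:
    ...
    return "Cluster shut down"
```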

Comment on lines 133 to 134
if self.cluster_shutdown:
break
Member:

Could just put break after the wait_condition line inside the try
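A sketch of the suggested restructure (the timeout value and surrounding loop shape are inferred from the quoted code; assumes the usual asyncio/timedelta imports):

```python
while True:
    try:
        await workflow.wait_condition(
            lambda: self.cluster_shutdown, timeout=timedelta(seconds=10)
        )
        # Reaching here means the condition became true (no timeout),
        # so exit directly instead of re-checking the flag afterwards.
        break
    except asyncio.TimeoutError:
        pass  # Timed out; fall through to the periodic work below.
```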


async with Worker(
client,
task_queue="tq",
Member:

To prevent clashing, in samples we try to name the task queue after the sample

async def test_atomic_message_handlers(client: Client):
async with Worker(
client,
task_queue="tq",
Member:

Would suggest unique task queues in tests

ClusterManager.run,
id=f"ClusterManager-{uuid.uuid4()}",
task_queue="tq",
id_reuse_policy=common.WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
Member:

Should not be necessary in tests (tests should be isolated where they shouldn't have to worry about other things that could be running)

cretz (Member) left a comment:

Mostly LGTM, only minor things

Member:

I think this can just be at a top-level directory of atomic_message_handlers, no need to nest an extra directory deep

drewhoskins-temporal (Author):

🤔 I wanted people to see updates and signals for discoverability, and we're planning at least one more updates sample.

Member:

We haven't usually grouped by those top-level features before but more by what the sample does. So we don't have interceptors/context_propagation and interceptors/sentry, just two top-level separate samples that use the same Temporal features. We just need to determine whether we want this type of grouping now and maybe apply it generally. I know our other samples repositories have also tried to avoid nesting most samples.

Member:

The primary README at the root of this repo should be updated to reference this sample

from temporalio import activity


@dataclass(kw_only=True)
Member:

Suggested change:
- @dataclass(kw_only=True)
+ @dataclass

Probably not needed, but no big deal

drewhoskins-temporal (Author):

I prefer named arguments in general for 2+ parameters. Cuts down on callsite bugs and makes them clearer.

Member:

You can still use named arguments. We use them in lots of places, but since we're the only users of them we don't need this setting to force us to use them. Also, we have a CI check for our samples on Python 3.8, and I don't think kw_only came about until 3.10 (we can look into relaxing our CI version constraints, though).

drewhoskins-temporal (Author):

Ah, too bad. I'd rather people who pattern-match off of this sample be directed toward best practices. Will remove for now. I wonder if we have stats on the Python versions people actually use in the wild?



@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput) -> List[str]:
Member:

This return type hint seems invalid (same with some other functions)

Comment on lines 48 to 49
id_reuse_policy=common.WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
start_signal="start_cluster",
Member:

While I understand this is demonstrating handlers, arguably for users there is not much value of combining these two options together. If you know you always want to do something at the start of the workflow you could call it at the start of the workflow (e.g. when there is no state). No problem with it being here though, may just be a bit confusing.

for i in range(6):
allocation_updates.append(
wf.execute_update(
ClusterManagerWorkflow.allocate_n_nodes_to_job, args=[f"task-{i}", 2]
Member:

I think we want to discourage multiple arguments to things (workflows, activities, signals, queries, updates, etc)
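A sketch of the single-argument form (the input dataclass matches the one the sample adds later in this review; the loop shape comes from the quoted starter code):

```python
allocation_updates = []
for i in range(6):
    allocation_updates.append(
        wf.execute_update(
            ClusterManagerWorkflow.allocate_n_nodes_to_job,
            # One dataclass argument instead of args=[f"task-{i}", 2]
            ClusterManagerAllocateNNodesToJobInput(num_nodes=2, task_name=f"task-{i}"),
        )
    )
await asyncio.gather(*allocation_updates)
```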

drewhoskins-temporal (Author):

Ah, I missed the updates, sorry.

) -> List[str]:
await workflow.wait_condition(lambda: self.state.cluster_started)
if self.state.cluster_shutdown:
raise RuntimeError(
Member:

This (and the ValueError below) are task failures. You may want to use ApplicationError.
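A sketch of the suggested change (the error message is taken from the revised code quoted later in this review):

```python
from temporalio import workflow
from temporalio.exceptions import ApplicationError


@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
    await workflow.wait_condition(lambda: self.state.cluster_started)
    if self.state.cluster_shutdown:
        # ApplicationError fails the update itself; a plain RuntimeError
        # is a workflow task failure, which makes the workflow retry
        # indefinitely instead of surfacing the error to the caller.
        raise ApplicationError(
            "Cannot allocate nodes to a job: Cluster is already shut down"
        )
    ...
```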

cluster_shutdown: bool = False
nodes: Optional[Dict[str, Optional[str]]] = None
max_assigned_nodes: int = 0
num_assigned_nodes: int = 0
Member:

Porting this to .NET, and I'm not sure there is value in storing this num-nodes field on "state" (it's also computed outside the lock, so it's a bit confusing).

nodes_to_free = [k for k, v in self.state.nodes.items() if v == task_name]
# This await would be dangerous without nodes_lock because it yields control and allows interleaving.
await self._deallocate_nodes_for_job(nodes_to_free, task_name)
return "Done"
Member:

Probably don't need to return a value from this update

@workflow.update
async def delete_job(self, task_name: str) -> str:
await workflow.wait_condition(lambda: self.state.cluster_started)
assert not self.state.cluster_shutdown
Member:

This should probably match the error from allocate (see comment there, this will fail task by default, may prefer ApplicationError)

)
await asyncio.gather(*deletion_updates)

await wf.signal(ClusterManagerWorkflow.shutdown_cluster)
Member:

Arguably shutdown could be an update that returns what the workflow returns instead of making it a two-step process (but this is fine too)

drewhoskins-temporal (Author):

Cool idea, and it would show off the power of update. Ran out of time this a.m., though.

@workflow.update
async def allocate_n_nodes_to_job(
self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
Contributor:

I think it would be nice to have docstrings on the signals and updates, e.g. explaining what the update returns. I'm thinking that this would help users understand why it's an update and how updates are useful.
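e.g. something like (sketch; the wording is drawn from comments that appear near the final code quoted below):

```python
@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
    """Attempt to find input.num_nodes free nodes and assign them to the job.

    Returns the list of assigned node names, so the caller can start
    sending work to those nodes as soon as the update completes.
    """
    ...
```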

self.state.nodes[node] = task_name

@workflow.update
async def delete_job(self, input: ClusterManagerDeleteJobInput):
Contributor:

This doesn't return anything, so I think readers will be wondering why it's an update rather than a signal.

Member:

Updates that don't return anything are totally reasonable (they can still raise errors for instance and just waiting on their completion means you know it completed, both of which are improvements over signals). However, I would strongly recommend a -> None type hint here.
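i.e., with the explicit hint:

```python
@workflow.update
async def delete_job(self, input: ClusterManagerDeleteJobInput) -> None:
    ...
```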

self.max_history_length
and workflow.info().get_current_history_length() > self.max_history_length
):
return True
Contributor:

This makes it more confusing from a pedagogical point of view. Might be nice to switch to e.g. using mock.patch in the test to control the continue-as-new limit. (Non-blocking comment)

drewhoskins-temporal (Author):

Note that it's not just a pytest affordance; it's also for the sample (there's a --test-continue-as-new argument).

return True
return False

# max_history_size - to more conveniently test continue-as-new, not to be used in production.
Contributor:

Should this comment be here?

f"Cannot allocate {input.num_nodes} nodes; have only {len(unassigned_nodes)} available"
)
assigned_nodes = unassigned_nodes[: input.num_nodes]
# This await would be dangerous without nodes_lock because it yields control and allows interleaving.
Contributor:

Would help users understand the locking even more if this comment said what it is that shouldn't be interleaved.

@dataclass(kw_only=True)
class ClusterManagerAllocateNNodesToJobInput:
num_nodes: int
task_name: str
Contributor:

Would be nice to use "job" xor "task" in names.

README.md Outdated
@@ -52,6 +52,7 @@ Some examples require extra dependencies. See each sample's directory for specif
* [hello_signal](hello/hello_signal.py) - Send signals to a workflow.
<!-- Keep this list in alphabetical order -->
* [activity_worker](activity_worker) - Use Python activities from a workflow in another language.
* [atomic_message_handlers](updates_and_signals/atomic_message_handlers/) - Safely handling updates and signals.
dandavison (Contributor) commented Jun 25, 2024:

I think the name of the sample should be changed to something like safe_message_handling. It's not about atomicity -- the sample doesn't demonstrate rolling back of incomplete side effects. Rather it's about maintaining strict isolation between handler executions, via serialization of handler executions. In any case, we don't want users to think this is showing a specialized form of message handling that they can ignore; we want them to consider whether they need this for any workflow with message handlers.

drewhoskins-temporal (Author):

Good idea. "Safe" feels much more like something I'm supposed to read.

Member:

Not the biggest fan of "safe" vs "atomic" since the latter is more discoverable/descriptive when looking at the list of samples, but I don't have a strong opinion here.

Contributor:

@cretz we could choose a word other than "safe", but I argued above that "atomic" isn't the right word.

cretz (Member) commented Jun 25, 2024:

I don't think "atomic" relates to rollback at all. Atomic just means one at a time or uninterruptible, as opposed to "transactional". Many also read it as meaning "quick" or "all or none", but I don't see it that way when I see it used. I think atomic is an OK word, but again I don't have a strong opinion. Also, "safe" has a lot of meanings for Temporal workflow code. Many users will be OK with their handlers running concurrently and will still be "safe". Maybe "serial" or something, unsure.

Contributor:

Yes, to be honest I'm not in love with "safe" and implying that any other usage style is not safe.

Hm, an atomic operation is one that either completes in its entirety or behaves as if it never started, and can't be seen in an intermediate state. So, if the operation has multiple stages with side effects, that would require some notion of rollback. It's usually synonymous with "transactional". I agree it's closely related to the idea of serializing executions so that they occur one at a time, since that's one way of ensuring that one execution can't see in-progress state of another, but using "atomic" would imply that message handling that does multiple writes can rollback incomplete changes. I think here we're talking about "serialized message processing" or "preventing corruption of shared state by message handlers".

drewhoskins-temporal (Author):

There's more here than just concurrency, such as dangling handlers. Sticking with safe.
I think I'm going to touch on idempotency as well in my next push, though we probably should also add a more focused idempotency sample.

drewhoskins-temporal (Author):

> Yes, to be honest I'm not in love with "safe" and implying that any other usage style is not safe.

I don't think it implies that. "Robust" is an alternative word.

drewhoskins-temporal (Author) commented Jun 26, 2024:

Update: added idempotency. I didn't use the built-in update ID, since it wasn't necessary here. Maybe that can be our separate idempotency sample.

cretz (Member) left a comment:

Some minor things. Also there seems to be a test failure in CI.

README.md Outdated
@@ -52,6 +52,7 @@ Some examples require extra dependencies. See each sample's directory for specif
* [hello_signal](hello/hello_signal.py) - Send signals to a workflow.
<!-- Keep this list in alphabetical order -->
* [activity_worker](activity_worker) - Use Python activities from a workflow in another language.
* [safe_message_handlers](updates_and_signals/safe_message_handlers/) - Safely handling updates and signals.
Member:

Should keep this list in alphabetical order (see comment a couple of lines above). Also not a fan of nesting these non-hello samples beneath a directory unnecessarily (you'll note we don't do this much in other samples here or in many samples repos). If you must inconsistently nest this sample, you may want nested bullets here.

Member:

We prefer tests to live under tests/ in the same directory structure they have at the top level. So /custom_converter/ tests are in /tests/custom_converter/, and therefore /updates_and_signals/safe_message_handlers/ tests should be under /tests/updates_and_signals/safe_message_handlers/ (granted, as mentioned in earlier comments, I don't think we should nest sample dirs).


To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the sample:
Member:

Sometimes people get confused that they can't just run these two commands in the same terminal because the first blocks. In our sample READMEs we usually make clear that the starter needs to be in a separate terminal.
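For example, something along these lines (sketch; the file names are assumptions about the sample's layout):

```
First, in one terminal, start the worker:

    poetry run python worker.py

Then, in a separate terminal, run the starter:

    poetry run python starter.py
```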

drewhoskins-temporal (Author):

I copy-pasted this. Looks like I got unlucky in which one I chose!

Member:

👍 Yeah we are admittedly not consistent, this is not a blocker or anything.



@activity.defn
async def allocate_nodes_to_job(input: AllocateNodesToJobInput):
Member:

Should provide type hints for every activity return, even if -> None, it helps callers

cluster_manager_handle = await client.start_workflow(
ClusterManagerWorkflow.run,
ClusterManagerInput(test_continue_as_new=should_test_continue_as_new),
id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
Member:

In other samples we have used fixed workflow IDs. We don't technically have to here, but using a unique ID makes the id_reuse_policy have no value, since the ID is always unique.

ClusterManagerWorkflow.run,
ClusterManagerInput(test_continue_as_new=should_test_continue_as_new),
id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
task_queue="atomic-message-handlers-task-queue",
Member:

Task queue was not changed to match the sample name

activities=[allocate_nodes_to_job, deallocate_nodes_for_job, find_bad_nodes],
):
# Wait until interrupted
logging.info("ClusterManagerWorkflow worker started, ctrl+c to exit")
Member:

Inconsistent use of logging vs print

from temporalio.client import Client, WorkflowHandle
from temporalio.common import RetryPolicy
from temporalio.exceptions import ApplicationError
from temporalio.worker import Worker
Contributor:

Several unused imports above. (We're switching to ruff for Python repos; the ruff VSCode extension will highlight these.)

self.sleep_interval_seconds: int = 600

@workflow.signal
async def start_cluster(self) -> None:
Contributor:

Doesn't need to be `async`, and I think there's a pedagogical argument for making it not async.

Contributor:

Actually, let me take that back. Somewhere out in the real world, there's a compute cluster that this workflow represents. For this example to be realistic, start_cluster would need to use an activity to make a network call in order to start that cluster (thus ensuring that workflow state is in sync with real cluster state).
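A sketch of what that could look like (the start_cluster_operation activity is hypothetical, not part of the sample; signal handlers in the Python SDK can be async and execute activities):

```python
@workflow.signal
async def start_cluster(self) -> None:
    # Hypothetical activity that makes the real network call to start the
    # cluster, so workflow state stays in sync with real cluster state.
    await workflow.execute_activity(
        start_cluster_operation,
        start_to_close_timeout=timedelta(seconds=10),
    )
    self.state.cluster_started = True
```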

return self.get_assigned_nodes(job_name=input.job_name)

async def _allocate_nodes_to_job(
self, assigned_nodes: List[str], job_name: str
Contributor:

I think we can choose either "allocate" or "assign" as the term used throughout this sample; "assign to" is probably better than "allocate to". The argument here could be named nodes_to_assign.

cluster_started: bool = False
cluster_shutdown: bool = False
nodes: Dict[str, Optional[str]] = dataclasses.field(default_factory=dict)
jobs_added: Set[str] = dataclasses.field(default_factory=set)
Contributor:

jobs_added isn't a great name :) Blindly naming according to semantics/function would yield something like jobs_with_nodes_assigned_already -- so, some more streamlined name that captures those semantics reasonably well?

I think there also needs to be a note somewhere explaining that our idempotency rules are that you cannot assign nodes twice to the same job; instead that will return a response indicating the already-assigned nodes.

"Cannot allocate nodes to a job: Cluster is already shut down"
)

async with self.nodes_lock:
Contributor:

Let's demonstrate timeout here (use asyncio.wait_for)
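A sketch of the timeout pattern being suggested (the 5-second value is an arbitrary choice for illustration; assumes ApplicationError is imported):

```python
try:
    # asyncio.wait_for works inside workflows: Temporal's workflow event
    # loop makes asyncio timers durable and deterministic. Fail the update
    # if the lock can't be acquired in time, rather than blocking forever.
    await asyncio.wait_for(self.nodes_lock.acquire(), timeout=5)
except asyncio.TimeoutError:
    raise ApplicationError("Timed out waiting to acquire the nodes lock")
try:
    ...  # perform checks and mutate self.state.nodes under the lock
finally:
    self.nodes_lock.release()
```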

drewhoskins-temporal (Author):

Going to punt on this for now to try to get this PR closed down.

# will cause the workflow to keep retrying and get it stuck.
raise ApplicationError(
f"Cannot allocate {input.num_nodes} nodes; have only {len(unassigned_nodes)} available"
)
Contributor:

Let's create a validator for this update. Minimally, it should reject if input.num_nodes is negative. If it also rejects when input.num_nodes == 0, then I believe we could get rid of the jobsWithNodesAssignedAlready data structure and replace the logic with a dynamic computation: input.job_name in nodes.values() (i.e., is the requested job in the set of job names that have at least one assigned node).

Regarding the dynamic/run-time-dependent len(unassigned_nodes) >= input.num_nodes check, I think it should not go in the validator, since (a) keeping it in the handler ensures that we do the work if-and-only-if it's possible at run time, and (b) there's probably an argument that statically valid requests that happened not to have available resources at run time should leave a record in history.

If any/all of this makes sense, perhaps it's worth adding the reasoning to comments in the code so people really understand the nuances.
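A sketch of what the validator could look like inside the workflow class (the Python SDK registers validators with @<update_handler>.validator; the exact checks follow the comment above):

```python
@workflow.update
async def allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> List[str]:
    ...

@allocate_n_nodes_to_job.validator
def validate_allocate_n_nodes_to_job(
    self, input: ClusterManagerAllocateNNodesToJobInput
) -> None:
    # Validators run before the handler; a raised exception rejects the
    # update without recording it in workflow history. Only statically
    # checkable conditions belong here; run-time checks such as node
    # availability stay in the handler so they leave a history record.
    if input.num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
```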

drewhoskins-temporal (Author):

This feels like valid (ha!) feedback, but for a separate PR.

# before sending work to those nodes.
# Returns the list of node names that were allocated to the job.
@workflow.update
async def allocate_n_nodes_to_job(
Contributor:

Do you definitely want the name to be assign_n_nodes rather than just assign_nodes? I vote the latter.


async with self.nodes_lock:
# Idempotency guard.
if input.job_name in self.state.jobs_added:
Contributor:

See below; this could be if input.job_name in self.state.nodes.values() rather than maintaining a separate materialized data structure.

)
nodes_to_assign = unassigned_nodes[: input.num_nodes]
# This await would be dangerous without nodes_lock because it yields control and allows interleaving
# with delete_job and perform_health_checks, which both touch self.state.nodes.
Contributor:

The checks above would also be dangerous without the node lock held, so let's either move this comment up to where we acquire the lock, or add additional comments up there. Probably the former:

> We need to acquire the lock here in order to perform some checks that depend on the contents of self.state.nodes and then to...

(I can have a stab at drafting that comment; it's a little involved to correctly document all the reasons why the lock must be held (atomicity of checks with mutation, and prevention of interleaving with other coroutines, are related but distinct), but it might be pedagogically valuable.)

drewhoskins-temporal (Author):

This is the part that makes the lock necessary, as there are no blocking calls above.

raise ApplicationError("Cannot delete a job: Cluster is already shut down")

async with self.nodes_lock:
nodes_to_free = [
Contributor:

"free" is a third synonym (we already have unassign, deallocate). As I mentioned above, my suggestion is to pick a single verb and use it everywhere. (So suggest nodes_to_unassign here.)

find_bad_nodes,
FindBadNodesInput(nodes_to_check=assigned_nodes),
start_to_close_timeout=timedelta(seconds=10),
# This health check is optional, and our lock would block the whole workflow if we let it retry forever.
Contributor:

Seeing as this is going in front of the public, we probably want to gesture at what good practices would be here. Currently, if the health check fails, this throws an unhandled exception in the main workflow loop, right?


dandavison (Contributor) left a comment:

This is great. Feel free to leave my comments for follow-on PRs (and flag those you disagree with on any timescale).

cretz (Member) commented Jul 2, 2024:

CI needs to pass, then we can merge.

dandavison added a commit to temporalio/samples-typescript that referenced this pull request Jul 2, 2024
dandavison added a commit to temporalio/samples-typescript that referenced this pull request Jul 3, 2024
* list -> set to fix a test
* Return a struct rather than a raw value from the list for better hygiene
* Remove test dependency on race conditions between health check and adding nodes.
CLAassistant commented Jul 8, 2024:

CLA assistant check
All committers have signed the CLA.

CLAassistant:

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ drewhoskins
❌ drewhoskins-temporal
You have signed the CLA already but the status is still pending? Let us recheck it.

drewhoskins-temporal (Author):

This is blocked on the Java Test Service being fixed for updates (or switching those tests away from using that time-skipping service)

@drewhoskins-temporal drewhoskins-temporal merged commit 35b476d into temporalio:main Jul 24, 2024
9 checks passed
dandavison added a commit to dandavison/temporalio-samples-typescript that referenced this pull request Aug 29, 2024