Commit
improve async error reporting (#4550)
Improve async error reporting of executions from compute nodes back to orchestrators and the job store, such as errors related to the docker executor, S3 publisher and input sources.

The PR does the following:
1. Enriches S3 errors with the AWS error code and more metadata
2. Uses the new `bacerrors.Error` for errors returned by Docker
3. Adds a new `ErrorCode` to `models.Event` details, and populates that value with bacerrors `{Component}:{ErrorCode}`, such as `S3Publisher:NoSuchBucket` and `Docker:ImageNotFound`
4. Introduces a new `Details` field to the execution's compute state, which holds additional metadata about the latest state of the execution, mainly the `ErrorCode`
5. Publishes `ErrorCode` to otel analytics

### Examples:

#### Bad docker image
```
→ bacalhau docker run non_existent_image
Job successfully submitted. Job ID: j-29a81940-18a2-44b7-b0da-807d45946f45
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC          EVENT
 22:37:32.323              Submission     Job submitted
 22:37:32.340  e-640f0876  Scheduling     Requested execution on n-7c5b7d69
                                          * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:34.569  e-640f0876  Exec Scanning  Error: image not available: "non_existent_image"
                                          Hint: To resolve this, either:
                                          1. Check if the image exists in the registry and the name is correct
                                          2. If the image is private, supply the node with valid Docker login credentials using the DOCKER_USERNAME and DOCKER_PASSWORD environment variables
                                          * ErrorCode: Docker:ImageNotFound
                                          * Image: non_existent_image
 22:37:34.585  e-a3a3afe2  Scheduling     Requested execution on n-7c5b7d69
                                          * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:36.732  e-a3a3afe2  Exec Scanning  Error: image not available: "non_existent_image"
                                          Hint: To resolve this, either:
                                          1. Check if the image exists in the registry and the name is correct
                                          2. If the image is private, supply the node with valid Docker login credentials using the DOCKER_USERNAME and DOCKER_PASSWORD environment variables
                                          * ErrorCode: Docker:ImageNotFound
                                          * Image: non_existent_image

Error: job failed

To get more details about the run, execute:
    bacalhau job describe j-29a81940-18a2-44b7-b0da-807d45946f45

To get more details about the run executions, execute:
    bacalhau job executions j-29a81940-18a2-44b7-b0da-807d45946f45

bacalhau job executions j-29a81940-18a2-44b7-b0da-807d45946f45 --output yaml
- AllocatedResources:
    Tasks: {}
  ComputeState:
    Message: 'image not available: "non_existent_image"'
    StateType: 8
  CreateTime: 1727642252340926000
  DesiredState:
    Message: execution failed
    StateType: 2
  EvalID: ecad787d-e72a-4987-b353-cd6552d546bf
  FollowupEvalID: ""
  ID: e-640f0876-119b-40ac-883e-b2126b5a40f3
  JobID: j-29a81940-18a2-44b7-b0da-807d45946f45
  ModifyTime: 1727642254570170000
  Name: ""
  Namespace: default
  NextExecution: ""
  NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
  PreviousExecution: ""
  PublishedResult:
    Type: ""
  Revision: 3
  RunOutput: null
- AllocatedResources:
    Tasks: {}
  ComputeState:
    Message: 'image not available: "non_existent_image"'
    StateType: 8
  CreateTime: 1727642254585495000
  DesiredState:
    Message: execution failed
    StateType: 2
  EvalID: ef3bae6f-54fb-4f48-9b83-98364049e685
  FollowupEvalID: ""
  ID: e-a3a3afe2-5d12-498f-ad19-86ea00425d30
  JobID: j-29a81940-18a2-44b7-b0da-807d45946f45
  ModifyTime: 1727642256732971000
  Name: ""
  Namespace: default
  NextExecution: ""
  NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
  PreviousExecution: ""
  PublishedResult:
    Type: ""
  Revision: 3
  RunOutput: null
```

#### Bad S3 bucket
```
→ bacalhau job run docker-s3.yaml
Job successfully submitted. Job ID: j-036bc69b-7b81-489b-a714-d1349d6e6f5b
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC       EVENT
 22:36:57.853              Submission  Job submitted
 22:36:57.868  e-ad0ab10c  Scheduling  Requested execution on n-7c5b7d69
                                       * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:36:57.929  e-ad0ab10c  Execution   Running
 22:37:03.414  e-ad0ab10c  Publishing  Error: failed to publish s3 result: operation error S3: PutObject, https response error StatusCode: Results 404, RequestID: 62FSTZ2400AA0782, api error NoSuchBucket: The specified bucket does not exist
                                       * AWSRequestID: 62FSTZ2400AA0782
                                       * ErrorCode: S3Publisher:NoSuchBucket
                                       * Operation: PutObject
                                       * Service: S3
 22:37:03.432  e-995b726b  Scheduling  Requested execution on n-7c5b7d69
                                       * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:03.482  e-995b726b  Execution   Running
 22:37:07.085  e-995b726b  Publishing  Error: failed to publish s3 result: operation error S3: PutObject, https response error StatusCode: Results 404, RequestID: YNJQY666GB15CT3K, api error NoSuchBucket: The specified bucket does not exist
                                       * Operation: PutObject
                                       * Service: S3
                                       * AWSRequestID: YNJQY666GB15CT3K
                                       * ErrorCode: S3Publisher:NoSuchBucket

Error: job failed

To get more details about the run, execute:
    bacalhau job describe j-036bc69b-7b81-489b-a714-d1349d6e6f5b

To get more details about the run executions, execute:
    bacalhau job executions j-036bc69b-7b81-489b-a714-d1349d6e6f5b

To download the results, execute:
    bacalhau job get j-036bc69b-7b81-489b-a714-d1349d6e6f5b
```
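
#### Sketch: composing the `ErrorCode` detail

To illustrate the `{Component}:{ErrorCode}` convention described above, here is a minimal Go sketch. It is not the code from this PR: `codedError`, `eventErrorCode`, `eventDetails` and `splitErrorCode` are hypothetical names made up for illustration, and the real `bacerrors.Error` and `models.Event` types may be shaped differently. It only shows how a component and a code could be joined into one detail value and split again by a consumer.

```go
package main

import (
	"fmt"
	"strings"
)

// codedError is a hypothetical stand-in for a structured error.
// It is NOT bacalhau's bacerrors.Error; field names are assumptions.
type codedError struct {
	Component string            // e.g. "Docker" or "S3Publisher"
	Code      string            // e.g. "ImageNotFound" or "NoSuchBucket"
	Message   string            // human-readable message
	Details   map[string]string // extra metadata, e.g. "Image" or "Operation"
}

func (e *codedError) Error() string { return e.Message }

// eventErrorCode joins component and code in the
// "{Component}:{ErrorCode}" form used in the examples above.
func eventErrorCode(e *codedError) string {
	return e.Component + ":" + e.Code
}

// eventDetails copies the error's metadata into a new map and adds
// the composed value under the "ErrorCode" key. The input is not mutated.
func eventDetails(e *codedError) map[string]string {
	out := make(map[string]string, len(e.Details)+1)
	for k, v := range e.Details {
		out[k] = v
	}
	out["ErrorCode"] = eventErrorCode(e)
	return out
}

// splitErrorCode splits a composed value at the first colon.
// ok is false when the value contains no colon.
func splitErrorCode(s string) (component, code string, ok bool) {
	return strings.Cut(s, ":")
}

func main() {
	err := &codedError{
		Component: "Docker",
		Code:      "ImageNotFound",
		Message:   `image not available: "non_existent_image"`,
		Details:   map[string]string{"Image": "non_existent_image"},
	}

	details := eventDetails(err)
	fmt.Println(details["ErrorCode"]) // prints "Docker:ImageNotFound"

	component, code, _ := splitErrorCode(details["ErrorCode"])
	fmt.Println(component, code) // prints "Docker ImageNotFound"
}
```

Splitting at the first colon keeps the component unambiguous even if a code were ever to contain a colon itself; whether the real codes can do so is not stated in this commit.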