DataFlowFlexTemplateJob: long updates lose job ID #2571

Open
3 tasks done
justinsb opened this issue Aug 27, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

@justinsb (Collaborator)

Checklist

Bug Description

We had a report that a DataflowFlexTemplateJob can become detached from the underlying job when a (streaming) job update takes more than 20 minutes.

Proposed approaches:

  • Increase the timeout (sort of a hack, but works around the problem)
  • Fix the terraform-based reconciler for this scenario
  • Replace the terraform-based reconciler with a direct reconciler that does not have this issue

Additional Diagnostic Information

This makes logical sense: the update does create a new job ID, and based on the code it is likely that we don't correctly record the new job ID when the update returns an error or times out.
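
For illustration, here is a rough sketch of how a direct reconciler could record the replacement job ID even when the update call errors or times out. This is not the actual Config Connector or terraform-provider code; `dataflowClient` and `persistJobID` are hypothetical stand-ins.

```go
// Sketch: even if the update fails or times out, look the job up by name and
// persist whatever ID is found before returning the error, so the KRM
// resource does not become detached from the underlying job.
package reconciler

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dataflowClient is a hypothetical abstraction over the Dataflow API.
type dataflowClient interface {
	// LaunchUpdate starts an in-place update of a streaming job and returns
	// the ID of the replacement job once the launch completes.
	LaunchUpdate(ctx context.Context, jobName string) (newJobID string, err error)
	// FindJobByName returns the ID of the newest active job with the given
	// name, or "" if none exists.
	FindJobByName(ctx context.Context, jobName string) (jobID string, err error)
}

// updateJob updates the streaming job and makes a best-effort attempt to
// record the replacement job ID even when the update errors or times out.
func updateJob(ctx context.Context, c dataflowClient, jobName string, persistJobID func(string) error) error {
	newID, updateErr := c.LaunchUpdate(ctx, jobName)

	if updateErr != nil {
		// The update may still have created a replacement job (for example,
		// the call timed out after the new job was already launched). Look it
		// up by name on a fresh, short deadline so its ID can be recorded
		// before the original error is surfaced.
		lookupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if foundID, err := c.FindJobByName(lookupCtx, jobName); err == nil && foundID != "" {
			newID = foundID
		}
	}

	if newID != "" {
		if err := persistJobID(newID); err != nil {
			return errors.Join(updateErr, fmt.Errorf("recording job ID %q: %w", newID, err))
		}
	}
	return updateErr
}
```

The key point is that the lookup runs on its own short deadline, so a timed-out update still leaves the resource pointing at the replacement job.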

Kubernetes Cluster Version

n/a

Config Connector Version

n/a

Config Connector Mode

namespaced mode (default)

Log Output

No response

Steps to reproduce the issue

n/a

YAML snippets

No response

@justinsb justinsb added the bug Something isn't working label Aug 27, 2024
@justinsb justinsb added this to the 1.122 milestone Sep 6, 2024
@justinsb justinsb self-assigned this Sep 7, 2024
@jonapgar-groupby commented Oct 8, 2024

I have what I believe is the same issue: Dataflow jobs are being constantly recreated, roughly every twenty minutes. The jobs being replaced are running fine (they're not being updated or queued or anything).

I'm using the latest version of the controller manager: gcr.io/gke-release/cnrm/controller:826b049

It would be very helpful if there were a way to actually view the diffs being generated during reconciliation so we could troubleshoot; I didn't see these being logged anywhere.

I noted the caveats about not using Server-Side Apply for k8s. I'm using Helm to deploy these CRs, and Helm doesn't support SSA. However, I couldn't see any fields actually being set on the DataflowFlexTemplateJob CR, so I don't believe that is the issue.
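
For reference, one way to double-check which field managers are writing to the CR is to dump `metadata.managedFields` with client-go. This is just a sketch: the namespace and job name are placeholders, and the group/version/resource for DataflowFlexTemplateJob is my assumption.

```go
// Sketch: print each field manager on a DataflowFlexTemplateJob and the
// fields it owns, to confirm nothing besides the expected managers (Helm,
// the KCC controller) is setting fields on the object.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed GVR for the Config Connector DataflowFlexTemplateJob CRD.
	gvr := schema.GroupVersionResource{
		Group:    "dataflow.cnrm.cloud.google.com",
		Version:  "v1beta1",
		Resource: "dataflowflextemplatejobs",
	}

	// Placeholder namespace and resource name.
	job, err := dyn.Resource(gvr).Namespace("my-namespace").Get(context.Background(), "my-flex-job", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Each managedFields entry records a field manager and the fields it owns.
	for _, mf := range job.GetManagedFields() {
		fields, _ := json.Marshal(mf.FieldsV1)
		fmt.Printf("manager=%s operation=%s time=%v\nfields=%s\n\n", mf.Manager, mf.Operation, mf.Time, fields)
	}
}
```

Roughly the same information is available via `kubectl get dataflowflextemplatejob my-flex-job -o yaml --show-managed-fields`.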

@yuwenma yuwenma modified the milestones: 1.122, 1.126 Oct 25, 2024