Checklist
If this issue is time-sensitive, I have submitted a corresponding issue with GCP support.
Bug Description
We had a report that a DataflowFlexTemplateJob can become detached from the underlying Dataflow job when a (streaming) job update takes more than 20 minutes.
Proposed approaches:
Increase the timeout (sort of a hack, but works around the problem)
Fix the terraform-based reconciler for this scenario
Replace the terraform-based reconciler with a direct reconciler that does not have this issue
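For context, a DataflowFlexTemplateJob manifest for a streaming job looks roughly like the following; the names, paths, and parameters are illustrative placeholders (not taken from the original report), and field names may vary slightly by Config Connector version:

```yaml
# Illustrative example only; all values are placeholders.
apiVersion: dataflow.cnrm.cloud.google.com/v1beta1
kind: DataflowFlexTemplateJob
metadata:
  name: streaming-example
spec:
  region: us-central1
  # GCS path to the flex template spec file (placeholder)
  containerSpecGcsPath: gs://example-bucket/templates/streaming-template.json
  parameters:
    inputSubscription: projects/example-project/subscriptions/example-sub
    outputTable: example-project:example_dataset.example_table
```

Updating a streaming job like this in place is the operation that can exceed the 20-minute window described above.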
Additional Diagnostic Information
This makes logical sense: the update does create a new job ID, and based on the code it is likely that we don't correctly record the new job ID when the update returns an error or times out.
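To make the failure mode concrete, here is a hypothetical sketch (not captured from a real cluster, and field names are approximate) of what the detachment looks like on the resource: the CR keeps reporting the pre-update job ID, while Dataflow is actually running the replacement job under a new ID that was never recorded.

```yaml
# Hypothetical illustration; not real cluster output.
status:
  # Pre-update job ID, never refreshed because the streaming update
  # exceeded the ~20 minute timeout before the new ID could be recorded.
  jobId: "2024-01-01_00_00_00-1111111111111111111"
  # The old job reports that it was replaced by the update; the replacement
  # job runs under a new ID that the controller is no longer tracking.
  state: JOB_STATE_UPDATED
```

From that point on, reconciliation operates against a job that no longer exists in its original form, which would explain the repeated recreation reported in the comment below.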
Kubernetes Cluster Version
n/a
Config Connector Version
n/a
Config Connector Mode
namespaced mode (default)
Log Output
No response
Steps to reproduce the issue
n/a
YAML snippets
No response
I have what I believe is the same issue: Dataflow jobs are being recreated roughly every twenty minutes. The jobs being replaced are running fine (they are not being updated, queued, or anything else).
It would be very helpful if there were a way to view the diffs generated during reconciliation so we could troubleshoot; I didn't see them logged anywhere.
I noted the caveats about not using Server-Side Apply for Kubernetes. I'm deploying these CRs with Helm, which doesn't support SSA. However, I couldn't see any fields actually being set on the DataflowFlexTemplateJob CR, so I don't believe that is the issue.