Refactor Flink status to modular architecture #4165

nleigh · 2025-11-20T15:13:31Z

Plan to add new features(#4162) but found the current flink paasta status code made it difficult to add/review new changes, refactored existing to be more modular.

Code Refactoring:

✅ Extracted 10 modular, reusable functions to paasta_tools/flink_tools.py
✅ Reduced main function from ~290 lines to ~90 lines
✅ Separated data collection from presentation logic
New Functions (Data Collection):
get_flink_instance_details() - Collects metadata, version, pool, team,
runbook
collect_flink_job_details() - Collects job, pod, and resource information

New Functions (Formatting):

format_flink_instance_header() - Formats config SHA, version, URL
format_flink_instance_metadata() - Formats repo links, pool, owner, runbook
format_flink_config_links() - Formats yelpsoa/srv-configs links
format_flink_log_commands() - Formats paasta logs commands
format_flink_monitoring_links() - Formats Grafana/cost links
format_flink_state_and_pods() - Formats state, pods, jobs summary
format_flink_jobs_table() - Formats jobs table with proper columns
get_flink_job_name() - Helper to extract job name

Type Safety Improvements:

Added 4 TypedDicts for structured return types:
- PodCounts - Pod count statistics
- JobCounts - Job count statistics
- FlinkJobDetailsDict - Collected job details
- FlinkInstanceDetails - Instance metadata

Same output as existing (Tested 9th Jan after recent commits)

(py310-linux) nathanleigh@dev55-uswest1adevc:~/source/paasta (flink-status-refactor-modular) $ date
Fri Jan  9 08:16:50 AM PST 2026
(py310-linux) nathanleigh@dev55-uswest1adevc:~/source/paasta (flink-status-refactor-modular) $ paasta status -s sqlclient -i happyhour -c pnw-devc -v


sqlclient.happyhour in pnw-devc (EKS)
    Version:    3d559a95 (desired)
    Config SHA: confige958e818
    Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
    URL: http://flink.eks.pnw-devc.paasta:31080/sqlclient-845c8c7b6c/
    Repo(git): https://github.yelpcorp.com/services/sqlclient
    Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
    Flink Pool: flink-spot
    Owner: streaming-infrastructure
    Flink Runbook: y/rb-sqlclient-happyhour
    Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/sqlclient
    Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/devc/sqlclient
==================================================================
    Flink Log Commands:
      Service:     paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour
      Taskmanager: paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.TASKMANAGER
      Jobmanager:  paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.JOBMANAGER
      Supervisor:  paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.SUPERVISOR
==================================================================
    Flink Monitoring:
      Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&var-job=All&from=now-24h&to=now
      Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&from=now-24h&to=now
      JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&from=now-24h&to=now
      Flink Cost: https://app.cloudzero.com/explorer?activeCostType=invoiced_amortized_cost&partitions=costcontext%3AResource%20Summary&dateRange=Last%2030%20Days&costcontext%3AKube%20Paasta%20Cluster=pnw-devc&costcontext%3APaasta%20Instance=happyhour&costcontext%3APaasta%20Service=sqlclient&showRightFlyout=filters
==================================================================
    State: Running
    Pods: 3 running, 0 evicted, 0 other, 3 total
    Jobs: 0 running, 0 finished, 35 failed, 0 cancelled, 35 total
    1 taskmanagers, 1/1 slots available
    Jobs:
      Job Name  State       Started
      happyhour Failed 2026-01-09 08:12:19 (4 minutes ago)
    Pods:
      Pod Name                                           Host                                        Phase    Uptime
      sqlclient-845c8c7b6c-jobmanager-6847cc4ccd-w82f7   ip-10-81-18-177.us-west-2.compute.internal  Running  0d18h26m26s
      sqlclient-845c8c7b6c-supervisor-srzt6              ip-10-81-22-73.us-west-2.compute.internal   Running  0d2h25m5s
      sqlclient-845c8c7b6c-taskmanager-5d847fc8cf-s48wt  ip-10-81-16-42.us-west-2.compute.internal   Running  0d15h21m36s

Extracted data collection and formatting logic from the monolithic _print_flink_status_from_job_manager() function into reusable, testable helper functions in flink_tools.py. Changes: - Added collect_flink_job_details() to gather pod/job/resource info - Added format_flink_state_and_pods() to format state and counts - Added format_flink_jobs_table() to format jobs table - Added get_flink_job_name() helper for job name extraction - Refactored main function from ~290 lines to ~90 lines - Removed unused imports in status.py (shutil, groupby, get_runbook, FlinkJobs) - Maintained identical output and behavior This refactor: - Improves code maintainability and testability - Separates data collection from presentation logic - Makes functions reusable for future features - Reduces cognitive complexity of main function Related to FLINK-5725 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Added 10 test classes with 30+ test methods covering all newly extracted functions from the Flink status refactor: **Data Collection Tests:** - TestCollectFlinkJobDetails (3 tests) * Handles missing overview (stopped clusters) * Counts evicted pods correctly * Collects complete job/resource information **Formatting Tests:** - TestFormatFlinkStateAndPods (3 tests) - TestGetFlinkJobName (3 tests) - TestFormatFlinkJobsTable (5 tests) - TestGetFlinkInstanceDetails (2 tests) - TestFormatFlinkInstanceHeader (3 tests) - TestFormatFlinkInstanceMetadata (1 test) - TestFormatFlinkConfigLinks (1 test) - TestFormatFlinkLogCommands (1 test) - TestFormatFlinkMonitoringLinks (1 test) Test Coverage: - All new functions (collect_flink_job_details, format_flink_state_and_pods, get_flink_job_name, format_flink_jobs_table) - All existing formatting functions from first refactor (get_flink_instance_details, format_flink_instance_header, etc.) - Edge cases: stopped clusters, evicted pods, verbose modes, terminal sizing - All tests pass successfully via tox - All mocks use autospec=True per repo guidelines Benefits: - Prevents regressions in output formatting - Documents expected behavior - Makes future changes safer - Easy to test individual functions in isolation Related to FLINK-5725 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Improvements after self-review: 1. **Fix potential KeyError:** - Changed pod["reason"] to pod.get("reason") to safely handle Failed pods without a reason field - Added test case for this edge case 2. **Add TypedDicts for better type safety:** - PodCounts: Structured type for pod count statistics - JobCounts: Structured type for job count statistics - FlinkJobDetailsDict: Return type for collect_flink_job_details() - FlinkInstanceDetails: Return type for get_flink_instance_details() - Updated function signatures to use these types Benefits: - Better IDE autocomplete and type checking - Prevents runtime errors from missing dictionary keys - Makes function contracts more explicit - Improves code documentation through types All tests pass, all pre-commit checks pass. Related to FLINK-5725 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Fixed type checking issues identified by mypy: 1. **TypedDict compatibility:** - Added explicit type annotations to pod_counts and job_counts - Changed pod_counts to: PodCounts = {...} - Changed job_counts to: Optional[JobCounts] = None 2. **Verbose parameter type:** - Changed format_flink_instance_header() parameter from bool to int - Matches the actual usage where verbose can be 0, 1, 2, etc. 3. **Merge conflict resolution:** - Updated format_flink_monitoring_links() to use CloudZero link - Replaced Splunk link with CloudZero (from master PR #4152) All checks pass: - ✅ mypy: Success: no issues found - ✅ All unit tests pass - ✅ All pre-commit checks pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Updated test expectations to match the new modular output format: 1. **Fixed test helper function ordering:** - _get_flink_base_status_verbose_1() now outputs in correct order: Config SHA → Version → URL → Repos → Pool/Owner/Runbook - Matches format_flink_instance_header + format_flink_instance_metadata 2. **Added URL to stopping state tests:** - test_output_stopping_jobmanager - test_output_stopping_taskmanagers - URL is shown even for non-running states (from annotations) 3. **Fixed color code assertion:** - test_format_stopped_state_with_evictions - Check for "evicted" instead of "2 evicted" (handles ANSI colors) 4. **Fixed mock patch locations:** - Changed get_team/get_runbook patches from flink_tools to monitoring_tools - These functions are imported inside get_flink_instance_details 5. **Updated CloudZero link test:** - Changed from splunk.yelpcorp.com to app.cloudzero.com All tests now pass (11 status tests + 36 flink_tools tests) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

nemacysts

ty - i've been meaning to try to shrink status.py a bit and move the operator-specific logic out as much as possible :)

paasta_tools/flink_tools.py

paasta_tools/cli/cmds/status.py

nemacysts · 2025-11-20T19:47:33Z

paasta_tools/flink_tools.py

+    state: str,
+    pod_counts: PodCounts,
+    job_counts: Optional[JobCounts],
+    taskmanagers: Optional[int],
+    slots_available: Optional[int],
+    slots_total: Optional[int],


hmm, job_details is already a TypedDict - would we lose anything if we just passed job_details as a whole rather than each attribute individually?

tests/test_flink_tools.py

nemacysts · 2025-11-20T19:49:09Z

tests/test_flink_tools.py

+        assert result["state"] == "stopped"
+        assert result["pod_counts"]["running"] == 2
+        assert result["pod_counts"]["evicted"] == 0
+        assert result["pod_counts"]["other"] == 0
+        assert result["pod_counts"]["total"] == 2
+        assert result["job_counts"] is None
+        assert result["taskmanagers"] is None
+        assert result["slots_available"] is None
+        assert result["slots_total"] is None
+        assert result["jobs"] == []


should we construct an expected_result dict and compare against that rather than manually comparing each key?

(same for other tests below)

Thanks, updated in 0ca5455

nemacysts · 2025-11-20T19:51:24Z

tests/test_flink_tools.py

+        overview = mock.Mock(autospec=True)
+        overview.jobs_running = 1
+        overview.jobs_finished = 2
+        overview.jobs_failed = 0
+        overview.jobs_cancelled = 1
+        overview.taskmanagers = 3
+        overview.slots_available = 5
+        overview.slots_total = 15


overview is a typeddict, no? do we need a mock here? can't we just construct the expected dict?

(same for other tests below)

nemacysts · 2025-11-20T19:57:02Z

tests/test_flink_tools.py

+        overview.slots_available = 10
+        overview.slots_total = 25
+
+        mock_jobs = [mock.Mock(autospec=True), mock.Mock(autospec=True)]


I don't think we want autospec=True here - i think we want spec=$CLASS_OR_OBJECT or to use https://docs.python.org/3/library/unittest.mock.html#unittest.mock.create_autospec

Replace Any type annotations with concrete types: - flink_config: Optional[Any] -> Optional[FlinkConfig] - flink_instance_config: Any -> FlinkDeploymentConfig This addresses review comment from PR #4165. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Move get_runbook and get_team imports from inside get_flink_instance_details function to the top-level imports section, following Python best practices. Update test mocks to patch the functions where they are used (flink_tools module) rather than where they are defined (monitoring_tools module), which is required when imports are at module level. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Remove code that strips 'config' prefix from config_sha to make it clearer that this is not a git SHA. Now displays 'config123456' instead of '123456'. This addresses non-blocking review feedback from PR #4165. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Restore the validation check that ensures config_sha is present in the Flink metadata labels. This prevents silently ignoring missing config_sha which could indicate a serious configuration issue. Previously the validation was in status.py but was removed during the refactoring. Now it's properly placed in get_flink_instance_details() where it validates the input early. This addresses review feedback from PR #4165. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Consolidate two separate 'if status["state"] == "running"' blocks into a single block that fetches flink_config, overview, and jobs together. This improves code clarity by grouping all running-state API calls in one place instead of having them split across the function. Also update test expectations to reflect the 'config' prefix being kept in config_sha display (from previous commit). This addresses review feedback from PR #4165. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Address Feedback

Address Feedback class FlinkJobDetailsDict(TypedDict, total=False): "just curious: why is this non-total? seems like we only have one function that returns this"

Address Feedback 1. format_flink_instance_header - Changed parameter type from Mapping[str, Any] to FlinkInstanceDetails for better type safety. 2. format_flink_instance_metadata - Changed parameter type from Mapping[str, Any] to FlinkInstanceDetails for better type safety. 3. format_flink_state_and_pods - Simplified to take FlinkJobDetailsDict as a single parameter instead of 6 individual parameters. The function now extracts the values it needs internally. 4. Call site in status.py - Simplified from passing 6 parameters to just passing job_details. 5. Tests for TestFormatFlinkStateAndPods - Updated to construct FlinkJobDetailsDict dicts instead of passing individual parameters.

Address feedback

Co-authored-by: Copilot <[email protected]>

- Convert verbose int to bool for format_flink_instance_header call - Fix extra space in taskmanagers output line 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Restores the original comment explaining why we validate config_sha early and raise ValueError if missing. This addresses PR review feedback asking if the early exit behavior was preserved.

The function names format_flink_jobs_table and append_pod_status are self-explanatory; the comments don't add value.

The parenthetical list of fields would get out of date as metadata is added or changed.

Same reasoning as the previous commit - avoids staleness.

Compare against expected dict instead of asserting each key individually, making tests more readable and maintainable.

Test that get_flink_instance_details raises ValueError when config_sha label is missing from metadata, ensuring proper error handling for corrupted/invalid Flink cluster states.

Test that collect_flink_job_details handles status dicts without a pod_status key, returning zero counts instead of crashing during cluster transitions.

Split the single try/except block into separate blocks for each API call (config, overview, jobs). This provides more specific error messages so users can identify which API call failed.

Extract phase via pod.get("phase") instead of direct key access to prevent KeyError if a malformed pod entry is missing the phase field.

The function is now only defined in flink_tools.py. Updated test import to use the flink_tools version.

Return early with error message if flink metadata is missing, preventing AttributeError when passing None to get_flink_instance_details.

nleigh · 2026-01-09T16:30:26Z

@nemacysts Hey, thanks for all the review feedback 🙌
I have addressed your comments
This PR is still rather large so I don't mind splitting it up into smaller PR's if you prefer?

Example Split

  | PR  | Isolated?     | ~Size      | Review Focus          |
  |-----|---------------|------------|-----------------------|
  | 1   | ✅ Yes        | ~50 lines  | Type definitions only |
  | 2   | ❌ Sequential | ~200 lines | Metadata extraction   |
  | 3   | ❌ Sequential | ~150 lines | Links formatting      |
  | 4   | ❌ Sequential | ~300 lines | Job/pod status        |
  | 5   | ✅ Yes        | ~100 lines | Error handling        |

nemacysts · 2026-01-09T17:17:28Z

@nleigh splitting this up would be nice - reviewing this much generated code is slightly painful (which is why i keep putting it off :p)

nleigh requested a review from a team as a code owner November 20, 2025 15:13

nleigh marked this pull request as draft November 20, 2025 15:14

nleigh changed the title ~~Refactor Flink status to modular architecture with comprehensive tests~~ Refactor Flink status to modular architecture Nov 20, 2025

nleigh and others added 4 commits November 20, 2025 07:18

WIP: Add modular helper functions for Flink status (part 1 of refactor)

0e59f4f

nleigh force-pushed the flink-status-refactor-modular branch from 6cd9a1d to e38a0d1 Compare November 20, 2025 15:22

nleigh and others added 2 commits November 20, 2025 07:37

nleigh marked this pull request as ready for review November 20, 2025 17:24

nemacysts reviewed Nov 20, 2025

View reviewed changes

nleigh requested a review from Copilot November 21, 2025 12:26

This comment was marked as outdated.

Sign in to view

nleigh and others added 12 commits November 21, 2025 10:05

Refactor get config

e4f1e86

Update/remove comments

8720b91

FLINK: Remove/Address redundant comments

7eb1dfc

Address Feedback

FLINK: Remove redundant arg

bbc94e1

Address Feedback class FlinkJobDetailsDict(TypedDict, total=False): "just curious: why is this non-total? seems like we only have one function that returns this"

FLINK: Update comments

53b635a

FLINK: Refactor tests

6b330af

Address feedback

nleigh requested a review from Copilot November 28, 2025 16:28

This comment was marked as outdated.

Sign in to view

nleigh and others added 2 commits November 28, 2025 16:56

Update paasta_tools/flink_tools.py

92c66e2

Co-authored-by: Copilot <[email protected]>

Update paasta_tools/cli/cmds/status.py

377cfb2

Co-authored-by: Copilot <[email protected]>

nleigh and others added 3 commits November 28, 2025 16:57

Merge branch 'master' into flink-status-refactor-modular

d43c162

FLINK: fix formatting

d5ab7d7

Fix type error and whitespace in flink status output

cec9861

- Convert verbose int to bool for format_flink_instance_header call - Fix extra space in taskmanagers output line 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

nleigh requested review from Copilot and nemacysts December 5, 2025 12:32

This comment was marked as outdated.

Sign in to view

Add explanatory comment for config_sha validation

ffab278

Restores the original comment explaining why we validate config_sha early and raise ValueError if missing. This addresses PR review feedback asking if the early exit behavior was preserved.

Yelp deleted a comment from Copilot AI Jan 8, 2026

nleigh added 2 commits January 8, 2026 07:46

Remove redundant comments in flink status

bb1abd9

The function names format_flink_jobs_table and append_pod_status are self-explanatory; the comments don't add value.

Remove specific fields from docstring to avoid staleness

31ca030

The parenthetical list of fields would get out of date as metadata is added or changed.

Yelp deleted a comment from Copilot AI Jan 8, 2026

Remove specific fields from format_flink_instance_metadata docstring

aafb916

Same reasoning as the previous commit - avoids staleness.

Yelp deleted a comment from Copilot AI Jan 8, 2026

nleigh added 10 commits January 8, 2026 08:10

Refactor TestCollectFlinkJobDetails to use expected dicts

0ca5455

Compare against expected dict instead of asserting each key individually, making tests more readable and maintainable.

Add test for missing config_sha ValueError

12c4e04

Test that get_flink_instance_details raises ValueError when config_sha label is missing from metadata, ensuring proper error handling for corrupted/invalid Flink cluster states.

Add test for missing pod_status key

e27fe71

Test that collect_flink_job_details handles status dicts without a pod_status key, returning zero counts instead of crashing during cluster transitions.

Make Flink API error handling more granular

344681c

Split the single try/except block into separate blocks for each API call (config, overview, jobs). This provides more specific error messages so users can identify which API call failed.

Use defensive .get() for pod phase access

ade3721

Extract phase via pod.get("phase") instead of direct key access to prevent KeyError if a malformed pod entry is missing the phase field.

Remove duplicate get_flink_job_name from status.py

ffca785

The function is now only defined in flink_tools.py. Updated test import to use the flink_tools version.

Add validation for metadata being None

6770f89

Return early with error message if flink metadata is missing, preventing AttributeError when passing None to get_flink_instance_details.

Add tests for FAILED/FAILING job states

44db0e4

Remove unused instance param from format_flink_config_links

41fb493

Fix critical issues: catch ValueError and use safe .get() for pod_status

c86122c

Yelp deleted a comment from Copilot AI Jan 9, 2026

Merge branch 'master' into flink-status-refactor-modular

8efb500

Refactor Flink status to modular architecture #4165

Are you sure you want to change the base?

Refactor Flink status to modular architecture #4165

Conversation

nleigh commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nemacysts left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

nleigh Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

nemacysts Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

nleigh commented Jan 9, 2026

Uh oh!

nemacysts commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nleigh commented Nov 20, 2025 •

edited

Loading