Skip to content

Conversation

@nleigh
Copy link
Contributor

@nleigh nleigh commented Nov 20, 2025

Plan to add new features(#4162) but found the current flink paasta status code made it difficult to add/review new changes, refactored existing to be more modular.

Code Refactoring:

  • ✅ Extracted 10 modular, reusable functions to paasta_tools/flink_tools.py
  • ✅ Reduced main function from ~290 lines to ~90 lines
  • ✅ Separated data collection from presentation logic
    New Functions (Data Collection):
  • get_flink_instance_details() - Collects metadata, version, pool, team,
    runbook
  • collect_flink_job_details() - Collects job, pod, and resource information

New Functions (Formatting):

  • format_flink_instance_header() - Formats config SHA, version, URL
  • format_flink_instance_metadata() - Formats repo links, pool, owner, runbook
  • format_flink_config_links() - Formats yelpsoa/srv-configs links
  • format_flink_log_commands() - Formats paasta logs commands
  • format_flink_monitoring_links() - Formats Grafana/cost links
  • format_flink_state_and_pods() - Formats state, pods, jobs summary
  • format_flink_jobs_table() - Formats jobs table with proper columns
  • get_flink_job_name() - Helper to extract job name

Type Safety Improvements:

  • Added 4 TypedDicts for structured return types:
    • PodCounts - Pod count statistics
    • JobCounts - Job count statistics
    • FlinkJobDetailsDict - Collected job details
    • FlinkInstanceDetails - Instance metadata

Same output as existing (Tested 9th Jan after recent commits)

(py310-linux) nathanleigh@dev55-uswest1adevc:~/source/paasta (flink-status-refactor-modular) $ date
Fri Jan  9 08:16:50 AM PST 2026
(py310-linux) nathanleigh@dev55-uswest1adevc:~/source/paasta (flink-status-refactor-modular) $ paasta status -s sqlclient -i happyhour -c pnw-devc -v


sqlclient.happyhour in pnw-devc (EKS)
    Version:    3d559a95 (desired)
    Config SHA: confige958e818
    Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
    URL: http://flink.eks.pnw-devc.paasta:31080/sqlclient-845c8c7b6c/
    Repo(git): https://github.yelpcorp.com/services/sqlclient
    Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
    Flink Pool: flink-spot
    Owner: streaming-infrastructure
    Flink Runbook: y/rb-sqlclient-happyhour
    Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/sqlclient
    Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/devc/sqlclient
==================================================================
    Flink Log Commands:
      Service:     paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour
      Taskmanager: paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.TASKMANAGER
      Jobmanager:  paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.JOBMANAGER
      Supervisor:  paasta logs -a 1h -c pnw-devc -s sqlclient -i happyhour.SUPERVISOR
==================================================================
    Flink Monitoring:
      Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&var-job=All&from=now-24h&to=now
      Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&from=now-24h&to=now
      JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=happyhour&from=now-24h&to=now
      Flink Cost: https://app.cloudzero.com/explorer?activeCostType=invoiced_amortized_cost&partitions=costcontext%3AResource%20Summary&dateRange=Last%2030%20Days&costcontext%3AKube%20Paasta%20Cluster=pnw-devc&costcontext%3APaasta%20Instance=happyhour&costcontext%3APaasta%20Service=sqlclient&showRightFlyout=filters
==================================================================
    State: Running
    Pods: 3 running, 0 evicted, 0 other, 3 total
    Jobs: 0 running, 0 finished, 35 failed, 0 cancelled, 35 total
    1 taskmanagers, 1/1 slots available
    Jobs:
      Job Name  State       Started
      happyhour Failed 2026-01-09 08:12:19 (4 minutes ago)
    Pods:
      Pod Name                                           Host                                        Phase    Uptime
      sqlclient-845c8c7b6c-jobmanager-6847cc4ccd-w82f7   ip-10-81-18-177.us-west-2.compute.internal  Running  0d18h26m26s
      sqlclient-845c8c7b6c-supervisor-srzt6              ip-10-81-22-73.us-west-2.compute.internal   Running  0d2h25m5s
      sqlclient-845c8c7b6c-taskmanager-5d847fc8cf-s48wt  ip-10-81-16-42.us-west-2.compute.internal   Running  0d15h21m36s

@nleigh nleigh requested a review from a team as a code owner November 20, 2025 15:13
@nleigh nleigh marked this pull request as draft November 20, 2025 15:14
@nleigh nleigh changed the title Refactor Flink status to modular architecture with comprehensive tests Refactor Flink status to modular architecture Nov 20, 2025
nleigh and others added 4 commits November 20, 2025 07:18
Extracted data collection and formatting logic from the monolithic
_print_flink_status_from_job_manager() function into reusable,
testable helper functions in flink_tools.py.

Changes:
- Added collect_flink_job_details() to gather pod/job/resource info
- Added format_flink_state_and_pods() to format state and counts
- Added format_flink_jobs_table() to format jobs table
- Added get_flink_job_name() helper for job name extraction
- Refactored main function from ~290 lines to ~90 lines
- Removed unused imports in status.py (shutil, groupby, get_runbook, FlinkJobs)
- Maintained identical output and behavior

This refactor:
- Improves code maintainability and testability
- Separates data collection from presentation logic
- Makes functions reusable for future features
- Reduces cognitive complexity of main function

Related to FLINK-5725

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added 10 test classes with 30+ test methods covering all newly extracted
functions from the Flink status refactor:

**Data Collection Tests:**
- TestCollectFlinkJobDetails (3 tests)
  * Handles missing overview (stopped clusters)
  * Counts evicted pods correctly
  * Collects complete job/resource information

**Formatting Tests:**
- TestFormatFlinkStateAndPods (3 tests)
- TestGetFlinkJobName (3 tests)
- TestFormatFlinkJobsTable (5 tests)
- TestGetFlinkInstanceDetails (2 tests)
- TestFormatFlinkInstanceHeader (3 tests)
- TestFormatFlinkInstanceMetadata (1 test)
- TestFormatFlinkConfigLinks (1 test)
- TestFormatFlinkLogCommands (1 test)
- TestFormatFlinkMonitoringLinks (1 test)

Test Coverage:
- All new functions (collect_flink_job_details, format_flink_state_and_pods,
  get_flink_job_name, format_flink_jobs_table)
- All existing formatting functions from first refactor
  (get_flink_instance_details, format_flink_instance_header, etc.)
- Edge cases: stopped clusters, evicted pods, verbose modes, terminal sizing
- All tests pass successfully via tox
- All mocks use autospec=True per repo guidelines

Benefits:
- Prevents regressions in output formatting
- Documents expected behavior
- Makes future changes safer
- Easy to test individual functions in isolation

Related to FLINK-5725

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Improvements after self-review:

1. **Fix potential KeyError:**
   - Changed pod["reason"] to pod.get("reason") to safely handle
     Failed pods without a reason field
   - Added test case for this edge case

2. **Add TypedDicts for better type safety:**
   - PodCounts: Structured type for pod count statistics
   - JobCounts: Structured type for job count statistics
   - FlinkJobDetailsDict: Return type for collect_flink_job_details()
   - FlinkInstanceDetails: Return type for get_flink_instance_details()
   - Updated function signatures to use these types

Benefits:
- Better IDE autocomplete and type checking
- Prevents runtime errors from missing dictionary keys
- Makes function contracts more explicit
- Improves code documentation through types

All tests pass, all pre-commit checks pass.

Related to FLINK-5725

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@nleigh nleigh force-pushed the flink-status-refactor-modular branch from 6cd9a1d to e38a0d1 Compare November 20, 2025 15:22
nleigh and others added 2 commits November 20, 2025 07:37
Fixed type checking issues identified by mypy:

1. **TypedDict compatibility:**
   - Added explicit type annotations to pod_counts and job_counts
   - Changed pod_counts to: PodCounts = {...}
   - Changed job_counts to: Optional[JobCounts] = None

2. **Verbose parameter type:**
   - Changed format_flink_instance_header() parameter from bool to int
   - Matches the actual usage where verbose can be 0, 1, 2, etc.

3. **Merge conflict resolution:**
   - Updated format_flink_monitoring_links() to use CloudZero link
   - Replaced Splunk link with CloudZero (from master PR #4152)

All checks pass:
- ✅ mypy: Success: no issues found
- ✅ All unit tests pass
- ✅ All pre-commit checks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Updated test expectations to match the new modular output format:

1. **Fixed test helper function ordering:**
   - _get_flink_base_status_verbose_1() now outputs in correct order:
     Config SHA → Version → URL → Repos → Pool/Owner/Runbook
   - Matches format_flink_instance_header + format_flink_instance_metadata

2. **Added URL to stopping state tests:**
   - test_output_stopping_jobmanager
   - test_output_stopping_taskmanagers
   - URL is shown even for non-running states (from annotations)

3. **Fixed color code assertion:**
   - test_format_stopped_state_with_evictions
   - Check for "evicted" instead of "2 evicted" (handles ANSI colors)

4. **Fixed mock patch locations:**
   - Changed get_team/get_runbook patches from flink_tools to monitoring_tools
   - These functions are imported inside get_flink_instance_details

5. **Updated CloudZero link test:**
   - Changed from splunk.yelpcorp.com to app.cloudzero.com

All tests now pass (11 status tests + 36 flink_tools tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@nleigh nleigh marked this pull request as ready for review November 20, 2025 17:24
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty - i've been meaning to try to shrink status.py a bit and move the operator-specific logic out as much as possible :)

Comment on lines 684 to 689
state: str,
pod_counts: PodCounts,
job_counts: Optional[JobCounts],
taskmanagers: Optional[int],
slots_available: Optional[int],
slots_total: Optional[int],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, job_details is already a TypedDict - would we lose anything if we just passed job_details as a whole rather than each attribute individually?

Comment on lines 344 to 353
assert result["state"] == "stopped"
assert result["pod_counts"]["running"] == 2
assert result["pod_counts"]["evicted"] == 0
assert result["pod_counts"]["other"] == 0
assert result["pod_counts"]["total"] == 2
assert result["job_counts"] is None
assert result["taskmanagers"] is None
assert result["slots_available"] is None
assert result["slots_total"] is None
assert result["jobs"] == []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we construct an expected_result dict and compare against that rather than manually comparing each key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same for other tests below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated in 0ca5455

Comment on lines 366 to 373
overview = mock.Mock(autospec=True)
overview.jobs_running = 1
overview.jobs_finished = 2
overview.jobs_failed = 0
overview.jobs_cancelled = 1
overview.taskmanagers = 3
overview.slots_available = 5
overview.slots_total = 15
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overview is a typeddict, no? do we need a mock here? can't we just construct the expected dict?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same for other tests below)

overview.slots_available = 10
overview.slots_total = 25

mock_jobs = [mock.Mock(autospec=True), mock.Mock(autospec=True)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want autospec=True here - i think we want spec=$CLASS_OR_OBJECT or to use https://docs.python.org/3/library/unittest.mock.html#unittest.mock.create_autospec

@nleigh nleigh requested a review from Copilot November 21, 2025 12:26

This comment was marked as outdated.

nleigh and others added 12 commits November 21, 2025 10:05
Replace Any type annotations with concrete types:
- flink_config: Optional[Any] -> Optional[FlinkConfig]
- flink_instance_config: Any -> FlinkDeploymentConfig

This addresses review comment from PR #4165.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Move get_runbook and get_team imports from inside get_flink_instance_details
function to the top-level imports section, following Python best practices.

Update test mocks to patch the functions where they are used (flink_tools module)
rather than where they are defined (monitoring_tools module), which is required
when imports are at module level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Remove code that strips 'config' prefix from config_sha to make it clearer
that this is not a git SHA. Now displays 'config123456' instead of '123456'.

This addresses non-blocking review feedback from PR #4165.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Restore the validation check that ensures config_sha is present in the
Flink metadata labels. This prevents silently ignoring missing config_sha
which could indicate a serious configuration issue.

Previously the validation was in status.py but was removed during the
refactoring. Now it's properly placed in get_flink_instance_details()
where it validates the input early.

This addresses review feedback from PR #4165.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Consolidate two separate 'if status["state"] == "running"' blocks into
a single block that fetches flink_config, overview, and jobs together.
This improves code clarity by grouping all running-state API calls in
one place instead of having them split across the function.

Also update test expectations to reflect the 'config' prefix being kept
in config_sha display (from previous commit).

This addresses review feedback from PR #4165.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Address Feedback

class FlinkJobDetailsDict(TypedDict, total=False):

"just curious: why is this non-total? seems like we only have one function that returns this"
Address Feedback

  1. format_flink_instance_header - Changed parameter type from Mapping[str, Any]
   to FlinkInstanceDetails for better type safety.
  2. format_flink_instance_metadata - Changed parameter type from Mapping[str,
  Any] to FlinkInstanceDetails for better type safety.
  3. format_flink_state_and_pods - Simplified to take FlinkJobDetailsDict as a
  single parameter instead of 6 individual parameters. The function now extracts
  the values it needs internally.
  4. Call site in status.py - Simplified from passing 6 parameters to just
  passing job_details.
  5. Tests for TestFormatFlinkStateAndPods - Updated to construct
  FlinkJobDetailsDict dicts instead of passing individual parameters.
Address feedback
@nleigh nleigh requested a review from Copilot November 28, 2025 16:28

This comment was marked as outdated.

nleigh and others added 3 commits November 28, 2025 16:57
- Convert verbose int to bool for format_flink_instance_header call
- Fix extra space in taskmanagers output line

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@nleigh nleigh requested review from Copilot and nemacysts December 5, 2025 12:32

This comment was marked as outdated.

Restores the original comment explaining why we validate config_sha
early and raise ValueError if missing. This addresses PR review
feedback asking if the early exit behavior was preserved.
@Yelp Yelp deleted a comment from Copilot AI Jan 8, 2026
nleigh added 2 commits January 8, 2026 07:46
The function names format_flink_jobs_table and append_pod_status
are self-explanatory; the comments don't add value.
The parenthetical list of fields would get out of date as metadata
is added or changed.
@Yelp Yelp deleted a comment from Copilot AI Jan 8, 2026
Same reasoning as the previous commit - avoids staleness.
@Yelp Yelp deleted a comment from Copilot AI Jan 8, 2026
nleigh added 10 commits January 8, 2026 08:10
Compare against expected dict instead of asserting each key
individually, making tests more readable and maintainable.
Test that get_flink_instance_details raises ValueError when
config_sha label is missing from metadata, ensuring proper
error handling for corrupted/invalid Flink cluster states.
Test that collect_flink_job_details handles status dicts
without a pod_status key, returning zero counts instead of
crashing during cluster transitions.
Split the single try/except block into separate blocks for each
API call (config, overview, jobs). This provides more specific
error messages so users can identify which API call failed.
Extract phase via pod.get("phase") instead of direct key access
to prevent KeyError if a malformed pod entry is missing the
phase field.
The function is now only defined in flink_tools.py. Updated test
import to use the flink_tools version.
Return early with error message if flink metadata is missing,
preventing AttributeError when passing None to get_flink_instance_details.
@Yelp Yelp deleted a comment from Copilot AI Jan 9, 2026
@Yelp Yelp deleted a comment from Copilot AI Jan 9, 2026
@nleigh
Copy link
Contributor Author

nleigh commented Jan 9, 2026

@nemacysts Hey, thanks for all the review feedback 🙌
I have addressed your comments
This PR is still rather large so I don't mind splitting it up into smaller PR's if you prefer?

Example Split

  | PR  | Isolated?     | ~Size      | Review Focus          |
  |-----|---------------|------------|-----------------------|
  | 1   | ✅ Yes        | ~50 lines  | Type definitions only |
  | 2   | ❌ Sequential | ~200 lines | Metadata extraction   |
  | 3   | ❌ Sequential | ~150 lines | Links formatting      |
  | 4   | ❌ Sequential | ~300 lines | Job/pod status        |
  | 5   | ✅ Yes        | ~100 lines | Error handling        |

@nemacysts
Copy link
Member

@nleigh splitting this up would be nice - reviewing this much generated code is slightly painful (which is why i keep putting it off :p)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants