[feat][improvement] new arch/implementation for scaling existing nodepools#63
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@           Coverage Diff            @@
##             main      #63      +/-   ##
==========================================
- Coverage   74.19%   66.32%    -7.88%
==========================================
  Files          37       38        +1
  Lines        2279     2634      +355
==========================================
+ Hits         1691     1747       +56
- Misses        452      563      +111
- Partials      136      324      +188
Pull request overview
This PR refactors the LKE provider to support concurrency-safe multi-node pool scaling with strict 1:1 NodeClaim→Linode mapping. Key changes include replacing pool-level tags with instance-level tags for claim ownership, implementing pool-scoped locking, adding retryable error types, and expanding test coverage for both Standard and Enterprise LKE tiers.
Changes:
- Replaces pool-level `NodeClaim` tags with instance-level claim tags using the `UpdateInstance` API (a rough sketch of this tagging follows the list)
- Introduces a keyed mutex for pool-scoped concurrency control around claim/scale operations
- Adds explicit retryable error sentinels (`ErrNoClaimableInstance`, `ErrClaimFailed`, etc.) and bounded retry logic
- Expands the fake Linode API to support tag filtering, `UpdateInstance`, and `DeleteLKENodePoolNode`
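For illustration, a minimal sketch of what instance-level claim tagging could look like against a linodego-style client. The function name `claimInstance` and the claim-tag value are assumptions, not the PR's actual identifiers.

```go
package main

import (
	"context"
	"fmt"

	"github.com/linode/linodego"
)

// claimInstance records NodeClaim ownership on the instance itself by adding a
// claim tag via UpdateInstance, instead of tagging the whole node pool.
func claimInstance(ctx context.Context, client *linodego.Client, instanceID int, claimTag string) error {
	inst, err := client.GetInstance(ctx, instanceID)
	if err != nil {
		return fmt.Errorf("getting instance %d: %w", instanceID, err)
	}
	tags := append(inst.Tags, claimTag) // keep existing tags, add the claim marker
	_, err = client.UpdateInstance(ctx, instanceID, linodego.InstanceUpdateOptions{Tags: &tags})
	return err
}
```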
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/providers/lke/lke.go | Core LKE provider refactor: claim-or-scale flow, instance-level tagging, pool locking, tier-specific logic |
| pkg/utils/keyedmutex.go | New keyed mutex for pool-scoped synchronization |
| pkg/utils/utils.go | Removed custom Filter type, added GetInstanceTagsForLKE, GetTagValue, switched to linodego.Filter |
| pkg/fake/linodeapi.go | Enhanced fake API: tag filtering, UpdateInstance, DeleteLKENodePoolNode support |
| pkg/providers/lke/suite_test.go | Expanded LKE tests for Standard/Enterprise tiers: idempotency, claiming, scaling, error paths |
| pkg/controllers/nodeclaim/garbagecollection/suite_test.go | Added LKE-specific GC tests for both tiers |
| pkg/providers/instance/types.go | Removed pool-specific fields from Instance type |
| pkg/linode/sdk.go | Extended LinodeAPI interface with UpdateInstance, DeleteLKENodePoolNode |
| pkg/operator/operator.go | Updated to pass cluster name to LKE provider and use linodego.Filter |
| docs/lke-nodepool-scaleup.md | New design doc detailing architecture, invariants, flow, and API call analysis |
Comments suppressed due to low confidence (1)
pkg/providers/lke/suite_test.go:1
- Corrected spelling of 'malformatedtag' to 'malformedtag'.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
pkg/providers/lke/suite_test.go:1
- Corrected field name from 'ClusterName' to 'ClusterID' to match the actual parameter type.
// TODO: If the cardinality of keys grows beyond these assumptions, consider
// adding a mechanism for periodic or usage-based cleanup of unused lock entries.
type KeyedMutex struct {
	mu sync.Mutex
The field name 'mu' is ambiguous. It should be renamed to 'mapMutex' or 'lockMapMutex' to clearly indicate that it protects access to the locks map, not the keyed operations themselves.
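To make the suggestion concrete, here is a minimal sketch of the rename, assuming the keyed mutex follows the common lazily-populated-map pattern (this is not the PR's exact implementation):

```go
package utils

import "sync"

// KeyedMutex serializes operations per key (here: per node pool) while letting
// operations on different keys proceed in parallel.
type KeyedMutex struct {
	mapMutex sync.Mutex             // guards the locks map itself, not the keyed operations
	locks    map[string]*sync.Mutex // one mutex per key, created lazily
}

// Lock acquires the mutex for key, creating it on first use.
func (k *KeyedMutex) Lock(key string) {
	k.mapMutex.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mapMutex.Unlock()
	m.Lock()
}

// Unlock releases the mutex for key; the key must have been locked first.
func (k *KeyedMutex) Unlock(key string) {
	k.mapMutex.Lock()
	m := k.locks[key]
	k.mapMutex.Unlock()
	m.Unlock()
}
```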
// extractTagsFromFilter parses the linodego X-Filter JSON carried in ListOptions.Filter.
//
// This fake only implements the "tags" shapes used by this repo:
//   - {"tags":"tag-a,tag-b"}          => exact tag match (each comma-separated value must equal a tag string)
//   - {"tags":{"+contains":"substr"}} => contains match (an instance matches if any tag contains substr)
//
// If the filter is missing/unparseable or uses unsupported operators, it is treated as no tag filtering.
// NOTE: This means tests that use other filter operators will silently pass without any filtering
// being applied, which can lead to false positives. Only the shapes documented above are enforced.
The warning about silent failures is important for test maintainability. Consider adding a mechanism to explicitly list which filter operators are supported and log/panic when unsupported operators are encountered, rather than silently ignoring them. This would catch test bugs earlier.
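One way the fake could fail loudly on unsupported operators, as suggested above. This is a sketch, not the repo's code; the helper name `parseTagFilter` and its shape are assumptions.

```go
package fake

import (
	"fmt"
	"strings"
)

// supportedTagOperators lists the filter operators the fake understands.
var supportedTagOperators = map[string]bool{"+contains": true}

// parseTagFilter returns either an exact-match tag list or a contains substring.
// It panics on operators the fake does not implement, so an offending test
// fails immediately instead of silently skipping the filter.
func parseTagFilter(filter map[string]any) (exact []string, contains string) {
	switch v := filter["tags"].(type) {
	case string:
		return strings.Split(v, ","), "" // {"tags":"tag-a,tag-b"} shape
	case map[string]any:
		for op, val := range v {
			if !supportedTagOperators[op] {
				panic(fmt.Sprintf("fake linode API: unsupported tag filter operator %q", op))
			}
			contains, _ = val.(string) // {"tags":{"+contains":"substr"}} shape
		}
		return nil, contains
	default:
		return nil, "" // no tag filter present
	}
}
```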
if *createdPool || *scaledOnce {
	return nil, ErrNoClaimableInstance
}
The logic uses pointer flags to track pool creation and scaling state across retry attempts. Consider using a dedicated struct type (e.g., 'claimAttemptState') to encapsulate these flags, which would make the state management more explicit and easier to extend in the future.
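A sketch of the suggested encapsulation; the type, field, and method names are illustrative, not part of the PR:

```go
// claimAttemptState groups the flags a single claim-or-scale attempt carries
// across retries, so the state is explicit and easy to extend.
type claimAttemptState struct {
	createdPool bool // a new pool was created during this attempt
	scaledOnce  bool // the pool was already scaled up once during this attempt
}

// canScale reports whether this attempt may still trigger a scale-up.
func (s *claimAttemptState) canScale() bool {
	return !s.createdPool && !s.scaledOnce
}
```

The guard above would then read `if !state.canScale() { return nil, ErrNoClaimableInstance }`.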
return nil, cloudprovider.NewCreateError(fmt.Errorf("waiting for instance ID on pool %d", pool.ID), "NodePoolProvisioning", "Waiting for LKE instance to be ready")

if nodesProvisioning {
	return nil, ErrNodesProvisioning
The function returns ErrNodesProvisioning after detecting that nodes are still being created, but this occurs inside a loop that continues checking remaining nodes. Consider extracting this deferred error return logic into a separate variable that's checked after the loop completes, to make the control flow more explicit.
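A sketch of the reviewer's suggestion: record the provisioning condition while scanning, and only surface it after every node has been examined. The receiver and the `isClaimable` helper are hypothetical names used for illustration.

```go
// pickClaimableNode scans every node in the pool before surfacing the deferred
// "still provisioning" condition, so a claimable instance elsewhere in the pool wins.
func (p *DefaultProvider) pickClaimableNode(pool *linodego.LKENodePool) (int, error) {
	nodesProvisioning := false
	for _, node := range pool.Linodes {
		if node.InstanceID == 0 {
			nodesProvisioning = true // no Linode attached yet; keep scanning the rest
			continue
		}
		if p.isClaimable(node.InstanceID) {
			return node.InstanceID, nil
		}
	}
	// Only after the whole pool has been examined do we report a retryable state.
	if nodesProvisioning {
		return 0, ErrNodesProvisioning
	}
	return 0, ErrNoClaimableInstance
}
```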
Add design document describing the core model, invariants, and flow for scaling existing node pools in LKE mode. Covers the 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs. standard), the concurrency model with a keyed mutex, the claim-or-scale loop with a 15s timeout, and the error-handling policy.
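In rough outline, the claim-or-scale loop described in the design doc could look like the sketch below. The helper names (`tryClaim`, `isRetryableCreateError`) and the `createDeadline`/`retryDelay` fields are assumptions; only the overall try-claim / retry-until-deadline shape comes from the document.

```go
// claimOrScale retries claim attempts until either an instance is claimed or
// the create deadline expires; only errors marked retryable restart the loop.
func (p *DefaultProvider) claimOrScale(ctx context.Context, pool *linodego.LKENodePool) (*linodego.Instance, error) {
	ctx, cancel := context.WithTimeout(ctx, p.createDeadline)
	defer cancel()
	for {
		inst, err := p.tryClaim(ctx, pool)
		if err == nil {
			return inst, nil
		}
		if !isRetryableCreateError(err) {
			return nil, err // fatal: surface immediately
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("claim-or-scale deadline exceeded: %w", err)
		case <-time.After(p.retryDelay):
			// bounded back-off before the next attempt
		}
	}
}
```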
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
This commit introduces comprehensive LKE NodePool scale-up functionality with proper concurrency control and extensive test coverage.

Major changes:
- Add LKE NodePool scale-up capability to support existing NodePool expansion
- Implement KeyedMutex utility for per-key synchronization to prevent race conditions
- Enhance LKE provider with cluster name tracking and operation timeouts
- Add comprehensive test suites for garbage collection and LKE operations
Remove the `v1` import alias for the `v1alpha1` package throughout the codebase. Use the full `v1alpha1` package name directly to avoid confusion with the Karpenter v1 API. Update all references in cloudprovider, controllers, drift detection, and utility packages. Also fix incorrect finalizer constant reference from `v1.TerminationFinalizer` to `karpv1.TerminationFinalizer` in nodeclass controller.
Add explicit `ErrNodesProvisioning` error to handle nodes with zero InstanceID in standard tier. Check all nodes before returning error to prioritize finding claimable instances. Add 500ms sleep and retry when nodes are still provisioning. Remove redundant type declaration in fake client.
Extract LKE pool lookup into `findLKENodePoolFromLinodeInstanceID` helper to reduce duplication in Delete method. Remove redundant instance fetching, tag parsing, and pool mutex locking from Delete - all now handled by the helper. Use `findNodeInPool` helper for node lookup instead of inline loop. Update simple example to add 30s consolidation period. Reduce default Karpenter replicas from 2 to 1 in chart values.
…tag in LKE mode Replace custom `lke-pool-id` tag with Linode's automatic `nodepool=<id>` tag for pool identification in enterprise tier. Use `LKEClusterID` field on instances for cluster association instead of tags. Update instance listing to filter by cluster ID field. Simplify pool resolution logic by using native Linode tags. Update fake client to set `LKEClusterID` on created instances. Remove pool ID tag from instance claim flow. Fixed major issues with LKE enterprise creation flow (no more duplicate nodes being created due to API timing). Also fixed List() behaviour that was throwing a lot of errors.
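A sketch of how pool resolution could read everything off the instance itself, using the `LKEClusterID` field and the automatic `nodepool=<id>` tag mentioned in this commit. The helper name is illustrative, not the PR's code.

```go
// poolIDForInstance reports whether an instance belongs to the managed LKE
// cluster and, if so, which pool it is in, without any custom pool-level tags.
func poolIDForInstance(inst *linodego.Instance, clusterID int) (int, bool) {
	if inst.LKEClusterID != clusterID {
		return 0, false // not part of the cluster we manage
	}
	for _, tag := range inst.Tags {
		// Linode applies a "nodepool=<id>" tag to enterprise-tier pool members.
		if rest, ok := strings.CutPrefix(tag, "nodepool="); ok {
			if poolID, err := strconv.Atoi(rest); err == nil {
				return poolID, true
			}
		}
	}
	return 0, false
}
```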
- Instance model cleanup: Remove NodeID, Labels, Taints, PoolID from the Instance struct; simplify the NewLKEInstance constructor. Reduces per-instance memory and eliminates unused pool-derived fields. These fields were not being used at all by the instanceToNodeClaim() consumer.
- API call reduction: Eliminate pool lookups in Get/List/hydrateInstanceFromLinode; use direct instance data. Cuts ListInstances + GetLKENodePool calls per operation. Now we make just one API call.
- Enterprise test parity: Add comprehensive tests for enterprise-tier garbage collection with proper tag-based discovery and provider setup. Added the same comprehensive parity test cases to the LKE provider tests.
- Test suite restructuring: Reorganize LKE suite tests under explicit Standard/Enterprise tier contexts with consistent subcontexts (Create, Idempotency, Multi-node scaling, Tag verification, Error paths, Create options, Get, List, Delete, CreateTags). Remove duplicate top-level Create options and pool filtering tests.
- Documentation: Update the LKE existing-nodepool scale-up docs with clarified tier behaviours, API call budgets, and scalability notes (more info on the potential rate limits we can hit).
…helper extraction. Reduced the linter complexity. Extract instance type resolution, existing instance lookup, and pool locking into separate helper methods. Introduce explicit error types (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed) to distinguish retry-able conditions from fatal errors. Replace inline pool mutex lock/unlock with withPoolLock helper for cleaner resource management. Simplify attemptCreate by removing nodeID return value and using instance
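For illustration, a `withPoolLock` helper on top of the keyed mutex might look like the sketch below; the `poolLocks` field name and signature are assumptions.

```go
// withPoolLock runs fn while holding the pool's keyed mutex, so claim and scale
// operations on the same pool cannot interleave.
func (p *DefaultProvider) withPoolLock(poolID int, fn func() error) error {
	key := strconv.Itoa(poolID)
	p.poolLocks.Lock(key)
	defer p.poolLocks.Unlock(key)
	return fn()
}
```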
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add ProviderConfig struct to LKE provider with CreateDeadline, TagVerificationTimeout, and RetryDelay fields. Add corresponding CLI flags and environment variables (LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, LKE_RETRY_DELAY) with defaults matching previous hardcoded values (10s, 4s, 2s). Thread config through operator initialization and test setup. Extract retry error check into isRetryableCreateError helper. Replace hardcoded sleep
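A sketch of the configuration shape this commit describes. The field names, env vars, and defaults come from the commit text; the `durationFromEnv` helper is hypothetical and shown only to illustrate the wiring.

```go
// ProviderConfig carries the LKE provider timeouts introduced in this change.
type ProviderConfig struct {
	CreateDeadline         time.Duration // overall budget for a create/claim attempt (default 10s)
	TagVerificationTimeout time.Duration // how long to wait for claim tags to become visible (default 4s)
	RetryDelay             time.Duration // pause between retryable attempts (default 2s)
}

func defaultProviderConfig() ProviderConfig {
	return ProviderConfig{
		CreateDeadline:         durationFromEnv("LKE_CREATE_DEADLINE", 10*time.Second),
		TagVerificationTimeout: durationFromEnv("LKE_TAG_VERIFICATION_TIMEOUT", 4*time.Second),
		RetryDelay:             durationFromEnv("LKE_RETRY_DELAY", 2*time.Second),
	}
}

// durationFromEnv reads a duration from the environment, falling back to the
// default when the variable is unset or unparseable.
func durationFromEnv(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
```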
Add lkeCreateDeadline, lkeTagVerificationTimeout, and lkeRetryDelay settings to chart values with defaults (10s, 4s, 2s). Wire settings through deployment template as LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, and LKE_RETRY_DELAY environment variables. Add documentation comments for each timeout parameter.
Force-pushed from 9985030 to 6534d08 (Compare)
  name: default
spec:
  disruption:
    consolidateAfter: 30s
Why are we setting this to 30s instead of the default of 0s?
	}
	tagMap[parts[0]] = parts[1]
	tagMap[key] = value
}
Is this to add "=" as a supported separator? If we can't stick with ":" across the board we should standardize on using "=".
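If the choice is to standardize on "=", the parsing could be as simple as the sketch below; the helper name `parseTag` is assumed, not the repo's.

```go
package utils

import "strings"

// parseTag splits a "key=value" tag, standardizing on "=" as the separator.
func parseTag(tag string) (key, value string, ok bool) {
	key, value, ok = strings.Cut(tag, "=")
	if !ok || key == "" {
		return "", "", false // malformed tag: missing separator or empty key
	}
	return key, value, true
}
```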
func (p *DefaultProvider) findClaimableInstanceStandard(ctx context.Context, pool *linodego.LKENodePool) (*linodego.Instance, error) {
	freshPool, err := p.client.GetLKENodePool(ctx, p.clusterID, pool.ID)
	if err != nil {
		if utils.IsRetryableError(err) {
Do we want to clean up this function in utils (or comment it out), since this was the only place it was being used?
Summary
This PR implements a complete, concurrency‑safe LKE node pool scale‑up flow for Karpenter. It enforces a strict 1:1 NodeClaim→Linode mapping, adds retryable error sentinels to distinguish transient failures, and brings Standard and Enterprise LKE tiers to parity. It also expands the fake Linode API to support more realistic tag filtering and node pool operations, adds LKE‑specific GC tests, and documents the full design.
What changed (by area)
LKE Provider
- Adds explicit retryable error sentinels (`ErrNoClaimableInstance`, `ErrClaimFailed`, `ErrPoolScaleFailed`, `ErrNodesProvisioning`) and a bounded retry loop.
- Claim/scale operations are scoped per (nodepool, instanceType).

Concurrency / Architecture
- `pkg/utils/keyedmutex.go`: new keyed mutex for pool-scoped synchronization.

Fake Linode API
- `pkg/fake/linodeapi.go`: tag filtering, `UpdateInstance`, and `DeleteLKENodePoolNode` support.

Linode SDK Interface
- `pkg/linode/sdk.go`: `LinodeAPI` extended with `UpdateInstance` and `DeleteLKENodePoolNode`.

Garbage Collection
- `pkg/controllers/nodeclaim/garbagecollection/suite_test.go`: LKE-specific GC tests for both tiers.

Docs + Examples
- `docs/lke-nodepool-scaleup.md`: new design doc.
- `examples/v1/simple.yaml`: `consolidateAfter: 30s` in the default NodePool example.
- golangci-lint wired into the Makefile `verify` target.

Architectural Notes

Tests
- `make test`
- `make verify`

Why this matters
This makes LKE mode robust for existing pool scale‑up, reduces race conditions, and improves observability and test coverage across both LKE tiers.