
[feat][improvement] new arch/implementation for scaling existing nodepools #63

Open

komer3 wants to merge 17 commits into main from scale-existing-lkenodepools

Conversation


komer3 (Contributor) commented Feb 4, 2026

Summary
This PR implements a complete, concurrency‑safe LKE node pool scale‑up flow for Karpenter. It enforces a strict 1:1 NodeClaim→Linode mapping, adds retryable error sentinels to distinguish transient failures, and brings Standard and Enterprise LKE tiers to parity. It also expands the fake Linode API to support more realistic tag filtering and node pool operations, adds LKE‑specific GC tests, and documents the full design.

What changed (by area)

LKE Provider

  • pkg/providers/lke/lke.go
    • Refactors Create() into smaller helpers (resolve instance type, lookup existing instance, pool locking).
    • Adds explicit retryable errors (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed, ErrNodesProvisioning) and a bounded retry loop (see the sketch after this list).
    • Introduces pool‑scoped keyed mutex to serialize claim/scale operations per (nodepool, instanceType).
    • Implements deterministic claim selection, tag verification, and pool scale‑up logic.
    • Handles Standard vs Enterprise pool discovery paths explicitly.
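
The retry design above can be pictured with a short sketch. This is not the PR's code; the sentinel messages, the attempt signature, and the timing values (a 10s overall deadline and a 2s delay, matching the defaults the PR later makes configurable) are assumptions for illustration:

```go
package lke

import (
	"context"
	"errors"
	"time"
)

// Retryable sentinels named in this PR; the real definitions live in
// pkg/providers/lke/lke.go and may differ in wording.
var (
	ErrNoClaimableInstance = errors.New("no claimable instance in pool")
	ErrClaimFailed         = errors.New("failed to claim instance")
	ErrPoolScaleFailed     = errors.New("failed to scale node pool")
	ErrNodesProvisioning   = errors.New("nodes still provisioning")
)

// isRetryableCreateError reports whether Create should try again rather
// than surface the error to Karpenter.
func isRetryableCreateError(err error) bool {
	return errors.Is(err, ErrNoClaimableInstance) ||
		errors.Is(err, ErrClaimFailed) ||
		errors.Is(err, ErrPoolScaleFailed) ||
		errors.Is(err, ErrNodesProvisioning)
}

// createWithRetry sketches the bounded claim-or-scale loop: retry only on
// the sentinels above, and give up once the deadline passes.
func createWithRetry(ctx context.Context, attempt func(context.Context) error) error {
	deadline := time.Now().Add(10 * time.Second)
	for {
		err := attempt(ctx)
		if err == nil || !isRetryableCreateError(err) {
			return err // success, or a fatal (non-sentinel) error
		}
		if time.Now().After(deadline) {
			return err // budget exhausted; surface the last retryable error
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}
```

Checking errors.Is against exported sentinels, rather than matching error strings, is what makes the retry policy explicit and testable in the way the architectural notes below describe.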

Concurrency / Architecture

  • pkg/utils/keyedmutex.go
    • New keyed mutex to guard pool‑scoped operations (pool lookup/create, claim/scale, orphan GC).
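
A minimal sketch of the keyed-mutex idea, assuming a map of per-key locks guarded by an outer mutex (field names here are illustrative; as a review comment below notes, the PR's own version calls the outer field mu):

```go
package utils

import "sync"

// KeyedMutex hands out one mutex per key so that operations on the same key
// (e.g. the same (nodepool, instanceType) pair) serialize while different
// keys proceed in parallel. Entries are never removed; the TODO quoted in
// a review comment below flags cleanup as future work if key cardinality grows.
type KeyedMutex struct {
	mapMutex sync.Mutex             // guards the locks map itself
	locks    map[string]*sync.Mutex // per-key locks, created lazily
}

func (k *KeyedMutex) Lock(key string) {
	k.mapMutex.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mapMutex.Unlock()
	m.Lock() // block on the per-key lock, not while holding mapMutex
}

// Unlock must follow a matching Lock for the same key.
func (k *KeyedMutex) Unlock(key string) {
	k.mapMutex.Lock()
	m := k.locks[key]
	k.mapMutex.Unlock()
	m.Unlock()
}
```

A caller would lock a composite key, e.g. km.Lock(fmt.Sprintf("%d/%s", poolID, instanceType)), and defer the matching Unlock.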

Fake Linode API

  • pkg/fake/linodeapi.go
    • Adds support for Linode list filter JSON with tag equality + contains (sketched after this list).
    • Adds UpdateInstance + DeleteLKENodePoolNode behaviors.
    • Ensures LKE cluster ID + tags are propagated to fake instances.
    • Improves parity with real API to support LKE tests.
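
As a rough illustration of the two filter shapes the fake supports (they are documented in the extractTagsFromFilter comment quoted later in this thread), here is a standalone matcher; the function name and exact placement are hypothetical:

```go
package fake

import (
	"encoding/json"
	"strings"
)

// matchesTagFilter evaluates the two X-Filter "tags" shapes this repo uses:
//   {"tags":"tag-a,tag-b"}          => every comma-separated value must be a tag
//   {"tags":{"+contains":"substr"}} => any tag containing substr matches
// Anything else is treated as "no tag filtering", mirroring the fake.
func matchesTagFilter(filter string, instanceTags []string) bool {
	var f struct {
		Tags json.RawMessage `json:"tags"`
	}
	if err := json.Unmarshal([]byte(filter), &f); err != nil || f.Tags == nil {
		return true // missing or unparseable filter: no filtering applied
	}

	tagSet := make(map[string]bool, len(instanceTags))
	for _, t := range instanceTags {
		tagSet[t] = true
	}

	var exact string
	if json.Unmarshal(f.Tags, &exact) == nil {
		for _, want := range strings.Split(exact, ",") {
			if !tagSet[want] {
				return false
			}
		}
		return true
	}

	var op struct {
		Contains string `json:"+contains"`
	}
	if json.Unmarshal(f.Tags, &op) == nil && op.Contains != "" {
		for _, t := range instanceTags {
			if strings.Contains(t, op.Contains) {
				return true
			}
		}
		return false
	}
	return true // unsupported operator: silently unfiltered (see review note below)
}
```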

Linode SDK Interface

  • pkg/linode/sdk.go
    • Extends Linode API interface with UpdateInstance + DeleteLKENodePoolNode.

Garbage Collection

  • pkg/controllers/nodeclaim/garbagecollection/suite_test.go
    • Adds LKE Standard + Enterprise GC tests:
      • orphaned node deletion
      • multi‑node pool deletion behavior
      • owned node protection
      • resolution window coverage

Docs + Examples

  • docs/lke-nodepool-scaleup.md
    • New design doc describing invariants, claim‑or‑scale flow, tier differences, and rate‑limit analysis.
  • examples/v1/simple.yaml
    • Sets consolidateAfter: 30s in default NodePool example.
  • Makefile
    • Adds golangci‑lint tooling to the verify target.

Architectural Notes

  • Pool‑scoped locking ensures no double‑claim or conflicting pool scale operations.
  • Sentinel error types make retryable vs fatal Create errors explicit and testable.
  • Enterprise tier relies on Linode instance list with auto‑tags; Standard relies on pool Linodes + per‑instance tag inspection.
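
The tier split might look roughly like the following against linodego. The nodepool=<id> tag literal, the claim-tag prefix, and both function names are assumptions pieced together from the commit messages below, not the PR's actual code:

```go
package lke

import (
	"context"
	"errors"
	"fmt"
	"strings"

	"github.com/linode/linodego"
)

var ErrNoClaimableInstance = errors.New("no claimable instance in pool") // sentinel, as sketched earlier

// discoverEnterprise finds a pool's instances in one call: enterprise LKE
// auto-tags each Linode with nodepool=<id>, so a filtered list suffices.
func discoverEnterprise(ctx context.Context, client *linodego.Client, poolID int) ([]linodego.Instance, error) {
	filter := fmt.Sprintf(`{"tags":"nodepool=%d"}`, poolID)
	return client.ListInstances(ctx, linodego.NewListOptions(0, filter))
}

// discoverStandard walks the pool's Linodes and inspects each instance's
// tags for a NodeClaim ownership tag; claimTagPrefix is a hypothetical
// stand-in for whatever claim tag the provider writes via UpdateInstance.
func discoverStandard(ctx context.Context, client *linodego.Client, clusterID, poolID int, claimTagPrefix string) (*linodego.Instance, error) {
	pool, err := client.GetLKENodePool(ctx, clusterID, poolID)
	if err != nil {
		return nil, err
	}
	for _, node := range pool.Linodes {
		if node.InstanceID == 0 {
			continue // still provisioning; the provider retries on this
		}
		inst, err := client.GetInstance(ctx, node.InstanceID)
		if err != nil {
			return nil, err
		}
		claimed := false
		for _, tag := range inst.Tags {
			if strings.HasPrefix(tag, claimTagPrefix) {
				claimed = true
				break
			}
		}
		if !claimed {
			return inst, nil // unclaimed instance: a claim candidate
		}
	}
	return nil, ErrNoClaimableInstance
}
```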

Tests

  • make test
  • make verify

Why this matters
This makes LKE mode robust for existing pool scale‑up, reduces race conditions, and improves observability and test coverage across both LKE tiers.


codecov-commenter commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 66.53696% with 172 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.32%. Comparing base (fee2a0b) to head (6534d08).

Files with missing lines | Patch % | Lines
pkg/providers/lke/lke.go | 64.81% | 44 Missing and 51 partials ⚠️
pkg/fake/linodeapi.go | 68.11% | 24 Missing and 20 partials ⚠️
pkg/utils/utils.go | 72.72% | 6 Missing and 3 partials ⚠️
pkg/controllers/nodeclass/controller.go | 28.57% | 2 Missing and 3 partials ⚠️
pkg/cloudprovider/cloudprovider.go | 63.63% | 0 Missing and 4 partials ⚠️
pkg/controllers/nodeclass/validation.go | 50.00% | 0 Missing and 3 partials ⚠️
pkg/cloudprovider/drift.go | 66.66% | 0 Missing and 2 partials ⚠️
pkg/controllers/controllers.go | 0.00% | 2 Missing ⚠️
pkg/operator/options/options.go | 60.00% | 2 Missing ⚠️
pkg/providers/instance/instance.go | 60.00% | 1 Missing and 1 partial ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   74.19%   66.32%   -7.88%     
==========================================
  Files          37       38       +1     
  Lines        2279     2634     +355     
==========================================
+ Hits         1691     1747      +56     
- Misses        452      563     +111     
- Partials      136      324     +188     

☔ View full report in Codecov by Sentry.

komer3 changed the title from "[WIP] Scale existing lkenodepools" to "[feat][improvement] new arch/implementation for scaling existing nodepools" on Feb 4, 2026
komer3 requested a review from Copilot on February 4, 2026 at 20:41
komer3 marked this pull request as ready for review on February 4, 2026 at 20:41
Copilot AI left a comment
Pull request overview

This PR refactors the LKE provider to support concurrency-safe multi-node pool scaling with strict 1:1 NodeClaim→Linode mapping. Key changes include replacing pool-level tags with instance-level tags for claim ownership, implementing pool-scoped locking, adding retryable error types, and expanding test coverage for both Standard and Enterprise LKE tiers.

Changes:

  • Replaces pool-level NodeClaim tags with instance-level claim tags using UpdateInstance API
  • Introduces keyed mutex for pool-scoped concurrency control and claim/scale operations
  • Adds explicit retryable error sentinels (ErrNoClaimableInstance, ErrClaimFailed, etc.) and bounded retry logic
  • Expands fake Linode API to support tag filtering, UpdateInstance, and DeleteLKENodePoolNode

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.

Summary per file:
pkg/providers/lke/lke.go | Core LKE provider refactor: claim-or-scale flow, instance-level tagging, pool locking, tier-specific logic
pkg/utils/keyedmutex.go | New keyed mutex for pool-scoped synchronization
pkg/utils/utils.go | Removed custom Filter type, added GetInstanceTagsForLKE, GetTagValue, switched to linodego.Filter
pkg/fake/linodeapi.go | Enhanced fake API: tag filtering, UpdateInstance, DeleteLKENodePoolNode support
pkg/providers/lke/suite_test.go | Expanded LKE tests for Standard/Enterprise tiers: idempotency, claiming, scaling, error paths
pkg/controllers/nodeclaim/garbagecollection/suite_test.go | Added LKE-specific GC tests for both tiers
pkg/providers/instance/types.go | Removed pool-specific fields from Instance type
pkg/linode/sdk.go | Extended LinodeAPI interface with UpdateInstance, DeleteLKENodePoolNode
pkg/operator/operator.go | Updated to pass cluster name to LKE provider and use linodego.Filter
docs/lke-nodepool-scaleup.md | New design doc detailing architecture, invariants, flow, and API call analysis
Comments suppressed due to low confidence (1)

pkg/providers/lke/suite_test.go:1

  • Corrected spelling of 'malformatedtag' to 'malformedtag'.

Copilot AI left a comment
Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

pkg/providers/lke/suite_test.go:1

  • Corrected field name from 'ClusterName' to 'ClusterID' to match the actual parameter type.

// TODO: If the cardinality of keys grows beyond these assumptions, consider
// adding a mechanism for periodic or usage-based cleanup of unused lock entries.
type KeyedMutex struct {
	mu sync.Mutex
Copilot AI commented Feb 4, 2026

The field name 'mu' is ambiguous. It should be renamed to 'mapMutex' or 'lockMapMutex' to clearly indicate that it protects access to the locks map, not the keyed operations themselves.

Comment on lines +300 to +308
// extractTagsFromFilter parses the linodego X-Filter JSON carried in ListOptions.Filter.
//
// This fake only implements the "tags" shapes used by this repo:
// - {"tags":"tag-a,tag-b"} => exact tag match (each comma-separated value must equal a tag string)
// - {"tags":{"+contains":"substr"}} => contains match (an instance matches if any tag contains substr)
//
// If the filter is missing/unparseable or uses unsupported operators, it is treated as no tag filtering.
// NOTE: This means tests that use other filter operators will silently pass without any filtering
// being applied, which can lead to false positives. Only the shapes documented above are enforced.
Copilot AI commented Feb 4, 2026

The warning about silent failures is important for test maintainability. Consider adding a mechanism to explicitly list which filter operators are supported and log/panic when unsupported operators are encountered, rather than silently ignoring them. This would catch test bugs earlier.

Comment on lines +206 to +208
if *createdPool || *scaledOnce {
	return nil, ErrNoClaimableInstance
}
Copilot AI commented Feb 4, 2026

The logic uses pointer flags to track pool creation and scaling state across retry attempts. Consider using a dedicated struct type (e.g., 'claimAttemptState') to encapsulate these flags, which would make the state management more explicit and easier to extend in the future.
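
For concreteness, the suggestion might look like this; claimAttemptState and its field comments are hypothetical names mirroring the quoted flags:

```go
// claimAttemptState tracks what a single Create call has already done,
// replacing the *createdPool / *scaledOnce pointer flags.
type claimAttemptState struct {
	createdPool bool // a new pool was created during this Create call
	scaledOnce  bool // the pool was already scaled up once this call
}
```

The quoted check would then read if state.createdPool || state.scaledOnce { return nil, ErrNoClaimableInstance }.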

return nil, cloudprovider.NewCreateError(fmt.Errorf("waiting for instance ID on pool %d", pool.ID), "NodePoolProvisioning", "Waiting for LKE instance to be ready")

if nodesProvisioning {
	return nil, ErrNodesProvisioning
Copilot AI commented Feb 4, 2026

The function returns ErrNodesProvisioning after detecting that nodes are still being created, but this occurs inside a loop that continues checking remaining nodes. Consider extracting this deferred error return logic into a separate variable that's checked after the loop completes, to make the control flow more explicit.

Add design document describing the core model, invariants, and flow for scaling existing LKE node pools in LKE mode. Covers 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs standard), concurrency model with keyed mutex, claim-or-scale loop with 15s timeout, and error handling policy.
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
This commit introduces comprehensive LKE NodePool scale-up functionality with proper concurrency control and extensive test coverage.

Major Changes:
- Add LKE NodePool scale-up capability to support existing NodePool expansion
- Implement KeyedMutex utility for per-key synchronization to prevent race conditions
- Enhance LKE provider with cluster name tracking and operation timeouts
- Add comprehensive test suites for garbage collection and LKE operations
Remove the `v1` import alias for the `v1alpha1` package throughout the codebase. Use the full `v1alpha1` package name directly to avoid confusion with the Karpenter v1 API. Update all references in cloudprovider, controllers, drift detection, and utility packages. Also fix incorrect finalizer constant reference from `v1.TerminationFinalizer` to `karpv1.TerminationFinalizer` in nodeclass controller.
Add explicit `ErrNodesProvisioning` error to handle nodes with zero InstanceID in standard tier. Check all nodes before returning error to prioritize finding claimable instances. Add 500ms sleep and retry when nodes are still provisioning. Remove redundant type declaration in fake client.
Extract LKE pool lookup into `findLKENodePoolFromLinodeInstanceID` helper to reduce duplication in Delete method. Remove redundant instance fetching, tag parsing, and pool mutex locking from Delete - all now handled by the helper. Use `findNodeInPool` helper for node lookup instead of inline loop. Update simple example to add 30s consolidation period. Reduce default Karpenter replicas from 2 to 1 in chart values.
…tag in LKE mode

Replace custom `lke-pool-id` tag with Linode's automatic `nodepool=<id>` tag for pool identification in enterprise tier. Use `LKEClusterID` field on instances for cluster association instead of tags. Update instance listing to filter by cluster ID field. Simplify pool resolution logic by using native Linode tags. Update fake client to set `LKEClusterID` on created instances. Remove pool ID tag from instance claim flow.

Fixed major issues with LKE enterprise creation flow (no more duplicate nodes being created due to API timing). Also fixed List() behaviour that was throwing a lot of errors.
komer3 and others added 8 commits February 5, 2026 15:24
- Instance model cleanup: Remove NodeID, Labels, Taints, PoolID from Instance struct; simplify NewLKEInstance constructor. Reduces per-instance memory and eliminates unused pool-derived fields. These fields were not being used at all by the instanceToNodeClaim() consumer.

- API call reduction: Eliminate pool lookups in Get/List/hydrateInstanceFromLinode; use direct instance data. Cuts ListInstances + GetLKENodePool calls per operation. Now we just do 1 API call.

- Enterprise test parity: Add comprehensive tests for enterprise-tier garbage collection with proper tag-based discovery and provider setup. Added the same comprehensive parity test cases to the LKE provider tests.

- Test suite restructuring: Reorganize LKE suite tests under explicit Standard/Enterprise tier contexts with consistent subcontexts (Create, Idempotency, Multi-node scaling, Tag verification, Error paths, Create options, Get, List, Delete, CreateTags). Remove duplicate top-level Create options and pool filtering tests.

- Documentation: Update LKE existing nodepool scaleup docs with clarified tier behaviours, API call budgets, and scalability notes (more info on potential rate limits we can hit).
…helper extraction. Reduced the linter complexity.

Extract instance type resolution, existing instance lookup, and pool locking into separate helper methods. Introduce explicit error types (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed) to distinguish retry-able conditions from fatal errors. Replace inline pool mutex lock/unlock with withPoolLock helper for cleaner resource management. Simplify attemptCreate by removing nodeID return value and using instance
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
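
A sketch of what the withPoolLock helper described above could look like, assuming the provider holds a keyed mutex like the one in pkg/utils/keyedmutex.go (the field name and key format are guesses):

```go
package lke

import "fmt"

type DefaultProvider struct {
	poolMu *KeyedMutex // keyed mutex as sketched earlier; other fields elided
}

// withPoolLock serializes fn against other operations on the same
// (pool, instance type) pair and releases the lock on every return path.
func (p *DefaultProvider) withPoolLock(poolID int, instanceType string, fn func() error) error {
	key := fmt.Sprintf("%d/%s", poolID, instanceType)
	p.poolMu.Lock(key)
	defer p.poolMu.Unlock(key)
	return fn()
}
```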
Add ProviderConfig struct to LKE provider with CreateDeadline, TagVerificationTimeout, and RetryDelay fields. Add corresponding CLI flags and environment variables (LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, LKE_RETRY_DELAY) with defaults matching previous hardcoded values (10s, 4s, 2s). Thread config through operator initialization and test setup. Extract retry error check into isRetryableCreateError helper. Replace hardcoded sleep
Add lkeCreateDeadline, lkeTagVerificationTimeout, and lkeRetryDelay settings to chart values with defaults (10s, 4s, 2s). Wire settings through deployment template as LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, and LKE_RETRY_DELAY environment variables. Add documentation comments for each timeout parameter.
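
Taken together, the two commits above describe wiring like the following sketch; durationFromEnv is a hypothetical helper, and the real code threads the struct through operator initialization and CLI flags rather than reading the environment directly here:

```go
package lke

import (
	"os"
	"time"
)

// ProviderConfig mirrors the fields named in the commits above.
type ProviderConfig struct {
	CreateDeadline         time.Duration // overall Create() budget (default 10s)
	TagVerificationTimeout time.Duration // wait for a claim tag to become visible (default 4s)
	RetryDelay             time.Duration // pause between retryable attempts (default 2s)
}

// configFromEnv sketches the env-var plumbing behind the chart settings.
func configFromEnv() ProviderConfig {
	return ProviderConfig{
		CreateDeadline:         durationFromEnv("LKE_CREATE_DEADLINE", 10*time.Second),
		TagVerificationTimeout: durationFromEnv("LKE_TAG_VERIFICATION_TIMEOUT", 4*time.Second),
		RetryDelay:             durationFromEnv("LKE_RETRY_DELAY", 2*time.Second),
	}
}

func durationFromEnv(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
```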
  name: default
spec:
  disruption:
    consolidateAfter: 30s
A contributor commented:
Why are we setting this to 30s instead of the default of 0s?

 	}
-	tagMap[parts[0]] = parts[1]
+	tagMap[key] = value
 }
AshleyDumaine (Contributor) commented Feb 5, 2026

Is this to add "=" as a supported separator? If we can't stick with ":" across the board we should standardize on using "=".

func (p *DefaultProvider) findClaimableInstanceStandard(ctx context.Context, pool *linodego.LKENodePool) (*linodego.Instance, error) {
	freshPool, err := p.client.GetLKENodePool(ctx, p.clusterID, pool.ID)
	if err != nil {
		if utils.IsRetryableError(err) {
A contributor commented:
Do we want to clean up this function from utils / comment it out since this was the only place it was getting used?
