Integrate scale subresource by means of an HPA and ScheduledScaling CRD #276
Conversation
This PR focuses on using a ScalingSchedule or ClusterScalingSchedule CRD for automatically scaling up an EDS to the required number of replicas. The HPA doesn't directly control or edit the replicas of the EDS. The replica count reported by the HPA is stored in a separate property which is taken into account during the calculation that determines the scaling operation. Signed-off-by: Girish Chandrashekar <[email protected]>
To avoid having unused nodes, and nodes with fewer shards than minShardsPerNode, during the HPA scale-up step, we now calculate the increase in index replicas required to maintain the minShardsPerNode ratio (or stay close to it) so that:
- quick scaling can be achieved in a single step
- an inconsistent state during the first HPA scale-up operation is avoided
- the nodes and indices are prepared for the expected increase in load due to a scheduled scaling or any other HPA trigger

Signed-off-by: Girish Chandrashekar <[email protected]>
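A hedged worked example of one plausible reading of that calculation; the relation and the rounding below are assumptions for illustration, not lifted from the code:

$$\text{shardsPerIndex} \times (\text{indexReplicas} + 1) \;\ge\; \text{minShardsPerNode} \times \text{nodeReplicas}$$

For instance, with 6 primary shards per index, `minShardsPerNode: 2`, and an HPA target of 8 nodes, we need $6(r+1) \ge 16$, hence $r+1 \ge \lceil 16/6 \rceil = 3$, so the indices need at least 2 replicas after the scale-up.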
Force-pushed from 253663d to fc49d1c
The stable version of metrics-server is available at https://objects.githubusercontent.com/github-production-release-asset-2e65be/92132038/cdc0a1f8-1732-4c26-b68e-876396de00ed?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220822%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220822T080848Z&X-Amz-Expires=300&X-Amz-Signature=a7d7186163ef64461dc4036c86b3a528a40deb969ae63783728b3bcd4f461730&X-Amz-SignedHeaders=host&actor_id=9018501&key_id=0&repo_id=92132038&response-content-disposition=attachment%3B%20filename%3Dcomponents.yaml&response-content-type=application%2Foctet-stream. There were some errors when running the metrics-server locally with kind. Signed-off-by: Girish Chandrashekar <[email protected]>
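In case it helps others hitting errors with metrics-server on kind: a common workaround is to disable kubelet TLS verification, since kind's kubelet certificates are self-signed. A hedged sketch of the relevant container args follows; the exact flag set varies by metrics-server version:

```yaml
# Hedged sketch: metrics-server deployment args commonly used on kind
# clusters, where TLS verification against the kubelets fails.
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP
        - --kubelet-use-node-status-port
        - --kubelet-insecure-tls  # kind kubelets serve self-signed certs
```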
Force-pushed from 2b42361 to 58447c2
@mikkeloscar Can you paste here the snippet that you prepared for creating the fake clientset? I cannot access our chat messages anymore. I have one pending unit test to complete this PR.
The Clientset had to be modified a bit to expose the internal clients for mocking in the test package. Signed-off-by: Girish Chandrashekar <[email protected]>
@otrosien I've completed the last task from my side. The e2e and documentation tests are taking longer than before, which causes a timeout on the workflow step. I can try to increase the CPU resources in the documentation manifests, but I'm not sure if it will help.
Thanks @girishc13, I'm still on a few other topics and will pick it up later this month.
@girishc13 sorry I missed your message. Here it is in case you didn't figure it out yet:

```go
package clientset

import (
	zFake "github.com/zalando-incubator/es-operator/pkg/client/clientset/versioned/fake"
	"github.com/zalando-incubator/es-operator/pkg/clientset"
	"k8s.io/client-go/kubernetes/fake"
	mFake "k8s.io/metrics/pkg/client/clientset/versioned/fake"
)

// NewFakeClientset returns a Clientset backed entirely by fake clients for
// use in unit tests. It assumes the internal client fields are accessible,
// as exposed for mocking by this PR.
func NewFakeClientset() *clientset.Clientset {
	return &clientset.Clientset{
		Interface:  fake.NewSimpleClientset(),  // core Kubernetes APIs
		zInterface: zFake.NewSimpleClientset(), // es-operator (zalando.org) CRDs
		mInterface: mFake.NewSimpleClientset(), // metrics APIs
	}
}
```

Didn't have time to check the PR yet :(
One-line summary
Integrate the Horizontal Pod Autoscaler to allow scaling an EDS using the ScalingSchedule custom resource.
Description
This PR focuses on using a ScalingSchedule or ClusterScalingSchedule CRD for automatically scaling up an EDS to the required number of replicas. The HPA doesn't directly control or edit the replicas of the EDS. The replica count reported by the HPA is stored in a separate property which is taken into account during the calculation that determines the scaling operation. The HPA targets the EDS for registering the target scaling counts.
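To make the wiring concrete, here is a hedged sketch of an HPA targeting the EDS through its scale subresource and driven by a ScalingSchedule object metric served by an adapter such as kube-metrics-adapter. All names, API versions, and target values below are illustrative assumptions, not taken from this PR:

```yaml
# Hedged sketch: HPA scaling an ElasticsearchDataSet via the scale
# subresource, using a ScalingSchedule object metric.
# Resource names and values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: es-data-hpa
spec:
  scaleTargetRef:
    apiVersion: zalando.org/v1
    kind: ElasticsearchDataSet
    name: es-data
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: zalando.org/v1
        kind: ScalingSchedule
        name: workday-peak
      metric:
        name: workday-peak
      target:
        type: AverageValue
        averageValue: "8"
```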
A new `hpa_replicas` property is introduced to the EDS spec. This property is controlled by the HPA via the scale subresource. There is also a corresponding status property. By default the HPA will not set the number of replicas to zero, therefore the default value of `hpa_replicas` is 1. The HPA scaling pattern respects the existing cool-down periods.

The initial `scalingHint()` method checks the `.spec.hpa_replicas` property in addition to the CPU metrics to determine the next scaling operation. The subsequent `scaleUpOrDown` has the core logic for incorporating the HPA replica count to calculate the next change in index or node replicas:

- The `MinNodeReplica` condition is satisfied first. The HPA replica count is satisfied in the next scaling step.
- `MaxShardsPerNode` is satisfied before satisfying the HPA replica count. The HPA replica count may be satisfied in the next scaling step.
- `MinIndexReplicas` is satisfied before satisfying the HPA replica count. The HPA replica count is satisfied in the next scaling step.
- `UP`: increase the index replicas as required to maintain the `MinShardsPerNode` condition.
- `DOWN`: reduce the index replicas according to the `MinIndexReplica` setting. Re-calculate the number of node replicas based on the new shard-to-node ratio.

A custom metrics adapter like kube-metrics-adapter is required to support the custom ScheduledScaling CRD. The custom metrics server is responsible for collecting replica counts or scaling values based on the CRD; an example schedule is sketched below.
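For reference, a hedged example of a ScalingSchedule in the format kube-metrics-adapter documents; the name, schedule, and value are illustrative assumptions, not part of this PR:

```yaml
# Hedged sketch: a ScalingSchedule that reports a scaling value of 8
# every weekday morning for 10 hours. Names and values are illustrative.
apiVersion: zalando.org/v1
kind: ScalingSchedule
metadata:
  name: workday-peak
spec:
  schedules:
  - type: Repeating
    durationMinutes: 600
    period:
      startTime: "08:00"
      timezone: "Europe/Berlin"
      days: [Mon, Tue, Wed, Thu, Fri]
    value: 8
```

An HPA consuming this object (see the sketch earlier in the description) would raise `hpa_replicas` on the EDS for the duration of the schedule, and the operator would then reconcile node and index replicas as described above.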