Bug: autoscaling.virtio-mem is flaky #1006

Omrigan · 2024-07-11T11:30:17Z

Fails roughly 50% of the time.

sharnoff · 2024-07-11T13:24:02Z

@Omrigan how long has it been flaky for? Can you link some example runs?

Omrigan · 2024-07-11T13:43:40Z

@Omrigan how long has it been flaky for?

I'd assume from the introduction.

Can you link some example runs?

Your PR: https://github.com/neondatabase/autoscaling/actions/runs/9878750381/job/27283615758#step:14:1034
My PR: https://github.com/neondatabase/autoscaling/actions/runs/9889976645/job/27317397731#step:14:948

I had 3 or 4 such cases, I think.

However, maybe it is not that particular test, here is the other test failing: https://github.com/neondatabase/autoscaling/actions/runs/9890395470/job/27319200362#step:14:987

stradig · 2024-07-15T15:15:17Z

We should check if this is only timeouts or if we suspect an actual problem. Until then marking as P1.

sharnoff · 2024-07-16T05:57:41Z

Haven't yet had a chance to look at it. However - @Omrigan if you're truly observing it fail 50% of the time, it may be due to your PR. The occurrence you flagged was the first time I'd seen it fail on one of my PRs.

sharnoff · 2024-08-07T17:53:09Z

A couple recent runs:

Both cases look like they timed out after 5 minutes on downscaling.

petuhovskiy · 2024-08-21T10:10:49Z

Reproduces locally in about ~2 hours. From the logs it seems like the issue is that VM uses too much memory sometimes and it prevents downscaling. I think the next step for debugging would be to print memory usage in logs periodically. For now I'll put this issue in selected, but want to get back to it later.

My logs

2024/08/21 02:49:07 running without a 'kuttl-test.yaml' configuration
2024/08/21 02:49:07 kutt-test config testdirs is overridden with args: [ tests/e2e ]
=== RUN   kuttl
    harness.go:465: starting setup
    harness.go:255: running tests using configured kubeconfig.
    harness.go:278: Successful connection to cluster at: https://127.0.0.1:33073
    harness.go:363: running tests
    harness.go:75: going to run test suite with timeout of 30 seconds for each step
    harness.go:375: testsuite: tests/e2e has 13 tests
=== RUN   kuttl/harness
=== RUN   kuttl/harness/autoscaling.virtio-mem
=== PAUSE kuttl/harness/autoscaling.virtio-mem
=== CONT  kuttl/harness/autoscaling.virtio-mem
    logger.go:42: 02:49:07 | autoscaling.virtio-mem | Creating namespace: kuttl-test-helpful-dolphin
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | starting test step 0-create-vm
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | Service:kuttl-test-helpful-dolphin/example created
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | VirtualMachine:kuttl-test-helpful-dolphin/example created
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | ConfigMap:kuttl-test-helpful-dolphin/example-config created
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/0-create-vm | test step completed 0-create-vm
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | starting test step 1-upscale
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | Pod:kuttl-test-helpful-dolphin/workload created
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | ConfigMap:kuttl-test-helpful-dolphin/query created
    logger.go:42: 02:49:26 | autoscaling.virtio-mem/1-upscale | test step completed 1-upscale
    logger.go:42: 02:49:26 | autoscaling.virtio-mem/2-downscale | starting test step 2-downscale
    logger.go:42: 02:54:28 | autoscaling.virtio-mem/2-downscale | test step failed 2-downscale
    case.go:366: failed in step 2-downscale
    case.go:368: --- VirtualMachine:kuttl-test-helpful-dolphin/example
        +++ VirtualMachine:kuttl-test-helpful-dolphin/example
        @@ -1,14 +1,37 @@
         apiVersion: vm.neon.tech/v1
         kind: VirtualMachine
         metadata:
        +  annotations:
        +    autoscaling.neon.tech/bounds: '{ "min": { "cpu": "250m", "mem": "1Gi" }, "max":
        +      {"cpu": 1, "mem": "4Gi" } }'
        +  finalizers:
        +  - vm.neon.tech/finalizer
        +  labels:
        +    autoscaling.neon.tech/enabled: "true"
           name: example
           namespace: kuttl-test-helpful-dolphin
        +spec: '[... elided field over 10 lines long ...]'
         status:
           conditions:
        -  - status: "True"
        +  - lastTransitionTime: "2024-08-21T00:49:10Z"
        +    message: Pod (example-jxxtr) for VirtualMachine (example) created successfully
        +    reason: Reconciling
        +    status: "True"
             type: Available
        -  cpus: 250m
        -  memorySize: 1Gi
        +  cpus: 750m
        +  currentRevision:
        +    revision:
        +      flags: 1
        +      value: 21
        +    updatedAt: "2024-08-21T00:54:20Z"
        +  extraNetIP: 10.100.128.1
        +  extraNetMask: 255.255.0.0
        +  memoryProvider: VirtioMem
        +  memorySize: 3Gi
        +  node: neonvm-arthur-worker
           phase: Running
        +  podIP: 10.244.2.134
        +  podName: example-jxxtr
           restartCount: 0
        +  sshSecretName: ssh-neonvm-example


    case.go:368: resource VirtualMachine:kuttl-test-helpful-dolphin/example: .status.memorySize: value mismatch, expected: 1Gi != actual: 3Gi
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | autoscaling.virtio-mem events from ns kuttl-test-helpful-dolphin:
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalPod example-jxxtr       Binding Scheduled       Successfully assigned kuttl-test-helpful-dolphin/example-jxxtr to neonvm-arthur-worker        autoscale-scheduler
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalPod example-jxxtr               AddedInterface  Add eth0 [10.244.2.134/24] from kindnet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             OverlayNet      Acquired IP 10.100.128.1/16 for overlay network interface     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             Created VirtualMachine example created, Pod example-jxxtr, SSH Secret ssh-neonvm-example      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr               AddedInterface  Add net1 [] from neonvm-system/neonvm-overlay-for-vms
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Pulled  Container image "vm-postgres:15-bullseye" already present on machine  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Created Created container init  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Started Started container init  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Pulled  Container image "runner:dev" already present on machine       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Created Created container neonvm-runner       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Started Started container neonvm-runner       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Pulled  Container image "runner:dev" already present on machine       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Created Created container neonvm-container-mgr        kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Started Started container neonvm-container-mgr        kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.5 cpu cores     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 1Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleDown       Runner pod cgroups was updated on VM example  virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.25 cpu cores    virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalPod workload            Scheduled       Successfully assigned kuttl-test-helpful-dolphin/workload to neonvm-arthur-worker     default-scheduler
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalPod workload            AddedInterface  Add eth0 [10.244.2.135/24] from kindnet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Pulled  Container image "postgres:15-bullseye" already present on machine     kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Created Created container wait-for-pgkubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Started Started container wait-for-pgkubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Pulled  Container image "postgres:15-bullseye" already present on machine     kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Created Created container pgbench       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Started Started container pgbench       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleUp Runner pod cgroups was updated on VM example  virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleUp Set virtio-mem size for 4Gi total memory      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 1 cpu cores       virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 2Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:25 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 4Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:26 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Killing Stopping container pgbench      kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleDown       Set virtio-mem size for 3Gi total memory      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.75 cpu cores    virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3840Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3712Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3456Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3328Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3200Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3Gi memory        virtualmachine-controller
    logger.go:42: 02:54:29 | autoscaling.virtio-mem | Deleting namespace: kuttl-test-helpful-dolphin
=== NAME  kuttl
    harness.go:408: run tests finished
    harness.go:516: cleaning up
    harness.go:573: removing temp folder: ""
--- FAIL: kuttl (327.52s)
    --- FAIL: kuttl/harness (0.00s)
        --- FAIL: kuttl/harness/autoscaling.virtio-mem (327.48s)
FAIL

stradig · 2024-10-08T14:49:32Z

This might be fixed by now. Arthur will reproduce and see if it is fixed.

petuhovskiy · 2024-10-09T08:06:17Z

https://gist.github.com/petuhovskiy/226522b34bd85a3a8d2d8ee88fa43dbd

Tried to reproduce, it failed only once in 360 runs. Let's consider it fixed for now.

Omrigan added the t/bug Issue Type: Bug label Jul 11, 2024

Omrigan assigned sharnoff Jul 11, 2024

sharnoff added a/ci Area: related to continuous integration a/test/flaky Area: related to flaky tests and removed t/bug Issue Type: Bug labels Jul 11, 2024

Omrigan added the m/good_first_issue Moment: when doing your first Neon contributions label Aug 5, 2024

sharnoff removed their assignment Aug 6, 2024

petuhovskiy self-assigned this Aug 6, 2024

petuhovskiy closed this as completed Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: autoscaling.virtio-mem is flaky #1006

Bug: autoscaling.virtio-mem is flaky #1006

Omrigan commented Jul 11, 2024

sharnoff commented Jul 11, 2024

Omrigan commented Jul 11, 2024

stradig commented Jul 15, 2024

sharnoff commented Jul 16, 2024

sharnoff commented Aug 7, 2024

petuhovskiy commented Aug 21, 2024

stradig commented Oct 8, 2024

petuhovskiy commented Oct 9, 2024

Bug: autoscaling.virtio-mem is flaky #1006

Bug: autoscaling.virtio-mem is flaky #1006

Comments

Omrigan commented Jul 11, 2024

sharnoff commented Jul 11, 2024

Omrigan commented Jul 11, 2024

stradig commented Jul 15, 2024

sharnoff commented Jul 16, 2024

sharnoff commented Aug 7, 2024

petuhovskiy commented Aug 21, 2024

stradig commented Oct 8, 2024

petuhovskiy commented Oct 9, 2024