Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: autoscaling.virtio-mem is flaky #1006

Closed
Omrigan opened this issue Jul 11, 2024 · 8 comments
Closed

Bug: autoscaling.virtio-mem is flaky #1006

Omrigan opened this issue Jul 11, 2024 · 8 comments
Assignees
Labels
a/ci Area: related to continuous integration a/test/flaky Area: related to flaky tests m/good_first_issue Moment: when doing your first Neon contributions

Comments

@Omrigan
Copy link
Contributor

Omrigan commented Jul 11, 2024

Fails roughly 50% of the time.

@Omrigan Omrigan added the t/bug Issue Type: Bug label Jul 11, 2024
@sharnoff
Copy link
Member

@Omrigan how long has it been flaky for? Can you link some example runs?

@sharnoff sharnoff added a/ci Area: related to continuous integration a/test/flaky Area: related to flaky tests and removed t/bug Issue Type: Bug labels Jul 11, 2024
@Omrigan
Copy link
Contributor Author

Omrigan commented Jul 11, 2024

@Omrigan how long has it been flaky for?

I'd assume from the introduction.

Can you link some example runs?

Your PR: https://github.com/neondatabase/autoscaling/actions/runs/9878750381/job/27283615758#step:14:1034
My PR: https://github.com/neondatabase/autoscaling/actions/runs/9889976645/job/27317397731#step:14:948

I had 3 or 4 such cases, I think.

However, maybe it is not that particular test, here is the other test failing: https://github.com/neondatabase/autoscaling/actions/runs/9890395470/job/27319200362#step:14:987

@stradig
Copy link
Contributor

stradig commented Jul 15, 2024

We should check if this is only timeouts or if we suspect an actual problem. Until then marking as P1.

@sharnoff
Copy link
Member

Haven't yet had a chance to look at it. However - @Omrigan if you're truly observing it fail 50% of the time, it may be due to your PR. The occurrence you flagged was the first time I'd seen it fail on one of my PRs.

@Omrigan Omrigan added the m/good_first_issue Moment: when doing your first Neon contributions label Aug 5, 2024
@sharnoff sharnoff removed their assignment Aug 6, 2024
@petuhovskiy petuhovskiy self-assigned this Aug 6, 2024
@sharnoff
Copy link
Member

sharnoff commented Aug 7, 2024

@petuhovskiy
Copy link
Member

Reproduces locally in about ~2 hours. From the logs it seems like the issue is that VM uses too much memory sometimes and it prevents downscaling. I think the next step for debugging would be to print memory usage in logs periodically. For now I'll put this issue in selected, but want to get back to it later.

My logs
2024/08/21 02:49:07 running without a 'kuttl-test.yaml' configuration
2024/08/21 02:49:07 kutt-test config testdirs is overridden with args: [ tests/e2e ]
=== RUN   kuttl
    harness.go:465: starting setup
    harness.go:255: running tests using configured kubeconfig.
    harness.go:278: Successful connection to cluster at: https://127.0.0.1:33073
    harness.go:363: running tests
    harness.go:75: going to run test suite with timeout of 30 seconds for each step
    harness.go:375: testsuite: tests/e2e has 13 tests
=== RUN   kuttl/harness
=== RUN   kuttl/harness/autoscaling.virtio-mem
=== PAUSE kuttl/harness/autoscaling.virtio-mem
=== CONT  kuttl/harness/autoscaling.virtio-mem
    logger.go:42: 02:49:07 | autoscaling.virtio-mem | Creating namespace: kuttl-test-helpful-dolphin
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | starting test step 0-create-vm
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | Service:kuttl-test-helpful-dolphin/example created
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | VirtualMachine:kuttl-test-helpful-dolphin/example created
    logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | ConfigMap:kuttl-test-helpful-dolphin/example-config created
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/0-create-vm | test step completed 0-create-vm
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | starting test step 1-upscale
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | Pod:kuttl-test-helpful-dolphin/workload created
    logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | ConfigMap:kuttl-test-helpful-dolphin/query created
    logger.go:42: 02:49:26 | autoscaling.virtio-mem/1-upscale | test step completed 1-upscale
    logger.go:42: 02:49:26 | autoscaling.virtio-mem/2-downscale | starting test step 2-downscale
    logger.go:42: 02:54:28 | autoscaling.virtio-mem/2-downscale | test step failed 2-downscale
    case.go:366: failed in step 2-downscale
    case.go:368: --- VirtualMachine:kuttl-test-helpful-dolphin/example
        +++ VirtualMachine:kuttl-test-helpful-dolphin/example
        @@ -1,14 +1,37 @@
         apiVersion: vm.neon.tech/v1
         kind: VirtualMachine
         metadata:
        +  annotations:
        +    autoscaling.neon.tech/bounds: '{ "min": { "cpu": "250m", "mem": "1Gi" }, "max":
        +      {"cpu": 1, "mem": "4Gi" } }'
        +  finalizers:
        +  - vm.neon.tech/finalizer
        +  labels:
        +    autoscaling.neon.tech/enabled: "true"
           name: example
           namespace: kuttl-test-helpful-dolphin
        +spec: '[... elided field over 10 lines long ...]'
         status:
           conditions:
        -  - status: "True"
        +  - lastTransitionTime: "2024-08-21T00:49:10Z"
        +    message: Pod (example-jxxtr) for VirtualMachine (example) created successfully
        +    reason: Reconciling
        +    status: "True"
             type: Available
        -  cpus: 250m
        -  memorySize: 1Gi
        +  cpus: 750m
        +  currentRevision:
        +    revision:
        +      flags: 1
        +      value: 21
        +    updatedAt: "2024-08-21T00:54:20Z"
        +  extraNetIP: 10.100.128.1
        +  extraNetMask: 255.255.0.0
        +  memoryProvider: VirtioMem
        +  memorySize: 3Gi
        +  node: neonvm-arthur-worker
           phase: Running
        +  podIP: 10.244.2.134
        +  podName: example-jxxtr
           restartCount: 0
        +  sshSecretName: ssh-neonvm-example


    case.go:368: resource VirtualMachine:kuttl-test-helpful-dolphin/example: .status.memorySize: value mismatch, expected: 1Gi != actual: 3Gi
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | autoscaling.virtio-mem events from ns kuttl-test-helpful-dolphin:
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalPod example-jxxtr       Binding Scheduled       Successfully assigned kuttl-test-helpful-dolphin/example-jxxtr to neonvm-arthur-worker        autoscale-scheduler
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalPod example-jxxtr               AddedInterface  Add eth0 [10.244.2.134/24] from kindnet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             OverlayNet      Acquired IP 10.100.128.1/16 for overlay network interface     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             Created VirtualMachine example created, Pod example-jxxtr, SSH Secret ssh-neonvm-example      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr               AddedInterface  Add net1 [] from neonvm-system/neonvm-overlay-for-vms
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Pulled  Container image "vm-postgres:15-bullseye" already present on machine  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Created Created container init  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST    NormalPod example-jxxtr.spec.initContainers{init}             Started Started container init  kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Pulled  Container image "runner:dev" already present on machine       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Created Created container neonvm-runner       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-runner}                Started Started container neonvm-runner       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Pulled  Container image "runner:dev" already present on machine       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Created Created container neonvm-container-mgr        kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalPod example-jxxtr.spec.containers{neonvm-container-mgr}         Started Started container neonvm-container-mgr        kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.5 cpu cores     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 1Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleDown       Runner pod cgroups was updated on VM example  virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.25 cpu cores    virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalPod workload            Scheduled       Successfully assigned kuttl-test-helpful-dolphin/workload to neonvm-arthur-worker     default-scheduler
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST    NormalPod workload            AddedInterface  Add eth0 [10.244.2.135/24] from kindnet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Pulled  Container image "postgres:15-bullseye" already present on machine     kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Created Created container wait-for-pgkubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.initContainers{wait-for-pg}           Started Started container wait-for-pgkubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Pulled  Container image "postgres:15-bullseye" already present on machine     kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Created Created container pgbench       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Started Started container pgbench       kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleUp Runner pod cgroups was updated on VM example  virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleUp Set virtio-mem size for 4Gi total memory      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 1 cpu cores       virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 2Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:25 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 4Gi memory        virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:26 +0200 CEST    NormalPod workload.spec.containers{pgbench}           Killing Stopping container pgbench      kubelet
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             ScaleDown       Set virtio-mem size for 3Gi total memory      virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             CpuInfo VirtualMachine example uses 0.75 cpu cores    virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3840Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3712Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3456Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3328Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3200Mi memory     virtualmachine-controller
    logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST    NormalVirtualMachine.vm.neon.tech example             MemoryInfo      VirtualMachine example uses 3Gi memory        virtualmachine-controller
    logger.go:42: 02:54:29 | autoscaling.virtio-mem | Deleting namespace: kuttl-test-helpful-dolphin
=== NAME  kuttl
    harness.go:408: run tests finished
    harness.go:516: cleaning up
    harness.go:573: removing temp folder: ""
--- FAIL: kuttl (327.52s)
    --- FAIL: kuttl/harness (0.00s)
        --- FAIL: kuttl/harness/autoscaling.virtio-mem (327.48s)
FAIL

@stradig
Copy link
Contributor

stradig commented Oct 8, 2024

This might be fixed by now. Arthur will reproduce and see if it is fixed.

@petuhovskiy
Copy link
Member

https://gist.github.com/petuhovskiy/226522b34bd85a3a8d2d8ee88fa43dbd

Tried to reproduce, it failed only once in 360 runs. Let's consider it fixed for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/ci Area: related to continuous integration a/test/flaky Area: related to flaky tests m/good_first_issue Moment: when doing your first Neon contributions
Projects
None yet
Development

No branches or pull requests

4 participants