
EHPA reports an error when the service has a large number of pods #907

Open
yangyegang opened this issue Jul 17, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@yangyegang

When the service has a large number of pods (my service has 150 pods), the get ehpa command shows the following error; when the pod count is small there is no problem:

the HPA was unable to compute the replica count: unable to get metric
tensorflow_serving_latency_999: unable to fetch metrics from custom metrics
API: Internal error occurred: unable to fetch metrics

prometheus-adapter has the following log:
E0717 06:07:04.633224 1 provider.go:150] unable to fetch metrics from prometheus: bad_response: unknown response code 414
I0717 06:07:04.633771 1 httplog.go:132] "HTTP" verb="GET" URI="/apis/custom.metrics.k8s.io/v1beta1/namespaces/qke-generic-jarvis-cupid-algo/pods/%2A/tensorflow_serving_latency_999?labelSelector=name%3Djarvis-ads-algo-cpx-e2-episode-pcvr-26035-qpaas-hslf" latency="339.610784ms" userAgent="kube-controller-manager/v1.24.15 (linux/amd64) kubernetes/887f5c3/system:serviceaccount:kube-system:horizontal-pod-autoscaler" audit-ID="08f823f8-afd8-43e0-b44c-fdc09a64b612" srcIP="10.188.121.103:58614" resp=500 statusStack=<

    goroutine 1930959 [running]:
    k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc001253a20, 0xc06407afc0?)
            /go/pkg/mod/k8s.io/[email protected]/pkg/server/httplog/httplog.go:320 +0x105
    k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc001253a20, 0xc0885aedc0?)
            /go/pkg/mod/k8s.io/[email protected]/pkg/server/httplog/httplog.go:300 +0x25
    k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc06407aff0, 0x9100000000000010?)
            /go/pkg/mod/k8s.io/[email protected]/pkg/server/filters/timeout.go:239 +0x1c8
    k8s.io/apiserver/pkg/endpoints/metrics.(*ResponseWriterDelegator).WriteHeader(0x1f559e0?, 0xc0885aedc0?)
            /go/pkg/mod/k8s.io/[email protected]/pkg/endpoints/metrics/metrics.go:737 +0x29
    k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.(*deferredResponseWriter).Write(0xc0b112a120, {0xc00f028000, 0x99, 0x9f})
            /go/pkg/mod/k8s.io/[email protected]/pkg/endpoints/handlers/responsewriters/writers.go:243 +0x642
    k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).doEncode(0xc000a19a00, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0, 0xc0b112a120}, {0x272adc0?, 0x392f298?})
            /go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:228 +0x5b9
    k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).encode(0xc000a19a00, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x272adc0?, 0x392f298?})
            /go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:181 +0x13d
    k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Encode(0x0?, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0?, 0xc0b112a120?})
            /go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/protobuf/protobuf.go:174 +0x3b
    k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc0015ac3c0, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x0?, 0x0?})
            /go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/versioning/versioning.go:268 +0xc05
    k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).encode(0xc0015ac3c0, {0x27335e8, 0xc0015ac320}, {0x272b1e0, 0xc0b112a120}, {0x0?, 0x0?})
            /go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/versioning/versioning.go:214 +0x167
    k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0x274ae08?, {0x27335e8?, 0xc0015ac320?}, {0x272b1e0?, 0xc0b112a120?})
yangyegang added the kind/bug (Something isn't working) label on Jul 17, 2024
@yangyegang
Author

The prometheus query configured in the ehpa is as follows:
annotations:
  metric-query.autoscaling.crane.io/services.tensorflow_serving_latency_999: avg(tensorflow_serving_latency_999{namespace="namespace",pod=~"abcd.*"})
But prometheus-adapter logs the following API request failing because the URI is too long:
GET http://..../api/v1/query?query=sum%28tensorflow_serving_latency_999%7Bnamespace%3D%22qke-generic-jarvis-cupid-algo%22%2Cpod%3D~%22jarvis-ads-algo-cpx-e2-episode-pcvr-26035-qpaas-hslf-6d56d225cd.......
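
URL-decoding that request (keeping the truncation exactly as it appears in the log) gives roughly this query, with the workload's pod names expanded into the pod matcher:

sum(tensorflow_serving_latency_999{namespace="qke-generic-jarvis-cupid-algo",pod=~"jarvis-ads-algo-cpx-e2-episode-pcvr-26035-qpaas-hslf-6d56d225cd.......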
This URI lists every pod of the workload, which is what makes it too long. But the query configured in the ehpa is avg(tensorflow_serving_latency_999{namespace="namespace",pod=~"abcd.*"}), so where does the failing URI request come from?

@whitebear009
Contributor

If this metric's type is Pods, the query on the prometheus-adapter side automatically carries the pod label. You can change prometheus-adapter's query method to POST; this can be changed in the startup arguments.
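
A minimal sketch of what that change could look like in the prometheus-adapter Deployment, assuming the adapter version in use exposes a flag for the query method (the flag name below is an assumption for illustration; check the adapter binary's --help output for the exact name in your version):

    # prometheus-adapter Deployment, container args (excerpt)
    args:
    - --prometheus-url=http://prometheus.monitoring.svc:9090   # adjust to your Prometheus endpoint
    - --prometheus-query-method=POST   # hypothetical flag name: send queries as POST so the long pod matcher goes in the request body

Sending the query as a POST moves the expanded pod regex out of the request URI, which is what triggers the 414 once the workload reaches roughly 150 pods.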
