flaky TestInstantQuerySplittingCorrectness #10064

Open
seizethedave opened this issue Nov 30, 2024 · 8 comments
@seizethedave
Contributor

CI log.

--- FAIL: TestInstantQuerySplittingCorrectness (26.68s)
    --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z (0.23s)
        --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram (0.26s)
            --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest (0.26s)
                querysharding_test.go:132: 
                    	Error Trace:	/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:132
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:103
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:110
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/split_by_instant_interval_test.go:550
                    	Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.016357675077691952 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series []
FAIL
FAIL	github.com/grafana/mimir/pkg/frontend/querymiddleware	201.859s
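
For readers unfamiliar with the assertion in the log: the message format appears to match a relative-error (epsilon) comparison, where the check fails when |expected - actual| / |expected| exceeds the tolerance (here 1e-12). The following is a minimal sketch of such a check, not the code in `querysharding_test.go`; the function names and the sample values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// relativeError returns |expected-actual| / |expected|.
// It is a hypothetical helper for illustration only.
func relativeError(expected, actual float64) (float64, error) {
	if math.IsNaN(expected) || math.IsNaN(actual) {
		return 0, errors.New("relative error is undefined for NaN")
	}
	if expected == 0 {
		return 0, errors.New("relative error is undefined when expected is zero")
	}
	return math.Abs(expected-actual) / math.Abs(expected), nil
}

// withinEpsilon reports whether actual is within the relative tolerance
// epsilon of expected. Undefined cases are treated as failures.
func withinEpsilon(expected, actual, epsilon float64) bool {
	relErr, err := relativeError(expected, actual)
	return err == nil && relErr <= epsilon
}

func main() {
	const epsilon = 1e-12 // tolerance shown in the failure message

	// Made-up values: a difference at the level of floating-point noise
	// passes, while a difference of about 1.6% does not.
	fmt.Println(withinEpsilon(1.0, 1.0+1e-15, epsilon))
	fmt.Println(withinEpsilon(1.0, 1.0164, epsilon))
}
```

A reported relative error around 0.016 or 0.13 is many orders of magnitude above 1e-12, which suggests the two results genuinely differ rather than diverging through rounding alone.
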
@dimitarvdimitrov
Contributor

This happened again, and the failure is in the same place. @krajorama @fionaliao do any recent histogram changes come to mind?

Details

--- FAIL: TestInstantQuerySplittingCorrectness (27.38s)
    --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z (0.28s)
        --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram (0.26s)
            --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest (0.26s)
                querysharding_test.go:132: 
                    	Error Trace:	/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:132
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:103
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:110
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/split_by_instant_interval_test.go:550
                    	Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.016357675077691952 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series []
FAIL

@colega
Contributor

colega commented Dec 9, 2024

It's failing very frequently.

@chencs
Contributor

chencs commented Dec 9, 2024

This is making 2.15 release PRs pretty painful.

--- FAIL: TestInstantQuerySplittingCorrectness (27.99s)
    --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z (0.25s)
        --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram (0.40s)
            --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest (0.40s)
                querysharding_test.go:132: 
                    	Error Trace:	/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:132
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:103
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:110
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/split_by_instant_interval_test.go:550
                    	Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.13116335669173304 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series [{group_1 2}]

@krajorama krajorama self-assigned this Dec 10, 2024
@krajorama
Contributor

krajorama commented Dec 10, 2024

Got pinged in Slack, but I actually see a different case failing more often locally:

Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.13827172919340028 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series [{group_1 2}]

@krajorama
Contributor

Seems like it started to fail after #7219 was merged back in January :(

@krajorama
Contributor

krajorama commented Dec 10, 2024

This is a duplicate of #7808. And the workaround in #7504 still works. Maybe we should reconsider applying the workaround, as I'm not sure when we'll get to improving the engine itself, where the problem originates.

Also it's weird that the race detector doesn't catch it.

@dimitarvdimitrov
Contributor

> And the workaround in #7504 still works.

Can we merge that workaround and add a linter rule so that we don't use *promql.StorageSeries in tests?
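
One way such a linter rule could look is a small AST check that flags any reference to `promql.StorageSeries`. This is only a sketch: it matches the selector syntactically, so it assumes the package is imported under the name `promql`. The function name `findStorageSeriesUses` and the sample source are hypothetical, and this is not how the Mimir lint setup is known to work.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findStorageSeriesUses parses src and returns the line numbers of every
// selector expression of the form promql.StorageSeries.
// Assumption: the package is imported under the name "promql".
func findStorageSeriesUses(filename, src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}

	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		sel, ok := n.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if ok && pkg.Name == "promql" && sel.Sel.Name == "StorageSeries" {
			lines = append(lines, fset.Position(sel.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

// sample is a made-up test file used only to exercise the check.
const sample = `package example_test

import "github.com/prometheus/prometheus/promql"

var s *promql.StorageSeries
`

func main() {
	lines, err := findStorageSeriesUses("example_test.go", sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, line := range lines {
		fmt.Printf("example_test.go:%d: use of promql.StorageSeries in a test\n", line)
	}
}
```

In practice the check would be restricted to `_test.go` files and wired into whatever lint runner the repository already uses. A type-aware analyzer would be more robust than this name-based match.
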

@zenador
Contributor

zenador commented Dec 18, 2024

Consecutive failures ):

https://github.com/grafana/mimir/actions/runs/12400788936/job/34618764750?pr=10277

--- FAIL: TestInstantQuerySplittingCorrectness (27.51s)
    --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z (0.23s)
        --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram (0.30s)
            --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest (0.30s)
                querysharding_test.go:132: 
                    	Error Trace:	/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:132
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:103
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:110
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/split_by_instant_interval_test.go:550
                    	Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.13116335669173304 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_grouping_'by'_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series [{group_1 2}]

https://github.com/grafana/mimir/actions/runs/12400788936/job/34619827656?pr=10277

--- FAIL: TestInstantQuerySplittingCorrectness (27.82s)
    --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z (0.25s)
        --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram (0.28s)
            --- FAIL: TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest (0.28s)
                querysharding_test.go:132: 
                    	Error Trace:	/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:132
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:103
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/querysharding_test.go:110
                    	            				/__w/mimir/mimir/pkg/frontend/querymiddleware/split_by_instant_interval_test.go:550
                    	Error:      	Relative error is too high: 1e-12 (expected)
                    	            	        < 0.016357675077691952 (actual)
                    	Test:       	TestInstantQuerySplittingCorrectness/start=2020-01-01T03:00:00Z/sum(rate)_for_native_histogram/*querymiddleware.PrometheusInstantQueryRequest
                    	Messages:   	histogram value at position 0 with timestamp 1577849400000 for series []
