[fix](cloud) CloudUpgradeMgr inspect and abort failed conflict txns while waiting#60830

Open
deardeng wants to merge 3 commits into apache:master from deardeng:cloud-upgrade-conflict-txn-abort-tests

@deardeng
Contributor

When CloudUpgradeMgr waits for unfinished transactions after registering
watershed txn ids, it now proactively inspects conflict transactions for
the target db/table set and logs sampled txn details for diagnosis.

If enable_abort_txn_by_checking_conflict_txn is enabled, the manager
invokes GlobalTransactionMgr.checkFailedTxns() and aborts failed txns to
reduce the chance of upgrade being blocked by stale/conflicting txns.
Abort failures are handled per txn and do not stop processing the rest.

This commit also adds tests:

  • FE UT CloudUpgradeMgrTest to verify enabled/disabled behavior and
    continue-on-abort-error semantics.
  • cloud multi_cluster docker regression case test_unfinished_txn_2pc.groovy
    to reproduce and validate long-running unfinished 2PC txn behavior.
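
The continue-on-abort-error semantics described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ConflictTxnAbortSketch`, `TxnAborter`, `Result`, and `abortFailedTxns` are hypothetical names, and the plain list of txn ids stands in for whatever `GlobalTransactionMgr.checkFailedTxns()` actually returns. The `enableAbort` flag only models the role of `enable_abort_txn_by_checking_conflict_txn`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConflictTxnAbortSketch {
    /** Hypothetical stand-in for the component that aborts a single txn. */
    interface TxnAborter {
        void abort(long txnId) throws Exception;
    }

    /** Outcome of one sweep over the failed conflict txns. */
    static final class Result {
        final List<Long> aborted = new ArrayList<>();
        final List<Long> abortFailed = new ArrayList<>();
    }

    /**
     * Aborts each failed txn independently. An exception raised for one txn
     * is recorded and the loop moves on to the next txn.
     * When enableAbort is false nothing is aborted (inspect-only mode).
     */
    static Result abortFailedTxns(boolean enableAbort, List<Long> failedTxnIds, TxnAborter aborter) {
        Result result = new Result();
        if (!enableAbort) {
            return result;
        }
        for (long txnId : failedTxnIds) {
            try {
                aborter.abort(txnId);
                result.aborted.add(txnId);
            } catch (Exception e) {
                // Per-txn handling: remember the failure, keep processing the rest.
                result.abortFailed.add(txnId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Txn 2 fails to abort; txns 1 and 3 are still processed.
        Result r = abortFailedTxns(true, Arrays.asList(1L, 2L, 3L), id -> {
            if (id == 2L) {
                throw new Exception("simulated abort failure");
            }
        });
        System.out.println("aborted=" + r.aborted + " abortFailed=" + r.abortFailed);
    }
}
```

Collecting failures instead of rethrowing keeps one problematic txn from blocking the abort of the others, which matches the behavior the unit test is said to verify.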

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Feb 25, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 28808 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 292e82d8902fe59ec975129cb4ed2f9dbfc175f5, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17619	4484	4282	4282
q2	q3	10645	780	517	517
q4	4676	360	251	251
q5	7561	1189	1001	1001
q6	179	177	148	148
q7	788	846	653	653
q8	9305	1442	1350	1350
q9	4761	4727	4716	4716
q10	6776	1867	1654	1654
q11	450	263	244	244
q12	699	584	465	465
q13	17793	4247	3427	3427
q14	237	240	212	212
q15	922	788	784	784
q16	725	724	682	682
q17	725	865	446	446
q18	5953	5367	5242	5242
q19	1235	983	593	593
q20	503	496	390	390
q21	4807	2050	1490	1490
q22	359	297	261	261
Total cold run time: 96718 ms
Total hot run time: 28808 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4632	4554	4450	4450
q2	q3	1786	2228	1755	1755
q4	867	1203	771	771
q5	4041	4336	4409	4336
q6	190	181	146	146
q7	1765	1649	1516	1516
q8	2446	2837	2582	2582
q9	7404	7273	7486	7273
q10	2659	2862	2446	2446
q11	515	437	407	407
q12	500	617	460	460
q13	4023	4389	3555	3555
q14	309	307	281	281
q15	843	798	825	798
q16	715	755	693	693
q17	1235	1693	1313	1313
q18	7160	6955	6696	6696
q19	995	901	909	901
q20	2058	2217	2064	2064
q21	4148	3433	3495	3433
q22	483	460	380	380
Total cold run time: 48774 ms
Total hot run time: 46256 ms

@doris-robot

TPC-DS: Total hot run time: 184147 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 292e82d8902fe59ec975129cb4ed2f9dbfc175f5, data reload: false

query5	4876	613	510	510
query6	325	223	199	199
query7	4217	467	266	266
query8	348	250	239	239
query9	8761	2690	2696	2690
query10	537	363	326	326
query11	17053	16823	16560	16560
query12	179	122	121	121
query13	1262	440	339	339
query14	6385	3211	2919	2919
query14_1	2800	2781	2861	2781
query15	219	191	180	180
query16	990	473	454	454
query17	1069	709	614	614
query18	2534	422	342	342
query19	199	201	179	179
query20	133	125	124	124
query21	221	147	125	125
query22	4744	6226	5924	5924
query23	17606	17110	16884	16884
query23_1	17244	16847	16605	16605
query24	7116	1602	1230	1230
query24_1	1238	1228	1226	1226
query25	567	484	451	451
query26	1231	266	154	154
query27	2776	464	290	290
query28	4529	1841	1833	1833
query29	812	607	489	489
query30	310	252	210	210
query31	873	726	655	655
query32	79	78	75	75
query33	528	336	291	291
query34	927	910	562	562
query35	641	700	603	603
query36	1081	1091	993	993
query37	141	93	84	84
query38	2917	2917	2886	2886
query39	918	877	846	846
query39_1	816	808	810	808
query40	235	157	140	140
query41	67	63	63	63
query42	108	107	106	106
query43	378	403	375	375
query44	
query45	203	191	188	188
query46	877	1008	622	622
query47	2089	2127	2089	2089
query48	311	321	228	228
query49	645	476	387	387
query50	674	273	209	209
query51	4069	4139	4039	4039
query52	110	109	97	97
query53	292	339	281	281
query54	312	280	267	267
query55	92	90	83	83
query56	312	325	328	325
query57	1398	1324	1283	1283
query58	301	285	279	279
query59	2549	2666	2578	2578
query60	347	344	332	332
query61	170	160	166	160
query62	633	619	548	548
query63	311	283	275	275
query64	4916	1258	991	991
query65	
query66	1443	468	348	348
query67	16371	16466	16382	16382
query68	
query69	390	306	269	269
query70	985	926	975	926
query71	333	295	293	293
query72	2680	2629	2356	2356
query73	525	544	314	314
query74	10059	9912	9742	9742
query75	2822	2731	2453	2453
query76	2287	1021	671	671
query77	353	382	292	292
query78	11314	11639	10734	10734
query79	1121	794	587	587
query80	1567	612	532	532
query81	619	278	251	251
query82	1018	153	121	121
query83	351	251	249	249
query84	251	120	101	101
query85	1184	479	419	419
query86	410	306	296	296
query87	3117	3081	2991	2991
query88	3589	2674	2641	2641
query89	428	377	335	335
query90	1906	176	181	176
query91	166	155	128	128
query92	91	76	69	69
query93	1134	862	504	504
query94	651	312	308	308
query95	585	347	387	347
query96	647	526	226	226
query97	2444	2480	2466	2466
query98	228	218	222	218
query99	1016	982	920	920
Total cold run time: 254016 ms
Total hot run time: 184147 ms

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 28991 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ea108a52e261fe7acd8fc102cf95dc9cfd82c4df, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17647	4466	4312	4312
q2	q3	10649	804	525	525
q4	4677	367	258	258
q5	7616	1202	1023	1023
q6	179	174	145	145
q7	791	866	661	661
q8	10257	1514	1360	1360
q9	5251	4825	4710	4710
q10	6825	1896	1650	1650
q11	471	268	247	247
q12	691	563	462	462
q13	17770	4240	3440	3440
q14	235	233	211	211
q15	952	819	798	798
q16	776	739	668	668
q17	706	883	447	447
q18	6013	5354	5312	5312
q19	1365	997	603	603
q20	527	500	397	397
q21	4827	1989	1491	1491
q22	376	315	271	271
Total cold run time: 98601 ms
Total hot run time: 28991 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4705	4582	4651	4582
q2	q3	1944	2277	1798	1798
q4	1019	1220	796	796
q5	4078	4346	4347	4346
q6	191	186	146	146
q7	1797	1650	1554	1554
q8	2480	2726	2597	2597
q9	7547	7401	7448	7401
q10	2642	2821	2408	2408
q11	502	440	407	407
q12	564	620	444	444
q13	4027	4531	3548	3548
q14	276	285	259	259
q15	867	810	799	799
q16	710	775	756	756
q17	1215	1588	1323	1323
q18	7106	6833	6686	6686
q19	923	1022	907	907
q20	2099	2203	1979	1979
q21	4066	3500	3388	3388
q22	450	404	379	379
Total cold run time: 49208 ms
Total hot run time: 46503 ms

@doris-robot

TPC-DS: Total hot run time: 184238 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ea108a52e261fe7acd8fc102cf95dc9cfd82c4df, data reload: false

query5	4334	644	522	522
query6	333	225	213	213
query7	4241	476	283	283
query8	338	246	240	240
query9	8690	2785	2766	2766
query10	497	396	331	331
query11	17044	17518	17249	17249
query12	194	129	135	129
query13	1276	516	359	359
query14	7283	3301	3114	3114
query14_1	2970	3051	2958	2958
query15	214	192	193	192
query16	1021	495	478	478
query17	1339	808	653	653
query18	2608	487	381	381
query19	210	219	198	198
query20	148	162	148	148
query21	355	146	137	137
query22	4921	4677	4914	4677
query23	17336	16784	16650	16650
query23_1	16761	16710	16683	16683
query24	7136	1629	1221	1221
query24_1	1241	1244	1229	1229
query25	578	532	401	401
query26	1227	275	146	146
query27	2750	466	281	281
query28	4453	1857	1830	1830
query29	802	568	468	468
query30	313	243	207	207
query31	877	743	645	645
query32	82	72	72	72
query33	511	340	278	278
query34	929	922	550	550
query35	633	671	591	591
query36	1068	1077	994	994
query37	142	95	84	84
query38	2931	2923	2910	2910
query39	888	883	854	854
query39_1	876	828	822	822
query40	229	149	136	136
query41	66	61	59	59
query42	108	102	101	101
query43	372	380	345	345
query44	
query45	199	187	180	180
query46	883	994	604	604
query47	2084	2109	2037	2037
query48	331	340	238	238
query49	639	452	380	380
query50	683	275	214	214
query51	4081	4129	4089	4089
query52	109	111	95	95
query53	297	335	280	280
query54	289	258	266	258
query55	90	92	85	85
query56	318	354	308	308
query57	1362	1314	1250	1250
query58	295	297	263	263
query59	2552	2660	2539	2539
query60	340	331	333	331
query61	146	144	145	144
query62	615	585	565	565
query63	316	280	274	274
query64	4915	1262	984	984
query65	
query66	1414	447	350	350
query67	16539	16443	16372	16372
query68	
query69	389	316	277	277
query70	1003	957	996	957
query71	334	319	302	302
query72	2727	2617	2426	2426
query73	551	549	323	323
query74	10031	9914	9819	9819
query75	2843	2754	2452	2452
query76	2302	1031	698	698
query77	359	414	295	295
query78	11210	11373	10745	10745
query79	2682	793	607	607
query80	1760	609	535	535
query81	563	287	250	250
query82	1009	151	113	113
query83	326	260	242	242
query84	255	122	101	101
query85	926	508	437	437
query86	407	311	315	311
query87	3167	3048	3024	3024
query88	3599	2676	2661	2661
query89	429	368	351	351
query90	2019	173	165	165
query91	158	159	153	153
query92	76	71	72	71
query93	1140	843	513	513
query94	630	325	307	307
query95	594	337	371	337
query96	640	526	227	227
query97	2453	2497	2404	2404
query98	228	219	217	217
query99	1041	992	916	916
Total cold run time: 256015 ms
Total hot run time: 184238 ms
