[bugfix](becore) Fix thread safety issue in BaseTablet destructor #45747

Open
wants to merge 1 commit into
base: master

felixwluo
Contributor

What problem does this PR solve?

Core Dump

(gdb) bt
#0  0x000055f476bcda1d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f1187acbb00)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:168
#1  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f12bbeaac98)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:702
#2  std::__shared_ptr<doris::MetricEntity, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f12bbeaac90)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1149
#3  doris::BaseTablet::~BaseTablet (this=0x7f12bbeaac10) at /root/be/src/olap/base_tablet.cpp:53
#4  0x000055f476beabbb in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f12bbeaac00)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:168
#5  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f11b8d046c8)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:702
#6  std::__shared_ptr<doris::Tablet, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f11b8d046c0)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1149
#7  std::destroy_at<std::shared_ptr<doris::Tablet> > (__location=0x7f11b8d046c0)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:88
#8  std::_Destroy<std::shared_ptr<doris::Tablet> > (__pointer=0x7f11b8d046c0)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:138
#9  std::_Destroy_aux<false>::__destroy<std::shared_ptr<doris::Tablet>*> (__first=0x7f11b8d046c0, __last=0x7f11b8d04c80)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:152
#10 std::_Destroy<std::shared_ptr<doris::Tablet>*> (__first=<optimized out>, __last=0x7f11b8d04c80)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_construct.h:184
#11 std::_Destroy<std::shared_ptr<doris::Tablet>*, std::shared_ptr<doris::Tablet> > (__first=<optimized out>, __last=0x7f11b8d04c80)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:746
#12 std::vector<std::shared_ptr<doris::Tablet>, std::allocator<std::shared_ptr<doris::Tablet> > >::~vector (this=<optimized out>)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:680
#13 doris::TabletManager::start_trash_sweep()::$_2::operator()() const (this=<optimized out>) at /root/be/src/olap/tablet_manager.cpp:1105
#14 doris::TabletManager::start_trash_sweep (this=0x7f17fc2d1d00) at /root/be/src/olap/tablet_manager.cpp:1110
#15 0x000055f4761ac0c6 in doris::StorageEngine::start_trash_sweep (this=0x7f17fbef7000, usage=0x7f150f1bf3d0, ignore_guard=<optimized out>)
    at /root/be/src/olap/storage_engine.cpp:803
#16 0x000055f476a355e6 in doris::StorageEngine::_garbage_sweeper_thread_callback (this=0x7f17fbef7000) at /root/be/src/olap/olap_server.cpp:300
#17 0x000055f47707da51 in std::function<void ()>::operator()() const (this=0x7f1187acbb00)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:560
#18 doris::Thread::supervise_thread (arg=0x7f17fbf7da40) at /root/be/src/util/thread.cpp:498
#19 0x00007f182d17fea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f182dbae9fd in clone () from /lib64/libc.so.6

Root cause
The crash occurs in the BaseTablet destructor while it releases _metric_entity. The memory contents in the core dump show that the use count of _metric_entity's control block is already 0 while a weak reference still exists (_M_weak_count = 1). In a multithreaded environment, a race condition may occur between deregister_entity and reset_metric_entity.

GDB

(gdb) f 0
#0  0x000055f476bcda1d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f1187acbb00)
    at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:168
168     in /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h
(gdb) p *this
$18 = {<std::_Mutex_base<(__gnu_cxx::_Lock_policy)2>> = {<No data fields>}, _vptr$_Sp_counted_base = 0x55f46f61696a, _M_use_count = 0, 
  _M_weak_count = 1}
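
To illustrate the kind of race described above, here is a minimal sketch of one common way to make the release of a shared_ptr member safe when several threads may try to drop it. It is not the change made in this PR: Entity, Holder, release_entity and run_concurrent_release are hypothetical names standing in for the metric entity and its owner. The idea is that unsynchronized writes to the same shared_ptr object are a data race, so the pointer is detached under a mutex and the reference is dropped outside the lock.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the metric entity; counts destructor calls.
struct Entity {
    explicit Entity(std::atomic<int>* destroyed) : _destroyed(destroyed) {}
    ~Entity() { _destroyed->fetch_add(1); }
    std::atomic<int>* _destroyed;
};

// Hypothetical stand-in for the object that owns the entity.
class Holder {
public:
    explicit Holder(std::shared_ptr<Entity> e) : _entity(std::move(e)) {}

    // Safe to call from several threads: the member is detached under the
    // lock, so only one caller ever ends up holding the reference.
    void release_entity() {
        std::shared_ptr<Entity> local;
        {
            std::lock_guard<std::mutex> guard(_lock);
            local.swap(_entity); // _entity is now empty
        }
        // `local` goes out of scope here and drops the reference (if any)
        // outside the critical section.
    }

private:
    std::mutex _lock;
    std::shared_ptr<Entity> _entity;
};

// Releases the entity from `num_threads` threads at once and returns how
// many times the Entity destructor ran.
int run_concurrent_release(int num_threads) {
    std::atomic<int> destroyed{0};
    {
        Holder holder(std::make_shared<Entity>(&destroyed));
        std::vector<std::thread> threads;
        for (int i = 0; i < num_threads; ++i) {
            threads.emplace_back([&holder] { holder.release_entity(); });
        }
        for (auto& t : threads) {
            t.join();
        }
    }
    return destroyed.load();
}
```

Calling run_concurrent_release(8) should report exactly one destructor call regardless of thread interleaving. The sketch only covers concurrent access to the member itself; it does not help if another thread still uses the owning object after its destructor has started.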

@Thearas
Contributor

Thearas commented Dec 20, 2024

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@felixwluo
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 40032 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ccb216ba4dcb0a7e19c78799386153ba3a5a7256, data reload: false

------ Round 1 ----------------------------------
q1	17622	7517	7297	7297
q2	2052	180	174	174
q3	10529	1137	1226	1137
q4	10224	764	723	723
q5	7607	2701	2741	2701
q6	239	150	149	149
q7	999	631	605	605
q8	9256	1864	1955	1864
q9	6613	6441	6503	6441
q10	7059	2314	2348	2314
q11	494	251	266	251
q12	418	229	228	228
q13	17789	2903	2966	2903
q14	240	210	213	210
q15	561	511	491	491
q16	652	584	583	583
q17	1014	552	608	552
q18	7296	6801	6700	6700
q19	1344	960	1028	960
q20	467	190	193	190
q21	4026	3334	3246	3246
q22	393	313	317	313
Total cold run time: 106894 ms
Total hot run time: 40032 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7265	7234	7233	7233
q2	333	237	225	225
q3	2939	2791	3123	2791
q4	2176	1894	1842	1842
q5	5544	5679	5623	5623
q6	225	136	140	136
q7	2178	1739	1772	1739
q8	3384	3540	3458	3458
q9	8912	9081	9043	9043
q10	3631	3570	3555	3555
q11	603	498	489	489
q12	821	628	620	620
q13	13931	3071	3067	3067
q14	315	262	279	262
q15	557	511	500	500
q16	696	653	626	626
q17	1860	1604	1574	1574
q18	7852	7351	7384	7351
q19	1684	1422	1656	1422
q20	2034	1799	1834	1799
q21	5445	5285	5393	5285
q22	671	547	602	547
Total cold run time: 73056 ms
Total hot run time: 59187 ms

@doris-robot

TPC-DS: Total hot run time: 190298 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ccb216ba4dcb0a7e19c78799386153ba3a5a7256, data reload: false

query1	989	383	374	374
query2	6524	2561	2404	2404
query3	6708	223	219	219
query4	33645	23528	23359	23359
query5	4428	477	461	461
query6	282	199	207	199
query7	4662	311	322	311
query8	314	242	242	242
query9	9434	2723	2693	2693
query10	446	249	244	244
query11	17968	15129	15407	15129
query12	173	106	111	106
query13	1669	470	438	438
query14	11044	7724	7175	7175
query15	319	182	186	182
query16	8136	460	437	437
query17	2222	593	574	574
query18	2010	329	286	286
query19	363	152	150	150
query20	126	116	109	109
query21	203	109	107	107
query22	4665	4388	4261	4261
query23	34248	33586	33707	33586
query24	11535	2468	2526	2468
query25	675	382	385	382
query26	1811	154	152	152
query27	2925	322	322	322
query28	7977	2416	2429	2416
query29	1044	404	413	404
query30	301	144	154	144
query31	1043	785	815	785
query32	104	56	57	56
query33	775	296	279	279
query34	1021	526	521	521
query35	855	778	742	742
query36	1090	921	955	921
query37	264	80	74	74
query38	4411	4234	3993	3993
query39	1463	1468	1465	1465
query40	280	103	100	100
query41	48	46	46	46
query42	127	101	102	101
query43	543	510	482	482
query44	1201	825	812	812
query45	186	161	162	161
query46	1191	724	707	707
query47	1933	1841	1855	1841
query48	411	324	335	324
query49	1272	379	393	379
query50	810	378	389	378
query51	7337	7097	6880	6880
query52	104	95	89	89
query53	261	185	184	184
query54	1269	417	420	417
query55	82	83	82	82
query56	268	256	238	238
query57	1255	1127	1096	1096
query58	239	228	229	228
query59	3495	3198	2929	2929
query60	276	273	240	240
query61	116	108	140	108
query62	918	686	665	665
query63	220	188	189	188
query64	5024	679	637	637
query65	3262	3168	3227	3168
query66	1376	315	303	303
query67	15927	15619	15574	15574
query68	5721	547	558	547
query69	451	246	251	246
query70	1109	1124	1066	1066
query71	452	258	255	255
query72	6442	4150	4149	4149
query73	781	364	366	364
query74	10236	8938	8904	8904
query75	3459	2649	2640	2640
query76	3577	1126	1100	1100
query77	552	274	281	274
query78	10616	9446	9374	9374
query79	1688	606	598	598
query80	889	433	450	433
query81	536	231	228	228
query82	956	121	118	118
query83	249	148	148	148
query84	237	71	71	71
query85	1138	308	309	308
query86	355	304	300	300
query87	4444	4375	4313	4313
query88	3567	2243	2192	2192
query89	417	286	294	286
query90	2183	190	190	190
query91	146	102	113	102
query92	67	53	62	53
query93	1208	542	545	542
query94	1101	305	260	260
query95	363	251	251	251
query96	680	282	283	282
query97	2892	2675	2678	2675
query98	219	201	197	197
query99	1555	1312	1305	1305
Total cold run time: 305589 ms
Total hot run time: 190298 ms

@doris-robot

ClickBench: Total hot run time: 32.6 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ccb216ba4dcb0a7e19c78799386153ba3a5a7256, data reload: false

query1	0.04	0.04	0.03
query2	0.07	0.04	0.03
query3	0.23	0.07	0.07
query4	1.60	0.10	0.10
query5	0.44	0.41	0.40
query6	1.18	0.66	0.64
query7	0.02	0.02	0.02
query8	0.04	0.03	0.04
query9	0.57	0.56	0.51
query10	0.55	0.59	0.56
query11	0.16	0.11	0.11
query12	0.14	0.11	0.11
query13	0.61	0.61	0.60
query14	2.75	2.79	2.87
query15	0.90	0.82	0.82
query16	0.37	0.38	0.37
query17	1.05	1.01	1.08
query18	0.22	0.20	0.20
query19	1.95	1.79	1.98
query20	0.01	0.02	0.01
query21	15.36	0.59	0.59
query22	3.10	2.60	1.65
query23	17.06	1.02	0.87
query24	3.20	1.09	0.99
query25	0.25	0.12	0.07
query26	0.59	0.15	0.14
query27	0.04	0.04	0.05
query28	10.56	1.10	1.06
query29	12.57	3.24	3.22
query30	0.25	0.06	0.06
query31	2.86	0.38	0.40
query32	3.24	0.47	0.46
query33	3.08	3.26	3.06
query34	16.95	4.53	4.54
query35	4.51	4.51	4.50
query36	0.69	0.48	0.51
query37	0.09	0.06	0.06
query38	0.05	0.03	0.03
query39	0.03	0.02	0.03
query40	0.18	0.13	0.12
query41	0.08	0.02	0.03
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 107.71 s
Total hot run time: 32.6 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.88% (10134/26064)
Line Coverage: 29.80% (85270/286131)
Region Coverage: 28.93% (43546/150506)
Branch Coverage: 25.44% (22171/87158)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ccb216ba4dcb0a7e19c78799386153ba3a5a7256_ccb216ba4dcb0a7e19c78799386153ba3a5a7256/report/index.html

@felixwluo
Contributor Author

run p0
