-
Notifications
You must be signed in to change notification settings - Fork 160
fix(nuid): replace non-thread-safe rand()/srand() with internal LCG #963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This proposal is meant to fix a problem that I had using the library with mTLS where the app crashed because of rand() function that is not thread safe (error below). ? @ 0x0000000000042520
10.~/ClickHouse/base/harmful/harmful.c:295:1: rand @ 0x000000001efaebd8
11. ~/ClickHouse/contrib/nats-io/src/nuid.c:29:14: natsNUID_init @ 0x0000000023a02085
12. ~/ClickHouse/contrib/nats-io/src/glib/glib.c:272:13: nats_openLib @ 0x0000000023a0e3b9
13. ~/ClickHouse/contrib/nats-io/src/nats.c:108:12: nats_Open @ 0x0000000023a01ac8
14. ~/ClickHouse/contrib/nats-io/src/opts.c:1681:9: natsOptions_Create @ 0x0000000023a04153
15. ~/ClickHouse/src/Storages/NATS/NATSHandler.cpp:176:15: DB::NATSHandler::createOptions() @ 0x000000001e850562
16. ~/ClickHouse/src/Storages/NATS/NATSHandler.cpp:141:92: void std::__function::__policy_func<void ()>::__call_func[abi:se210105]<DB::NATSHandler::createConnection(DB::NATSConfiguration const&)::$_0>(std::__function::__policy_storage const*) @ 0x000000001e85092d
17. ~/ClickHouse/contrib/llvm-project/libcxx/include/__functional/function.h:508:12: ? @ 0x000000001e84f995 |
rand() and srand() use global state that is shared across all threads, which can cause duplicate NUIDs or corrupted CMWC state when multiple connections initialize concurrently. Replace with a file-local LCG seeded from mixed entropy (time, PID, thread ID, stack ASLR) for CMWC initialization, and use nats_Rand64() for runtime randomness in conn.c, comsock.c, and srvpool.c.
Extract hardcoded iteration counts into a RAND_TEST_ITER define (1,000,000) for NUID uniqueness and Rand64 range tests.
aff839b to
5613f7f
Compare
|
@Kirskov GitHub actions seem to have a bit of an issue since the tests are queued but not executing. I have canceled twice already. I will run them later today/tomorrow. At first glance, the PR looks good but would want the full test suite to run before approving. Thanks! |
|
@kozlovic Thanks ! Maybe it has to do with my request for review again your requested change. But thanks again for your insight ! EDIT: nevermind I saw the Github status page |
|
@Kirskov I checked GitHub status and there is indeed an incident with GitHub Actions, so I will wait for that to be resolved. |
|
@Kirskov I am afraid that your approach is actually creating a thread safety issue that did not exist before. Note that ClickHouse issue is just that it detected the use of |
|
@Kirskov I guess the data race could have existed before with concurrent calls to nats_Rand64(), but the changes here have introduced the concurrent calls between the reconnect thread and |
|
We may need to move But would need to check a bit more... |
|
I implemented your idea, I had the same actually but I don't know how to test it locally though |
|
@kozlovic I launched the tests with my updated branch: cd ~/nats.c/build && rm -rf * && cmake .. -DNATS_BUILD_STREAMING=OFF && make -j$(nproc) testsuite 2>&1 | tail -5
export NATS_TEST_SERVER_VERSION="nats-server version 2.12.4" && ./bin/testsuite JetStreamPublishAckHandler
Test server version: nats-server version 2.12.4
== JetStreamPublishAckHandler ==
#01 Start JS Server: PASSED
#02 Connect: PASSED
#03 Get context: PASSED
#04 Prepare JS options: PASSED
#05 Get context: PASSED
#06 Add stream: PASSED
#07 Publish async ok: PASSED
#08 Publish async (duplicate): PASSED
#09 Publish async (max msgs): PASSED
#10 Publish async with timeouts: PASSED
#11 Publish async timeout (1): PASSED
#12 Publish async timeout (2): PASSED
#13 Publish async timeout (3): PASSED
#14 Ctx destroy releases timer: PASSED
#15 Check refs: PASSED
ALL PASSED
export NATS_TEST_SERVER_VERSION="nats-server version 2.12.4" && ./bin/testsuite JetStreamPublishAsync 2>&1
Test server version: nats-server version 2.12.4
== JetStreamPublishAsync ==
#01 Start JS Server: PASSED
#02 Connect: PASSED
#03 Get context: PASSED
#04 Create control sub: PASSED
#05 Prepare JS options: PASSED
#06 Get context: PASSED
#07 Stream config init: PASSED
#08 Add stream: PASSED
#09 Publish bad args: PASSED
#10 PublishAsyncComplete bad args: PASSED
#11 PublishAsyncComplete with no pending: PASSED
#12 Publish data: PASSED
#13 Check pub msg no header and reply set: PASSED
#14 Publish msg (bad args): PASSED
#15 Publish msg: PASSED
#16 Check pub msg reply set: PASSED
#17 Check msg ID header set: PASSED
#18 Wait for complete (bad args): PASSED
#19 Wait for complete: PASSED
#20 Send fails due to wrong last ID: PASSED
#21 Check pub msg reply set: PASSED
#22 Check msg ID header not set: PASSED
#23 Check expect last msg ID header set: PASSED
#24 Wait for complete: PASSED
#25 Check cb got proper failure: PASSED
#26 Send new failed message, will be resent in cb: PASSED
#27 Wait complete: PASSED
#28 Send new failed messages which will block cb: PASSED
#29 Check complete timeout: PASSED
#30 Release cb which will destroy context: PASSED
#31 Check that last msg was not delivered to CB: PASSED
#32 Stall wait bad args: PASSED
#33 Recreate context: PASSED
#34 Block CB: PASSED
#35 Send should fail due to stall: PASSED
#36 Pub will stall: PASSED
#37 Wait complete: PASSED
#38 Wait for CB to return: PASSED
#39 Msg needs to be destroyed on failure: PASSED
#40 Msg destroy: PASSED
#41 Connect: PASSED
#42 Create context: PASSED
#43 Publish async no responders: PASSED
#44 Enqueue message with bad subject: PASSED
#45 Publish async cb received non existent pid: PASSED
#46 Produce failed message: PASSED
#47 Wait for msg in CB: PASSED
#48 Destroy context, notify CB: PASSED
#49 Wait for CB to return: PASSED
#50 Reply subject can be set: PASSED
#51 Wait complete: PASSED
#52 Publish async: PASSED
#53 Get pending (bad args): PASSED
#54 Get pending: PASSED
#55 Verify pending list: PASSED
#56 Destroy list leaves msg1 valid: PASSED
#57 Get pending, no msg: PASSED
#58 Publish async: PASSED
#59 Get pending: PASSED
#60 Check that if Msgs set to NULL, no crash: PASSED
#61 Publish timeout: PASSED
ALL PASSEDI just hope the mutex will not impact much performance |
|
By the way I tried the bench test and I saw some average that are negatives : |
Changes:
ASLR stack address for better seed quality
nats_Rand64() (mutex-protected CMWC engine)
collisions, length validation, and buffer overflow handling
variation, modulo range, and bucket distribution