
Conversation

Kirskov commented Feb 1, 2026

Changes:

  • Replace rand()/srand() in nuid.c with static _lcg_next()/_rand32()
  • Add _entropyMix() combining nanosecond time, PID, thread ID, and
    ASLR stack address for better seed quality (see the sketch after this list)
  • Replace rand() calls in conn.c, comsock.c, srvpool.c with
    nats_Rand64() (mutex-protected CMWC engine)
  • Remove srand() calls from srvpool.c and glib.c (no longer needed)
  • Add Windows compatibility via GetCurrentProcessId/GetCurrentThreadId
  • Add test_natsNUID_Uniqueness: 1M consecutive NUIDs checked for
    collisions, length validation, and buffer overflow handling
  • Add test_natsRand64_Range: 1M iterations testing positivity,
    variation, modulo range, and bucket distribution
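
A minimal sketch of the seeding idea (illustrative only: the exact field names, mixing constants, and the use of clock_gettime()/pthread_self() below are assumptions, not necessarily this PR's code):

#include <stdint.h>
#include <time.h>      /* clock_gettime (POSIX) */
#include <unistd.h>    /* getpid */
#include <pthread.h>   /* pthread_self */

/* Mix several weakly correlated sources into a 64-bit seed: nanosecond time,
 * PID, thread ID, and a stack address (which varies with ASLR). */
static uint64_t
_entropyMix(void)
{
    struct timespec ts;
    uint64_t        mix;
    int             stackVar = 0;

    clock_gettime(CLOCK_REALTIME, &ts);
    mix  = ((uint64_t) ts.tv_sec * 1000000000ULL) + (uint64_t) ts.tv_nsec;
    mix ^= ((uint64_t) getpid()) << 32;
    mix ^= (uint64_t) (uintptr_t) pthread_self();   /* integer or pointer on most POSIX systems */
    mix ^= (uint64_t) (uintptr_t) &stackVar;        /* ASLR'd stack address */
    return (mix);
}

/* File-local 64-bit LCG (Knuth's MMIX constants). It only expands the seed
 * into the CMWC state at init time; runtime randomness goes through the
 * mutex-protected nats_Rand64(). */
static uint64_t _lcgState;

static uint32_t
_rand32(void)
{
    _lcgState = (_lcgState * 6364136223846793005ULL) + 1442695040888963407ULL;
    return ((uint32_t) (_lcgState >> 32));
}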

Kirskov (Author) commented Feb 1, 2026

This proposal fixes a problem I hit while using the library with mTLS: the app crashed because rand() is not thread safe (stack trace below).

? @ 0x0000000000042520
10. ~/ClickHouse/base/harmful/harmful.c:295:1: rand @ 0x000000001efaebd8
11. ~/ClickHouse/contrib/nats-io/src/nuid.c:29:14: natsNUID_init @ 0x0000000023a02085
12. ~/ClickHouse/contrib/nats-io/src/glib/glib.c:272:13: nats_openLib @ 0x0000000023a0e3b9
13. ~/ClickHouse/contrib/nats-io/src/nats.c:108:12: nats_Open @ 0x0000000023a01ac8
14. ~/ClickHouse/contrib/nats-io/src/opts.c:1681:9: natsOptions_Create @ 0x0000000023a04153
15. ~/ClickHouse/src/Storages/NATS/NATSHandler.cpp:176:15: DB::NATSHandler::createOptions() @ 0x000000001e850562
16. ~/ClickHouse/src/Storages/NATS/NATSHandler.cpp:141:92: void std::__function::__policy_func<void ()>::__call_func[abi:se210105]<DB::NATSHandler::createConnection(DB::NATSConfiguration const&)::$_0>(std::__function::__policy_storage const*) @ 0x000000001e85092d
17. ~/ClickHouse/contrib/llvm-project/libcxx/include/__functional/function.h:508:12: ? @ 0x000000001e84f995

rand() and srand() use global state that is shared across all threads,
which can cause duplicate NUIDs or corrupted CMWC state when multiple
connections initialize concurrently. Replace with a file-local LCG
seeded from mixed entropy (time, PID, thread ID, stack ASLR) for CMWC
initialization, and use nats_Rand64() for runtime randomness in
conn.c, comsock.c, and srvpool.c.
Extract hardcoded iteration counts into a RAND_TEST_ITER define
(1,000,000) for NUID uniqueness and Rand64 range tests.
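
For the runtime call sites the change is mechanical: route every draw through the mutex-protected generator instead of rand(). A hedged sketch of the pattern (the shuffle loop below is illustrative, not the exact srvpool.c code):

#include <stdint.h>

extern int64_t nats_Rand64(void); /* mutex-protected CMWC generator from nuid.c */

/* Illustrative Fisher-Yates style shuffle of a server pool that previously
 * used rand(). Assumes nats_Rand64() returns a non-negative value; the
 * modulo bias is negligible for small pool sizes. */
static void
_shufflePool(void **servers, int count)
{
    int i;

    for (i = 0; i < count - 1; i++)
    {
        int   j   = i + (int) (nats_Rand64() % (int64_t) (count - i));
        void *tmp = servers[i];

        servers[i] = servers[j];
        servers[j] = tmp;
    }
}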
Kirskov force-pushed the fix/migrate_rand_to_LGC branch from aff839b to 5613f7f on February 2, 2026 20:30
Kirskov requested a review from kozlovic on February 2, 2026 20:35
kozlovic (Member) commented Feb 2, 2026

@Kirskov GitHub Actions seems to have a bit of an issue: the tests are queued but not executing. I have canceled twice already. I will run them later today/tomorrow. At first glance the PR looks good, but I would want the full test suite to run before approving. Thanks!

Kirskov (Author) commented Feb 2, 2026

@kozlovic Thanks! Maybe it has to do with my re-requesting a review after your requested changes.

But thanks again for your insight!

EDIT: never mind, I saw the GitHub status page

kozlovic (Member) commented Feb 2, 2026

@Kirskov I checked GitHub status and there is indeed an incident with GitHub Actions, so I will wait for that to be resolved.

kozlovic (Member) commented Feb 3, 2026

@Kirskov I am afraid that your approach actually creates a thread safety issue that did not exist before. Note that the ClickHouse issue is just that it detected the use of rand(), not necessarily that there was an actual data race. Your implementation, however, does introduce a data race, as shown by the thread sanitizer:

WARNING: ThreadSanitizer: data race (pid=6568)
  Write of size 4 at 0x561fc8dca4f0 by main thread (mutexes: write M0):
    #0 _randCMWC /home/runner/work/nats.c/nats.c/src/nuid.c:92:7 (testsuite+0x21da2e) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #1 _rand64 /home/runner/work/nats.c/nats.c/src/nuid.c:111:20 (testsuite+0x21da2e)
    #2 nats_Rand64 /home/runner/work/nats.c/nats.c/src/nuid.c:124:12 (testsuite+0x21da2e)
    #3 _newAsyncReply /home/runner/work/nats.c/nats.c/src/js.c:823:13 (testsuite+0x1f569d) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #4 _registerPubMsg /home/runner/work/nats.c/nats.c/src/js.c:972:9 (testsuite+0x1f569d)
    #5 js_PublishMsgAsync /home/runner/work/nats.c/nats.c/src/js.c:1051:5 (testsuite+0x1f569d)
    #6 js_PublishAsync /home/runner/work/nats.c/nats.c/src/js.c:1017:5 (testsuite+0x1f5462) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #7 test_JetStreamPublishAckHandler /home/runner/work/nats.c/nats.c/test/test.c:27852:9 (testsuite+0x12d55a) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #8 main /home/runner/work/nats.c/nats.c/test/test.c:42124:9 (testsuite+0x1df6f8) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)

  Previous write of size 4 at 0x561fc8dca4f0 by thread T8 (mutexes: write M1):
    #0 _randCMWC /home/runner/work/nats.c/nats.c/src/nuid.c:92:7 (testsuite+0x21da2e) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #1 _rand64 /home/runner/work/nats.c/nats.c/src/nuid.c:111:20 (testsuite+0x21da2e)
    #2 nats_Rand64 /home/runner/work/nats.c/nats.c/src/nuid.c:124:12 (testsuite+0x21da2e)
    #3 _doReconnect /home/runner/work/nats.c/nats.c/src/conn.c:1653:34 (testsuite+0x1ebd9e) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #4 _threadStart /home/runner/work/nats.c/nats.c/src/unix/thread.c:41:5 (testsuite+0x243737) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)

  Location is global '_randCMWC.i' of size 4 at 0x561fc8dca4f0 (testsuite+0x2f94f0)

  Mutex M0 (0x720c00002250) created at:
    #0 pthread_mutex_init <null> (testsuite+0x6d483) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #1 natsMutex_Create /home/runner/work/nats.c/nats.c/src/unix/mutex.c:41:13 (testsuite+0x2430df) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #2 natsConnection_JetStream /home/runner/work/nats.c/nats.c/src/js.c:303:9 (testsuite+0x1f44a0) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #3 test_JetStreamPublishAckHandler /home/runner/work/nats.c/nats.c/test/test.c:27816:9 (testsuite+0x12d1ac) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #4 main /home/runner/work/nats.c/nats.c/test/test.c:42124:9 (testsuite+0x1df6f8) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)

  Mutex M1 (0x720c00001f20) created at:
    #0 pthread_mutex_init <null> (testsuite+0x6d483) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #1 natsMutex_Create /home/runner/work/nats.c/nats.c/src/unix/mutex.c:41:13 (testsuite+0x2430df) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #2 natsConn_create /home/runner/work/nats.c/nats.c/src/conn.c:3311:9 (testsuite+0x1e8165) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #3 natsConnection_ConnectTo /home/runner/work/nats.c/nats.c/src/conn.c:3464:13 (testsuite+0x1e86c1) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #4 natsConnection_Connect /home/runner/work/nats.c/nats.c/src/conn.c:3365:13 (testsuite+0x1e8427) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #5 test_JetStreamPublishAckHandler /home/runner/work/nats.c/nats.c/test/test.c:27798:5 (testsuite+0x12ce2a) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #6 main /home/runner/work/nats.c/nats.c/test/test.c:42124:9 (testsuite+0x1df6f8) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)

  Thread T8 (tid=6586, running) created by thread T4 at:
    #0 pthread_create <null> (testsuite+0x6bc5f) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #1 natsThread_Create /home/runner/work/nats.c/nats.c/src/unix/thread.c:71:15 (testsuite+0x24366e) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #2 _processOpError /home/runner/work/nats.c/nats.c/src/conn.c:2277:18 (testsuite+0x1e6f40) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #3 _readLoop /home/runner/work/nats.c/nats.c/src/conn.c:2341:13 (testsuite+0x1eed41) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)
    #4 _threadStart /home/runner/work/nats.c/nats.c/src/unix/thread.c:41:5 (testsuite+0x243737) (BuildId: ab4a324cde51f61b207087c3e7ae33376c6d608e)

SUMMARY: ThreadSanitizer: data race /home/runner/work/nats.c/nats.c/src/nuid.c:92:7 in _randCMWC
==================

kozlovic (Member) commented Feb 3, 2026

@Kirskov I guess the data race could have existed before with concurrent calls to nats_Rand64(), but the changes here have introduced the concurrent calls between the reconnect thread and _newAsyncReply...
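
(Illustration of the race being described: the two call paths each hold a different mutex, so nothing serializes the writes to the shared static CMWC state. Compilable sketch only, not the actual nuid.c; the function name is hypothetical.)

#include <stdint.h>

static uint32_t _cmwcIndex; /* corresponds to _randCMWC.i in the TSAN report */

int64_t
nats_Rand64_unprotected(void) /* hypothetical name for the pre-fix behavior */
{
    _cmwcIndex = (_cmwcIndex + 1) & 4095; /* unsynchronized write to static state */
    return ((int64_t) _cmwcIndex);
}

/* Main thread: holds the JS context mutex (M0) in _newAsyncReply, then calls
 * the generator. Reconnect thread: holds the connection mutex (M1) in
 * _doReconnect, then calls the generator. Since M0 != M1, both threads can
 * write _cmwcIndex concurrently, which is exactly the report above. */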

kozlovic (Member) commented Feb 3, 2026

We may need to move nats_Rand64() to the end of the file and change it to:

int64_t
nats_Rand64(void)
{
    int64_t r = 0;

    natsMutex_Lock(globalNUID.mu);
    r = _rand64(0x7FFFFFFFFFFFFFFF);
    natsMutex_Unlock(globalNUID.mu);

    return r;
}

But would need to check a bit more...

Kirskov (Author) commented Feb 3, 2026

I implemented your idea (I actually had the same one), but I don't know how to test it locally.
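
(For local reproduction, one option is a ThreadSanitizer build; this is a generic sketch of the flags, not necessarily how the project's CI configures it:)

cd ~/nats.c/build && rm -rf * && cmake .. -DNATS_BUILD_STREAMING=OFF -DCMAKE_C_FLAGS="-fsanitize=thread -g -O1" && make -j$(nproc) testsuite
./bin/testsuite JetStreamPublishAckHandler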

Kirskov (Author) commented Feb 3, 2026

@kozlovic I launched the tests with my updated branch:

cd ~/nats.c/build && rm -rf * && cmake .. -DNATS_BUILD_STREAMING=OFF && make -j$(nproc) testsuite 2>&1 | tail -5

export NATS_TEST_SERVER_VERSION="nats-server version 2.12.4" && ./bin/testsuite JetStreamPublishAckHandler
Test server version: nats-server version 2.12.4

== JetStreamPublishAckHandler ==
#01 Start JS Server: PASSED
#02 Connect: PASSED
#03 Get context: PASSED
#04 Prepare JS options: PASSED
#05 Get context: PASSED
#06 Add stream: PASSED
#07 Publish async ok: PASSED
#08 Publish async (duplicate): PASSED
#09 Publish async (max msgs): PASSED
#10 Publish async with timeouts: PASSED
#11 Publish async timeout (1): PASSED
#12 Publish async timeout (2): PASSED
#13 Publish async timeout (3): PASSED
#14 Ctx destroy releases timer: PASSED
#15 Check refs: PASSED
ALL PASSED

export NATS_TEST_SERVER_VERSION="nats-server version 2.12.4" && ./bin/testsuite JetStreamPublishAsync 2>&1
Test server version: nats-server version 2.12.4

== JetStreamPublishAsync ==
#01 Start JS Server: PASSED
#02 Connect: PASSED
#03 Get context: PASSED
#04 Create control sub: PASSED
#05 Prepare JS options: PASSED
#06 Get context: PASSED
#07 Stream config init: PASSED
#08 Add stream: PASSED
#09 Publish bad args: PASSED
#10 PublishAsyncComplete bad args: PASSED
#11 PublishAsyncComplete with no pending: PASSED
#12 Publish data: PASSED
#13 Check pub msg no header and reply set: PASSED
#14 Publish msg (bad args): PASSED
#15 Publish msg: PASSED
#16 Check pub msg reply set: PASSED
#17 Check msg ID header set: PASSED
#18 Wait for complete (bad args): PASSED
#19 Wait for complete: PASSED
#20 Send fails due to wrong last ID: PASSED
#21 Check pub msg reply set: PASSED
#22 Check msg ID header not set: PASSED
#23 Check expect last msg ID header set: PASSED
#24 Wait for complete: PASSED
#25 Check cb got proper failure: PASSED
#26 Send new failed message, will be resent in cb: PASSED
#27 Wait complete: PASSED
#28 Send new failed messages which will block cb: PASSED
#29 Check complete timeout: PASSED
#30 Release cb which will destroy context: PASSED
#31 Check that last msg was not delivered to CB: PASSED
#32 Stall wait bad args: PASSED
#33 Recreate context: PASSED
#34 Block CB: PASSED
#35 Send should fail due to stall: PASSED
#36 Pub will stall: PASSED
#37 Wait complete: PASSED
#38 Wait for CB to return: PASSED
#39 Msg needs to be destroyed on failure: PASSED
#40 Msg destroy: PASSED
#41 Connect: PASSED
#42 Create context: PASSED
#43 Publish async no responders: PASSED
#44 Enqueue message with bad subject: PASSED
#45 Publish async cb received non existent pid: PASSED
#46 Produce failed message: PASSED
#47 Wait for msg in CB: PASSED
#48 Destroy context, notify CB: PASSED
#49 Wait for CB to return: PASSED
#50 Reply subject can be set: PASSED
#51 Wait complete: PASSED
#52 Publish async: PASSED
#53 Get pending (bad args): PASSED
#54 Get pending: PASSED
#55 Verify pending list: PASSED
#56 Destroy list leaves msg1 valid: PASSED
#57 Get pending, no msg: PASSED
#58 Publish async: PASSED
#59 Get pending: PASSED
#60 Check that if Msgs set to NULL, no crash: PASSED
#61 Publish timeout: PASSED
ALL PASSED

I just hope the mutex will not impact performance too much.

Kirskov (Author) commented Feb 3, 2026

By the way, I ran the bench test and saw some averages that are negative:

        {"subs":1, "threads":0, "messages":100000, "best":35, "average":36, "worst":39},
        {"subs":1, "threads":5, "messages":100000, "best":-2441, "average":-455, "worst":42},
        {"subs":2, "threads":0, "messages":100000, "best":39, "average":41, "worst":46},
        {"subs":2, "threads":5, "messages":100000, "best":40, "average":40, "worst":42},
        {"subs":23, "threads":0, "messages":99981, "best":107, "average":188, "worst":250},
        {"subs":23, "threads":5, "messages":99981, "best":57, "average":59, "worst":61},
        {"subs":23, "threads":11, "messages":99981, "best":71, "average":81, "worst":103},
        {"subs":23, "threads":23, "messages":99981, "best":159, "average":251, "worst":341},
        {"subs":23, "threads":47, "messages":99981, "best":123, "average":210, "worst":340},
        {"subs":47, "threads":0, "messages":99969, "best":165, "average":300, "worst":349},
        {"subs":47, "threads":5, "messages":99969, "best":68, "average":70, "worst":78},
        {"subs":47, "threads":11, "messages":99969, "best":72, "average":101, "worst":121},
        {"subs":47, "threads":23, "messages":99969, "best":153, "average":240, "worst":315},
        {"subs":47, "threads":47, "messages":99969, "best":148, "average":283, "worst":339},
        {"subs":47, "threads":91, "messages":99969, "best":237, "average":311, "worst":357},
        {"subs":81, "threads":0, "messages":99954, "best":275, "average":339, "worst":368},
        {"subs":81, "threads":5, "messages":99954, "best":79, "average":80, "worst":84},
        {"subs":81, "threads":11, "messages":99954, "best":120, "average":148, "worst":164},
        {"subs":81, "threads":23, "messages":99954, "best":118, "average":240, "worst":313},
        {"subs":81, "threads":47, "messages":99954, "best":332, "average":340, "worst":349},
        {"subs":81, "threads":91, "messages":99954, "best":334, "average":352, "worst":359},
        {"subs":120, "threads":0, "messages":99960, "best":344, "average":368, "worst":395},
        {"subs":120, "threads":5, "messages":99960, "best":99, "average":102, "worst":107},
        {"subs":120, "threads":11, "messages":99960, "best":179, "average":198, "worst":211},
        {"subs":120, "threads":23, "messages":99960, "best":288, "average":320, "worst":349},
        {"subs":120, "threads":47, "messages":99960, "best":325, "average":347, "worst":379},
        {"subs":120, "threads":91, "messages":99960, "best":357, "average":384, "worst":405}
