Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle limited open Connections due to keepalive connections #9916

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

sgothel
Copy link

@sgothel sgothel commented Aug 28, 2024

Summary

(This commit is a WIP, pushed for discussion and to be amended!)

  • All network limits and timeouts are configurable now, see COOLWSD

    • { "net.ws.ping.timeout", "2000000" }, // WebSocketHandler ping timeout in us (2s). Zero disables metric.
    • { "net.ws.ping.period", "3000000" }, // WebSocketHandler ping period in us (3s), i.e. duration until next ping.
    • { "net.http.timeout", "30000000" }, // http::Session timeout in us (30s). Zero disables metric.
    • { "net.maxconnections", "100000" }, // Socket maximum connections (100000). Zero disables metric.
    • { "net.maxduration", "43200" }, // Socket maximum duration in seconds (12h). Zero disables metric.
    • { "net.minbps", "0" }, // Socket minimum bits per seconds throughput (0). Increase for debugging. Zero disables metric.
    • { "net.socketpoll.timeout", "64000000" }, // SocketPoll timeout in us (64s).
      and net::Defaults in NetUtil.hpp
  • ProtocolHandlerInterface::checkTimeout(..)
    is now called via StreamSocket::checkRemoval()
    to ensure tests even w/o input data (was: handlePoll())

  • ProtocolHandlerInterface::checkTimeout(..)

    • Add bool return value: true -> shutdown connection, caller shall stop processing
    • Implemented for http::Session
      • Timeout (30s) net::Defaults::HTTPTimeout with missing response
    • Implemented for WebSocketHandler
      • Timeout (2s = net::Defaults::WSPingTimeout)
        after missing WS native frame ping/pong (server only)
  • StreamSocket -> Socket (properties moved)

    • bytes sent/received
    • closed state
  • Socket (added properties)

    • creation- and last-seen -time
    • socket type and port
    • checkRemoval(..)
      • called directly from SocketPoll::poll()
      • only for IPv4/v6 network connections
      • similar to ProtocolHandlerInterface::checkTimeout(..)
      • added further criteria (age, throughput, ..)
        • Timeout (64s = net::Defaults::SocketPollTimeout)
          if (now - lastSeen) > timeout
        • Timeout (12 hours) net::Defaults::MaxDuration
          if (now - creationTime) > timeout
        • TODO: Throughput/bandwitdh disabled, find proper metrics/timing
        • TODO: Add maximimal IPv4/IPv6 socket-count criteria, drop oldest.
  • SocketPoll::poll()

    • Additionally erases if !socket->isOpen() || socket->checkRemoval()

TODO

  • Add maximimal IPv4/IPv6 socket-count criteria, drop oldest.
  • More elaborated tests
    • WebSocket
    • ..

Checklist

  • I have run make prettier-write and formatted the code.
  • All commits have Change-Id
  • I have run tests with make check
  • Passed certain unit tests and cypress desktop annotations.js
  • I have issued make run and manually verified that everything looks okay
  • Documentation (manuals or wiki) has been updated or is not required

@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch 3 times, most recently from a2c3068 to 3ef99c5 Compare August 28, 2024 00:58
@sgothel sgothel self-assigned this Aug 28, 2024
@sgothel
Copy link
Author

sgothel commented Aug 28, 2024

This change set created a linkage error (android), the 'usual' type_info, vtable thing.
Can't figure a missed 'virtual' or even pure virtual '=0' virtual key in the code, also reviewed the MOBILEAPP macro.
Hence I need to reproduce and build the android locally here .. WIP.

net/Socket.cpp Outdated
@@ -1298,6 +1385,52 @@ LocalServerSocket::~LocalServerSocket()
# define LOG_CHUNK(X)
#endif

bool StreamSocket::checkForcedRemoval(std::chrono::steady_clock::time_point now) noexcept
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to move this impl to after

"#endif // !MOBILEAPP" (if mobile needs it) and/or put ifdef MOBILEAPP around it in the header if it doesn't or a !MOBILE stub, something like that anyway

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. Great eyes. Thx!

Copy link
Contributor

@Ashod Ashod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @sgothel. There are some really good stuff here!

I understand this is a draft, but it reads like a mix of random fixes and improvements (some of it is good and useful, like adding const and = default, make_shared, etc.), re-inventing methods for doing things already supported, and debugging/tracing support for your understanding (to be removed).

It's very hard to read, as there is no clear theme or purpose.

I left various comments, but I didn't dive too deep in a number of areas because it was unclear why we are either replacing an existing approach, or inventing a new one when we can improve on the existing one (where there are shortcomings).

At a minimum, I'd separate the backtrace improvements from the socket improvements, and likely the misc. improvements from new code/features/tests.

Finally, I'm troubled by vformatString and the re-working a lot of the socket states and internals with a huge set of new members that change the existing semantics with unclear benefits.

Perhaps there is some context I'm missing. Perhaps this was agreed on and I'm unaware.

net/HttpRequest.hpp Outdated Show resolved Hide resolved
net/HttpRequest.hpp Show resolved Hide resolved
net/HttpRequest.hpp Show resolved Hide resolved
net/ServerSocket.hpp Outdated Show resolved Hide resolved
net/Socket.cpp Outdated Show resolved Hide resolved
test/UnitSocketLifecycle.cpp Outdated Show resolved Hide resolved
test/UnitSocketLifecycle.cpp Outdated Show resolved Hide resolved
test/WebSocketSession.hpp Show resolved Hide resolved
common/Util.cpp Outdated Show resolved Hide resolved
net/Socket.cpp Outdated Show resolved Hide resolved
@sgothel
Copy link
Author

sgothel commented Aug 28, 2024

Thank you @Ashod for the review.

I understand this is a draft, but it reads like a mix of random fixes and improvements (some of it is good and useful, like adding const and = default, make_shared, etc.), re-inventing methods for doing things already supported, and debugging/tracing support for your understanding (to be removed).

The draft status was motivated by my intention to discuss the actual semantics and goals,
i.e. mitigate limited connections - or better, ensure connections are closed as well as
to reduce the risk of running out of sockets/file-handles.

Which brings us to the next point I guess.

It's very hard to read, as there is no clear theme or purpose.

When reading commit by commit, I believe everything is well split apart semantically.
I believed the purpose was well described in #9833 and the commit 3ef99c5
actually starting to address (matching this drafts' description) was meaningful.

Indeed, I added a few unrelated single commits into this review. Understanding that we allow
reasonable small changes w/o issue report, esp of intrinsic nature, I believe it should be OK.
(admin vs workflow compromise as discussed earlier)

I left various comments, but I didn't dive too deep in a number of areas because it was unclear why we are either replacing an existing approach, or inventing a new one when we can improve on the existing one (where there are shortcomings).

I am not aware of really replacing anything. I merely moved and added functionality.

At a minimum, I'd separate the backtrace improvements from the socket improvements, and likely the misc. improvements from new code/features/tests.

As described above, a zero side-effect change in its own commit.

The Backtrace object in particular is useful for all, as we can produce an instance
capturing the current stack, move it around and output where desired.
As discussed, I did not add addr2line or libunwind dependencies, which I understand ofc.

Finally, I'm troubled by vformatString and the re-working a lot of the socket states and internals with a huge set of new members that change the existing semantics with unclear benefits.

I just like printf formatting as it is more performant and flexible (at least to me).
IMHO I use it in a most reasonable save fashion, while being supported by the compiler to
remove the risk of wrong format specifiers and stack values. The latter is still a weakness, which I am aware of.

Perhaps there is some context I'm missing. Perhaps this was agreed on and I'm unaware.

Allow me to continue answering your other remarks below.

@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 2d51475 to 508ccb0 Compare August 28, 2024 19:01
@caolanm
Copy link
Contributor

caolanm commented Aug 29, 2024

I think it'll be easier to manage this if its split up so that the mostly uncontroversial misc cleanups of existing stuff where no change in logic is intended, the make_unique, make_shared, privator ctor -> = delete, constifying, expansion of auto& to real type, emplace_back, etc goes into a separate pr.

The backtrace debugging aid looks probably useful to have to me, but can also be its own pr I think (and probably not worth fighting the "CodeQL scanning - action / Analyze" step which apparently can't handle the constexpr use which is blocking ci from passing this pr).

And generally trim this pr down to the core change to make it easier to think about it

@Ashod
Copy link
Contributor

Ashod commented Aug 30, 2024

Thanks for the responses, @sgothel! We've discussed some offline, so didn't reply here (e.g. printf-style formatting being fast--in fact it's very slow compared to only serialization).

Indeed, I added a few unrelated single commits into this review. Understanding that we allow reasonable small changes w/o issue report, esp of intrinsic nature, I believe it should be OK. (admin vs workflow compromise as discussed earlier)

In general, agreed. But I'll echo @caolanm's comment here that breaking up into PRs, that can be decided to merge independent of other parts, is always helpful.

I should also point out that this PR is well above a 1000 lines. This is a rare size. Considering that some parts change the interface of a large hierarchy (the sockets), which is a fundamental core part of the whole project and very delicate (can easily break things catastrophically), it requires reading and studying far more code (i.e. the context and consequences) to review properly. An isolated and low-risk change of a few 100 lines is fine. A complex few 100 line change (such as in the sockets) is challenging and time consuming. Sometimes we have no choice because we can't break it up. But a 1000 lines that can be easily broken up is hardly necessary.

Finally, even though we can review individual commits, it's preferable to have a single theme to a PR. Helps the reviewer. In some cases a few misc. changes can go together (comment fixes, cosmetics, formatting, logging, etc.). But mixing functional changes with non-functional changes dilutes the process and can reduce the quality of the review. More so with high-risk changes (i.e. socket code) mixed with debugging code (i.e. backtracing).

Regardless, you have a lot of great improvements here, and I don't want to undermine the importance of your contribution!
Thank you.

@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 508ccb0 to cb801db Compare September 2, 2024 03:41
@sgothel
Copy link
Author

sgothel commented Sep 2, 2024

Thank you both for your review and kind words.

I have split up single commits into 4 branches, i.e. this pull-request branch and

For this feature work, I need to complete the unit test incl. the timeout configuration.
Looks OK .. shall provide it in 1-2 days.

net/HttpRequest.hpp Fixed Show fixed Hide fixed
{
_socketHandler->onDisconnect();
if( isOpen() ) {
// FIXME: Ensure proper semantics of onDisconnect()

Check notice

Code scanning / CodeQL

FIXME comment Note

FIXME comment: Ensure proper semantics of onDisconnect()
net/Socket.hpp Fixed Show fixed Hide fixed
net/WebSocketHandler.hpp Dismissed Show dismissed Hide dismissed
void sendPong(std::chrono::steady_clock::time_point now,
const char* data, const size_t len,
const std::shared_ptr<StreamSocket>& socket)
{
if (!_isClient)
LOG_WRN("Servers should not send pongs, only clients");
// assert(!_isClient);

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
net/WebSocketHandler.hpp Fixed Show fixed Hide fixed
@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from cb801db to 6fc374e Compare September 6, 2024 11:28
net/Socket.cpp Fixed Show fixed Hide fixed
net/Socket.hpp Fixed Show fixed Hide fixed
net/Socket.hpp Fixed Show fixed Hide fixed
net/Socket.hpp Fixed Show fixed Hide fixed
@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 6fc374e to 31c1b94 Compare September 6, 2024 11:51
net/NetUtil.hpp Outdated Show resolved Hide resolved
@Ashod
Copy link
Contributor

Ashod commented Sep 6, 2024

@sgothel, there were multiple suggestions to split this large PR up to reduce its size. That was when it had 18 commits and was "only" a 1000 lines. Now it's past 1500 lines and is 1 commit! I'm sensing this isn't going in the direction of the feedback.

What's the plan, please?

test/UnitTimeoutNone.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@mmeeks mmeeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said needs to be split up; lets try to get the less controversial pieces in and merged as separate pull requests - to slim this down =) Can you work on that Sven ?

wsd/COOLWSD.cpp Show resolved Hide resolved
@sgothel
Copy link
Author

sgothel commented Sep 6, 2024

As said needs to be split up; lets try to get the less controversial pieces in and merged as separate pull requests - to slim this down =) Can you work on that Sven ?

will do the split later then (tomorrow) + finishing the tests

@sgothel
Copy link
Author

sgothel commented Sep 6, 2024

@sgothel, there were multiple suggestions to split this large PR up to reduce its size. That was when it had 18 commits and was "only" a 1000 lines. Now it's past 1500 lines and is 1 commit! I'm sensing this isn't going in the direction of the feedback.

What's the plan, please?

As I wrote, I will split out the config and TimeAverage (~400 lines or so) and put the remainder at the bottom -> 3 commits within this pull request. Perhaps more, w/ a fresh look.

But splitting this up in multiple pull request (PR) makes would probably makes no sense, as these pieces are required and each PR must be tested. IMHO.

@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from d6d5711 to 661dcde Compare September 9, 2024 09:07
…own and queries. SocketPoll::poll robustness

Preparation for cool#9833.

Socket Restructuring:
- Unify closing, query, stats(bytes, time-points)
- Add type, port info.
- Add std::string queries/toString

SocketPoll::poll robustness
- Ensure erasing Socket if !isOpen()

WebSocketHandler::shutdown*
- Add shutdownSilent(), preparation to force shutdown on timeout
  w/o closing frame.

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
Copy link
Contributor

@mmeeks mmeeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - some good things here; but Socket* is the foundation of our I/O work.
As such - expect lots of scrutiny and review here, far more than elsewhere.

Again I would like to separate the cosmetic re-working of things from functional changes.

I'd like each functional change to be its own separate thing. Eg. silent-shutdown - great feature,non-controversial, we need it - pull it out into its own commit. Lets get that in early.

Pushing down the bytesSent etc. statistics from StreamSocket -> Socket - interesting, not sure of the rationale - perhaps harmless, but would like to know why - where does this simplify something else ?

And of course, other things - not so convinced about ;-)

Thanks for persisting and iterating! - we will get there.

net/Socket.cpp Show resolved Hide resolved
net/Socket.cpp Show resolved Hide resolved
net/Socket.cpp Outdated Show resolved Hide resolved
net/Socket.cpp Show resolved Hide resolved
net/Socket.cpp Outdated Show resolved Hide resolved
{
if (type != Type::Unix)
if (_type != Type::Unix)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again changing this in lots of places is not my favourite change =)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type became a private field (underscore) .. and I wanted to clarify access (see above).
The overall work was to clean up socket, side effects etc .. while reading and understanding the code.
Mixed field access is very often a PIA .. and nowadays our C++ "constexpr" getter or queries are validated and inlined, see e.g. isIPType(). Perhaps a matter of taster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of the extra braces burning vertical space =) nothing more.

net/Socket.hpp Outdated Show resolved Hide resolved
const int _fd;
/// True if this socket is open.
bool _open;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to care about size here. Can we use bit-fields to pack _type, _clientPort, and _open into the same space ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often use bitfields when having considerable flags indeed.
However, each socket is acompanied by much larger buffers (1-8k) and I doubt 3 x sizeof(int) .. but well, keeping it in mind that its OK for you. An artifical encoding of different types (int + bool) is however .. overkill?
Sure, if we would talk about an object handle, function-ptr or any short lived type in quantities of >= 100k w/o any other side-effects or memory costs otherwise (sure) .. but I don't see it here.
Was thinking about the added memory costs of Socket with added statistics, correct. But as described above .. I doubt this is material here. I could be wrong ofc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not lift _closed from StreamSocket to its parent, here?

At a minimum we should review StreamSocket::_closed, now that we have _open. Otherwise, horrors when these two go out of sync and the left hand fights the right.

Of course it's also good when adding such common flags to review how we dealt with them for so long, to avoid redundancy and unify the design. Perhaps _open was enough and this is redundant. Perhaps not. Worth sharing your analysis either way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StreamSocket::_closed has been moved to Socket::_open

net/WebSocketHandler.hpp Show resolved Hide resolved
net/Socket.cpp Outdated Show resolved Hide resolved
net/Socket.hpp Outdated Show resolved Hide resolved
net/Socket.hpp Show resolved Hide resolved
…s with efficient types

Also have their `toString` string literal conversion method return `const char*`.

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 661dcde to 17a7f5c Compare September 11, 2024 13:57
@sgothel
Copy link
Author

sgothel commented Sep 11, 2024

@Ashod and @mmeeks .. I have cleaned up the commits and followed most if not all of your ideas/recommendations.

For this .. I have rearranged the single commits, so that the first ones merely cleanup/restructure Socket etc w/o features.
Indeed, they are all compile clean and should be bisectable.

Then I completed the feature and unit tests, also based on latest discussion (max-connection -> keep old-connections alive but block new ones as already implemented via MAX_CONNECTIONS). So this would be a double-safety-net.

Since I am quite happy now, I move this to 'review'.

@sgothel sgothel marked this pull request as ready for review September 11, 2024 14:02
@sgothel
Copy link
Author

sgothel commented Sep 11, 2024

.. oops, scan detected a merge conflict of mine .. one sec.

@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 17a7f5c to 3863949 Compare September 11, 2024 14:15
Sven Göthel added 6 commits September 11, 2024 16:31
…uns only for short constexpr getter. Socket::notifyBytes*() -> protected

This change has been condensed based on our code review, same as with part-2.

get<Noun>() might imply more processing, hence using plain <noun>()
as a non-processing get accessor.

The accessor are helpful to split read/write access as writing
for certain fields is only allowed for derived classes.

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
…eout and maxConnections)

Preparation for cool#9833.

Using runtime configuration for network properties allows us to
test the socket limitation code as well as to fine-tune certain properties for clients.

Costs are removal of constexpr properties, using local (cached immutable) variables,
hence rendering the code less optimized and more pessimistic.

However, the singleton instance is fetched at object creation w/o cache hit using
a static-local, networking timing outweights memory reads and no hot-spot
performance heavy loop is affected.
Hence rendering the feature testable at runtime shall outweight the lesser costs.

- All network limits and timeouts are configurable now,
  - see net::Config in NetUtil.hpp
  - see COOLWSD
    { "net.ws.ping.timeout", "2000000" }, // WebSocketHandler ping timeout in us (2s). Zero disables metric.
    { "net.ws.ping.period", "3000000" }, // WebSocketHandler ping period in us (3s), i.e. duration until next ping.
    { "net.http.timeout", "30000000" }, // http::Session timeout in us (30s). Zero disables metric.
    { "net.maxconnections", "100000" }, // Socket maximum connections (100000). Zero disables metric.
    { "net.maxduration", "43200" }, // Socket maximum duration in seconds (12h). Zero disables metric.
    { "net.minbps", "0" }, // Socket minimum bits per seconds throughput (0). Increase for debugging. Zero disables metric.
    { "net.socketpoll.timeout", "64000000" }, // SocketPoll timeout in us (64s).

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
Preparation for cool#9833, to be used to provide a time-based
ping response time average.

TimeAverage allows us to calculate the time based average of a value
while adding new time-point/value tuples at arbitrary time-points.

The result is stable against small fluctuations
and shall provide a reasonable actionable criteria.

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
…oint as creationTime to ctor

This change has been condensed based on our code review, same as with parts [2-3].

This could reduce costs involved via std::chrono::steady_clock::now(),
but also allows socket creation using a dedicated time-point.

Further moved `StreamSocket::_lastSeenHTTPHeader` to
`std::chrono::steady_clock::time_point &lastHTTPHeader`
where it is being required - reduces general footprint.

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
- ProtocolHandlerInterface::checkTimeout(..)
  is now called via StreamSocket::checkRemoval()
  to ensure tests even w/o input data (was: handlePoll())

- ProtocolHandlerInterface::checkTimeout(..)
  - Add bool return value: true -> shutdown connection, caller shall stop processing
  - Implemented for http::Session
    - Timeout (30s) net::Config::HTTPTimeout with missing response
  - Implemented for WebSocketHandler
    - Timeout (2s = net::Config::WSPingTimeout)
      after missing WS native frame ping/pong (server only)

- StreamSocket::checkRemoval(..)
  - called directly from SocketPoll::poll()
  - only for IPv4/v6 network connections
  - similar to ProtocolHandlerInterface::checkTimeout(..)
  - calls ProtocolHandlerInterface::checkTimeout(..)
  - added further criteria (age, throughput, ..)
    - Timeout (64s = net::Config::SocketPollTimeout)
      if (now - lastSeen) > timeout
    - Timeout (12 hours) net::Config::MaxDuration
      if (now - creationTime) > timeout
    - TODO: Throughput/bandwitdh disabled, find proper metrics/timing
    - TODO: Add maximimal IPv4/IPv6 socket-count criteria, drop oldest.

- SocketPoll::poll()
  - Additionally erases if !socket->isOpen() || socket->checkRemoval()

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
Adding connection limit where registered via SocketPoll::setLimiter(),
i.e. DocumentBrokerPoll and WebServerPoll only.

SocketPoll::poll() will drop _new_ overhead connections exceeding MaxConnections.
This has been discussed, in favor of dropping the oldest connections.

Aligned net::Config::MaxConnections w/ pre-existing
MAX_CONNECTIONS and COOLWSD::MaxConnections.

COOLWSD / net::Config
- Aligned MAX_CONNECTIONS/COOLWSD::MaxConnections with
  config "net.maxconnections", net::Config::MaxConnections,
  having a minimum of 3 - defaults to 9999.

SocketPoll::setLimiter():
- Increments given non-zero connectionLimit by one
  for WS upgrade socket.

Added http::StatusCode::None(0), allowing to use for debugging

Unit Tests:
- Using base class UnitTimeoutBase for UnitTimeoutConnections + UnitTimeoutNone,
  - Testing http, ws-ping and wsd-chat-ping on multiple parallel connections
    w/ and w/o a connection limit

- UnitTimeoutSocket tests socket max-duration on http + WS sessions

- UnitTimeoutWSPing tests WS Ping (native frame) timeout limit on WS sessions

Signed-off-by: Sven Göthel <[email protected]>
Change-Id: I7e1a9329e0848c40a210f6250e29e26950da6fbc
@sgothel sgothel force-pushed the private/sgothel/cool_9833_connection_limit branch from 3863949 to b8ec82a Compare September 11, 2024 14:37
net/DelaySocket.cpp Dismissed Show dismissed Hide dismissed
@sgothel
Copy link
Author

sgothel commented Sep 11, 2024

.. oops, scan detected a merge conflict of mine .. one sec.

Re-arranged sneaking in 2 missing commits in the chain.

@mmeeks
Copy link
Contributor

mmeeks commented Sep 11, 2024

Since I am quite happy now, I move this to 'review'.

Thanks Sven - sounds very promising =) good stuff.

@mmeeks
Copy link
Contributor

mmeeks commented Sep 11, 2024

I'm confused. I still see some patches eg. this one:
e5d6dc5
that has a combination of totally safe, new debugging methods, as well as much the more risky 'isClosed' change, as well as all this renaming and accessor changes - which need splitting in isolation to be reviewed carefully,
or am I mis-reading the github UI - TBH I find the UI confusing; perhaps those are old patch fragments from earlier commits or something.
Did you push the split ? and/or am I just not seeing it ?

@sgothel
Copy link
Author

sgothel commented Sep 11, 2024

I'm confused. I still see some patches eg. this one: e5d6dc5 that has a combination of totally safe, new debugging methods, as well as much the more risky 'isClosed' change, as well as all this renaming and accessor changes - which need splitting in isolation to be reviewed carefully, or am I mis-reading the github UI - TBH I find the UI confusing; perhaps those are old patch fragments from earlier commits or something. Did you push the split ? and/or am I just not seeing it ?

I didn't remove this one (Socket and WebSocketHandler) as visible by the date + hash.
I reorganized on top of it to separate feature from restructuring code, thought this was the intention
and I do not see anything risky here (flipped a boolean close -> open, OK).

The next parts 2, 3 address single issues and also fix the std::string 'abuse' for osstream,
scoped enums, shorter accessors etc.

All of these are orthogonal to the feature - so could be seen as bisectable ex-feature.

A later Socket change for the passing of time-point, then the 2 feature commits.

Perhaps not all 100% to your preference in regards to sliced down commits ..
but feature and restructuring is.

If you insist on re-splitting it (i.e. this 1st commit perhaps), advise .. would do it tomorrow then
incl. clean-build tests at least (as done here).

(edit: re this web-UI, yes esp when commenting etc .. I usually use git commandline, git-cola, gitk and qgit .. also when reviewing for the overview)

@mmeeks
Copy link
Contributor

mmeeks commented Sep 11, 2024

Well - CI passes which is a good sign; I'm really not happy at having cosmetic cleanups merged with drastic functional change - today you can see where the functional change is amid the big chunk of non-functional change: how about in two years ? =) We can get this in, but not again for socket code.

My only concern now is what happens during save-on shutdown; I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?) - and if that is the last ClientSession that has been hard disconnected from, and we have un-saved data - we need to keep that ClientSession alive until it is saved.

It struck me that this might well be why we have shutdown but not deleted Socket's in the SocketPoll - or did I miss something there?

@Ashod can you do a last quick review for that use-case, and double-check we have a unit test that covers it effectively ? =)

Thanks!

@Ashod
Copy link
Contributor

Ashod commented Sep 12, 2024

Hi @mmeeks,

My only concern now is what happens during save-on shutdown;

It's complicated and one of the most sensitive parts of the code, with a lot of history.

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

Sort of.

SocketPoll does in fact control when we find out that the socket is disconnected (ClientSession::onDisconnect()), which lets DocBroker know to start cleaning up (DocumentBroker::removeSession()).

and if that is the last ClientSession that has been hard disconnected from, and we have un-saved data - we need to keep that ClientSession alive until it is saved.

Indeed, as you expect, that's what we do.

We don't blindly remove the ClientSession's reference from DocBroker. Instead, we intentionally keep the last reference to ClientSession in DocBroker and start the unload process, if we need to save. We keep this last reference until we successfully save and upload, or we fail and give up.

I have really tried to understand if this PR could affect this in any way, but I can't with any confidence say I found out the answer. On the face of it, nothing stood out. But the sockets are such a foundational part of everything we do, that it's hard to say with a straight face there are no side-effects to worry about.

It struck me that this might well be why we have shutdown but not deleted Socket's in the SocketPoll - or did I miss something there?

I can't tell. So much has changed, from socket lifetime management, to the socket poll, to refactorings of socket open/closed state, to the timeout management, that I cannot estimate the impact of all of this. (My worry is that we've mangled things badly enough that we either randomly disconnect sockets, or we don't properly handle edge cases that many other functions depend on.)

But I do know that we had far more benign code than this, that had harbored great ills and nasty surprises.

@Ashod can you do a last quick review for that use-case, and double-check we have a unit test that covers it effectively ? =)

We do have unit-tests for the last-editor disconnecting with modifications to save. Unfortunately, the unit-tests related to save and upload have become very unstable of late, with very high rate of random failure. So that's no consolation.

P.S. This 5th pass was anything but quick. Considering there is ~1000 new lines changed since my last review, at ~2400 lines, I'm surprised it only took me 3 hours. (Actually, no, I'm not surprised; I had to stop.)

Copy link
Contributor

@Ashod Ashod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @sgothel for not making this any larger.

I left some concerns and thoughts. I refrained from repeating myself. My questions aren't rhetorical. Some decisions in this code were discussed, and I had made various suggestions, but they were ignored.

But perhaps there has been discussions and decisions that I wasn't part of.

At any rate, I leave my comments for your consideration.

Thank you.

EofFlushWrites, // finish up writes and close
Closed };

static const char* toString(State s) noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't using STATE_ENUM be better?

Let's reuse and improve what we have.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought, but then didn't want to change even more code. So yes, we could reuse our macro .. even if it would render the enum underling bigger (I used uint8_t .. the macro uses int in general, no biggy).

net/Socket.cpp Outdated
std::string Socket::toStringImpl() const noexcept
{
std::string s("Socket[#");
s.append(std::to_string(getFD()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using to_string() is a good hint that you want ostringstream instead.

This code is simply inefficient. It's also not very readable. So what's the advantage?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has been revised earlier, is now std::ostream& Socket::streamImpl(std::ostream& os) const

net/Socket.cpp Outdated

std::string Socket::toString() const noexcept
{
return toStringImpl();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have the contents of toStringImpl() here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1st .. the code has changed to offer the ostream Socket::stream() version.

The reason for the private implementation variant is that the Socket::stream() is virtual
to allow speciation to give more detailed information.
Then I use the private Socket implementation only for ctor + dtor, as they should not call into virtual methods.

net/Socket.cpp Show resolved Hide resolved
net/Socket.cpp Outdated
return "HTTP";
}

std::string StreamSocket::toString() const noexcept
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this noexcept? It certainly isn't because it doesn't throw, nor because it is fatal if it throws. It seems unnecessarily strict to take the whole server down simply because this raises an exception.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been revised to std::ostream& StreamSocket::stream(std::ostream& os) const in my latest change set.
Pls note that I also have removed noexcept as of late for more complex methods.

if( added <= std::numeric_limits<size_t>::max() - res ) {
res += added;
} else {
// overflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun fact: If we create 4 billion sockets per second, even after 136 years, we would still not overflow size_t.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, this was my paranoia code .. instead of forcing w/ a static_cast.

LOG_WRN("SocketPoll::ConnectionCount: Underflow " << res << " - " << removed);
res = 0;
}
StatsConnectionCount.store(res, std::memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the proper way of doing this.

The atomic must be used for incrementing and decrementing.

Using intermediate variables and adding and subtracting and then storing is a certain way of writing thread unsafe code.

Either we don't need the atomic, and this function is fine, or we do need the atomic, and this function is broken. Either way, this code is wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing happens in a locked state, no data-race.
Reading is allowed to see older values (just in case), so using a relaxed atomic.
This is an IMHO nice pattern to still allow lock-free decisions while ensuring correct modification.
In this use case max-connections IMHO warranted.
The actual lock-free read to make a decision is in the poll() where we may drop newSockets.

If this lazy decision is deemed too risky, OK - then we have to hard lock it in poll for dropping newSockets as well (not for simple logging of course).

In case the whole max-connection handling is redundant here (see above re MAX_CONNECTIONS), great - then we shall drop this altogether.

std::chrono::steady_clock::time_point _lastPingSentTime;
int _pingTimeUs;
size_t _openPingCount;
Util::TimeAverage _pingMicroS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between pingPeriod, pingTimeout, and pingMicroS ? And why we need a ping count? Isn't it enough to compare now with the time we got the last data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_pingMicroS is a TimeAverage for this value, i.e. averaging out multiple measurements over time, which removes one-offs etc.

_openPingCount is the open ping count, i.e. pings from server w/o client reply. Initially I intended this for debugging only, 1st validation.
Reviewing its use in WebSocketHandler::checkTimeout() is wrongly used IMHO:

  if (_openPingCount > 0 && _pingTimeout.count() > std::numeric_limits<double>::epsilon() &&
       _pingMicroS.average() >= _pingTimeout.count())
   {

instead it should just read

  if (_pingTimeout.count() > std::numeric_limits<double>::epsilon() &&
       _pingMicroS.average() >= _pingTimeout.count())
   {

So yes, _openPingCount should not be material and be removed.

--_openPingCount;
}
LOGA_TRC(WebSocket, "Pong received: " << us << "us, avg " << avg_us << "us over "
<< (int)_pingMicroS.duration() << "s, open pings left "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of all the excessive renaming in this PR, _pingMicroS is perhaps the only one that is now misleading and should've been changed. Alas.

And why not log the fractional seconds?

Copy link
Author

@sgothel sgothel Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As replied below (my 2nd one) .. TimeAverage manages arbitrary values, here ping-micro-seconds.
The TimeAverage::duration() in seconds only gives us a rough idea for how long this average has been taken, where values are added to the average .. like every 3 seconds by default (net::Config::WSPingPeriod).
The granularity in seconds is enough for the logging and manual validation.

const auto now = std::chrono::steady_clock::now();
const int64_t us = std::chrono::duration_cast<std::chrono::microseconds>
(now - _lastPingSentTime).count();
const double avg_us = _pingMicroS.add(_lastPingSentTime, static_cast<double>(us));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_pingMicroS.duration() returns seconds, but we are adding here micro seconds. Is this intentional? TimeAverage is so convoluted and unnecessarily over-engineered that it's not obvious what the units are (and no useful comments).

Copy link
Author

@sgothel sgothel Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a bug of mine, outch .. yes, should pass static_cast<double>(us)/1000000.0. Great catch, thank you! Note: My unit test didn't catch this either, b/c fluctuations are very low but I should have done a better job in interpretation of the values. So I take the blame here - bad mistake.

The us value in microseconds managed by TimeAverage is correct! (and so is the unit test)

Sorry, your comment confused myself .. the value managed by TimeAverage is of arbitrary nature (and dimension), as mentioned in TimeAverage::add So here microseconds is used.

The confusing part is probably that this value itself reflects time duration, however, TimeAverage::duration is orthogonal, as it only refers over the measured total time and not the semantics of this value (ping response duration in microseconds).
So yes, annotations in TimeAverage probably should be more emphasized, only the add method does. sigh

@mmeeks
Copy link
Contributor

mmeeks commented Sep 12, 2024

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

Sort of.

So I did a bit of reading myself; this is fine: I thought SocketPoll controlled lifecycle of the ClientSession - but in fact the DocumentBroker member:

/// All session of this DocBroker by ID.
SessionMap<ClientSession> _sessions;

template
class SessionMap : public std::map<std::string, std::shared_ptr >

holds a hard reference on it - so not a big deal =)

@sgothel
Copy link
Author

sgothel commented Sep 12, 2024

I had a nightmare last night (literally) that the lifecycle of ClientSession is essentially owned by the SocketPoll (is it?)

...

template class SessionMap : public std::map<std::string, std::shared_ptr >

holds a hard reference on it - so not a big deal =)

I was testing (~1 week ago) the socket-close on these criteria of this patch with firefox editing a document. The connection was build-up again and no artifacts were visible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Handle limited open Connections due to keepalive connections (cool#9621)
4 participants