
Stability Improvements#489

Draft
lionkor wants to merge 36 commits into BeamMP:minor from lionkor:minor

Conversation

@lionkor
Contributor

@lionkor lionkor commented Apr 19, 2026

  1. Issue: Broken QueueThread joining logic, which checks whether the thread is joinable, which says nothing about whether it was ever joined. As a result, the network thread can reach the end of scope while still joinable, and std::thread's destructor calls std::terminate when the thread was not joined, crashing the server. Fix: Use std::jthread, which joins on destruction via RAII.
  2. Issue: Slow hardware or an overloaded server can cause a client to disconnect and its socket to close before TNetwork::DisconnectClient is called, while IsDisconnected() still returns false. In that case, .remote_endpoint() can fail and throw an exception, which is not caught, so std::terminate kills the server. I observed this; even though it only happened in extremely contrived scenarios, it did happen and crashed the server. Fix: Wrap the whole block in a try/catch. An alternate fix would have been to pass an error_code, but the result is the same. This fix alone is not quite enough, though: it causes the mClientMap to ignore the disconnection, and all paths that do if (c.IsDisconnected()) fail to decrement mClientMap, which isn't always the right behavior afaik. Both are resolved by a later fix.
  3. Issue: Connection limiting ("DDoS protection") is broken: in special cases, exceptions cause it not to decrement, slowly filling up the mClientMap. The mutexes are locked and unlocked manually, so if an exception is thrown while the mutex is held, it stays locked (.address().to_string() and .remote_endpoint() can both throw, and both are called in the locked section without RAII unlocking). Fix: Replace the manual map and mutex handling with a new class, TConnectionLimiter, and an associated guard object, TConnectionLimiter::TGuard, which uses RAII to correctly track IP-to-connection-count associations the way the previous code was trying to. This works across exceptions and other weird failure modes. Each connection's main thread now owns a guard, which decrements the counter on destruction, so both the per-IP limits and the global limits are enforced. Also added some stats about this to the status command so that server owners can observe it in action.
  4. Issue: In TNetwork::TCPServerMain's accept loop, .address().to_string() (which can throw) is called even when accept failed. I never observed a crash from this, but I was touching that code anyway. Fix: Explicitly handle the error case first, then get the IP, etc.
  5. Issue: ReadWithTimeout spawned a new async context for each read, and then ran that context's event loop in a new thread. Not only did every single one of those reads SPAWN A THREAD(!!!), it also started an io context, which takes a file descriptor, so this arguably made DDoS more effective, not less.
  6. Issue: Client disconnect can race because it is done on multiple threads (a TOCTOU bug). For example, the Looper and a normal disconnect call can happen at the same time: both check is_open, which can be true and then flip to false right after the check, causing a segfault in asio's internals. Fix: Added an atomic compare-and-swap (CAS) mechanism that acts as a lock for the socket disconnect/close, and adjusted the other places that checked is_open.
  7. Issue: A Lua panic calls the panic handler, and if the error supplied with the panic is not a string, or is otherwise invalid, it triggers another panic inside the panic handler. This recurses and eventually crashes the program in one of many fun ways. Fix: Use raw Lua functions to check whether the top of the stack is a string, and only then print it; otherwise print that there was a panic and leave it at that.
  8. Issue: error() crashes the server, because sol::error's constructor expects a std::string (via lua_tostring or the __tostring metamethod), which doesn't exist if the error is, for example, nil. I reported this to sol2, but it might be an issue only in the older version we're using. Fix: Fixed as part of the next issue:
  9. Issue: TLuaResult was used/accessed from multiple threads, including sol::objects accessed from multiple threads. Each access to a TLuaResult::Result touched the Lua stack of that state from outside that state's thread, which is unsafe. This consistently led to issues and sometimes crashes. Fix: TLuaResult now always marshals results into a detached result variant. This allocates, but is unlikely to impact hot paths, as most results are empty or hold primitive types.
  10. Issue: The HTTP code retains a curl handle per thread and never cleans them up. With one new thread per client, each doing at least an auth request, this quickly exhausts all file descriptors. It manifests as DNS resolutions failing, because the server cannot open a socket to send the DNS query. Fix: The HTTP code now retains a pool of reusable handles, which clean up automatically via RAII. I tried to build this in a way that doesn't modify the code too much, so I kept it global and static.
  11. Issue: Crash when accessing an expired std::weak_ptr<TClient>. This can happen when we check .expired() and then call .lock(), which is, of course, another TOCTOU (time-of-check vs. time-of-use) bug. What you're supposed to do instead is call .lock(), which always returns a std::shared_ptr<>, and then check std::shared_ptr<>::operator bool: an expired std::weak_ptr returns a default-constructed std::shared_ptr, which evaluates to false when converted to bool. Fix: Replaced all uses of .expired() and similar checks with the correct pattern. This was a lot of search and manual replace :D.

I used LLMs to help with writing unit-tests, but those do not compile into the final executable anyway. If this is undesired, I'm happy to remove that code.


By creating this pull request, I understand that code that is AI generated or otherwise automatically generated may be rejected without further discussion.
I declare that I fully understand all code I pushed into this PR, and wrote all this code myself and own the rights to this code.

lionkor added 30 commits April 7, 2026 20:25
this happens when, somehow, the client disconnects before we get here.
I had this happen when breaking in the debugger and continuing, which
leads to clients timing out (client-side timeouts).
this was exhausting file descriptors with enough concurrent reads, from
what I can tell. Either way, spawning a new OS thread per read is not
the way.
Because this is so critical, I added unit-tests for that behavior.
with this many http connections, we were exhausting all available file
descriptors, leading to a dead server that keeps CLOSE_WAIT tcp sockets.
Because we want to retain the behavior that we keep connections open for
reuse, we instead make a pool of 8 curl instances now, shared between
all the different requests.
the previous IoCtx was never being polled
between the time we check for `is_open` and the actual disconnect, the
socket could already have been disconnected by another thread (TOCTOU).
Furthermore, the disconnects can race, causing a segfault or a similar
issue in asio's internals.
When sol2 does stack::get, it can panic, which causes the stack to
explode, corrupt it, and then any subsequent action crashes the server.
this massively improves thread safety and cleanly serializes accesses
into the lua engine's result objects where accesses before were
extremely unsafe and could access a corrupt/invalid stack.

this fixes various obscure crashes related to accessing results,
without changing any observable behavior.
You're supposed to .lock() instead of TOCTOU checking, of course.
Not sure what I was thinking when I built that. .lock() returns a
default constructed std::shared_ptr on error, which is `false` via
`operator bool`.