
Stability Improvements#489

Draft
lionkor wants to merge 36 commits into BeamMP:minor from lionkor:minor

Conversation

@lionkor
Contributor

@lionkor lionkor commented Apr 19, 2026

  1. Issue: Broken QueueThread joining logic, which checks whether the thread is joinable, which says nothing about whether it was ever joined. As a result, the network thread can reach the end of scope while still joinable, and std::thread's destructor calls std::terminate when the thread was not joined, crashing the server. Fix: Use std::jthread, which joins on destruction via RAII.
  2. Issue: Slow hardware or an overloaded server can cause a client to disconnect and its socket to close before TNetwork::DisconnectClient is called, while IsDisconnected() still returns false. In that case, .remote_endpoint() can fail and throw an exception, which is not caught, so std::terminate kills the server. I observed this; even though it only happened in extremely contrived scenarios, it did happen and crashed the server. Fix: Wrap the whole block in a try/catch. An alternate fix would have been to pass an error_code, but the result is the same. This fix alone is not quite enough, though: it causes the mClientMap to ignore the disconnection, and all paths that do if (c.IsDisconnected()) fail to decrement mClientMap, which isn't always the right behavior afaik. Both are resolved by a later fix.
  3. Issue: Connection limiting ("DDoS protection") is broken: in special cases, exceptions cause it not to decrement, slowly filling up the mClientMap. The mutexes are locked and unlocked manually, so if an exception is thrown while the mutex is held, it stays locked (.address().to_string() and .remote_endpoint() can both throw, and both are called in the locked section without RAII unlocking). Fix: Replace the manual map and mutex handling with a new class, TConnectionLimiter, and an associated guard object, TConnectionLimiter::TGuard, which uses RAII to correctly track IP-to-connection-count associations the way the previous code was trying to. This works across exceptions and other weird failure modes. Each connection's main thread now owns a guard, which decrements the counter on destruction, so both the per-IP limits and the global limits are enforced. Also added some stats about this to the status command so that server owners can observe it in action.
  4. Issue: In TNetwork::TCPServerMain's accept loop, .address().to_string() (which can throw) is called even when accept failed. I never observed a crash from this, but I was touching that code anyway. Fix: Explicitly handle the error case first, then get the IP, etc.
  5. Issue: ReadWithTimeout spawned a new async context for each read, and then ran that context's event loop in a new thread. Not only did every single one of those reads SPAWN A THREAD(!!!), it also started an io context, which takes a file descriptor, so this arguably made DDoS more effective, not less.
  6. Issue: Client disconnect can race because it is done on multiple threads (a TOCTOU bug). For example, the Looper and a normal disconnect call can happen at the same time: both check is_open, which can be true and then flip to false right after the check, causing a segfault in asio's internals. Fix: Added an atomic compare-and-swap (CAS) mechanism that acts as a lock for the socket disconnect/close, and adjusted the other places that checked is_open.
  7. Issue: A Lua panic calls the panic handler, and if the error supplied with the panic is not a string, or is otherwise invalid, it triggers another panic inside the panic handler. This recurses and eventually crashes the program in one of many fun ways. Fix: Use raw Lua functions to check whether the top of the stack is a string, and only then print it; otherwise print that there was a panic and leave it at that.
  8. Issue: error() crashes the server, because sol::error's constructor expects a std::string (via lua_tostring or the __tostring metamethod), which doesn't exist if the error is, for example, nil. I reported this to sol2, but it might be an issue only in the older version we're using. Fix: Fixed as part of the next issue:
  9. Issue: TLuaResult was used/accessed from multiple threads, including sol::objects accessed from multiple threads. Each access to a TLuaResult::Result touched the Lua stack of that state from outside that state's thread, which is unsafe. This consistently led to issues and sometimes crashes. Fix: TLuaResult now always marshals results into a detached result variant. This allocates, but is unlikely to impact hot paths, as most results are empty or hold primitive types.
  10. Issue: The HTTP code retains a curl handle per thread and never cleans them up. With one new thread per client, each doing at least an auth request, this quickly exhausts all file descriptors. It manifests as DNS resolutions failing, because the server cannot open a socket to send the DNS query. Fix: The HTTP code now retains a pool of reusable handles, which clean up automatically via RAII. I tried to build this in a way that doesn't modify the code too much, so I kept it global and static.
  11. Issue: Crash when accessing an expired std::weak_ptr<TClient>. This can happen when we check .expired() and then call .lock(), which is, of course, another TOCTOU (time-of-check vs. time-of-use) bug. What you're supposed to do instead is call .lock(), which always returns a std::shared_ptr<>, and then check std::shared_ptr<>::operator bool: an expired std::weak_ptr returns a default-constructed std::shared_ptr, which evaluates to false when converted to bool. Fix: Replaced all uses of .expired() and similar checks with the correct pattern. This was a lot of search and manual replace :D.

I used LLMs to help with writing unit-tests, but those do not compile into the final executable anyway. If this is undesired, I'm happy to remove that code.


By creating this pull request, I understand that code that is AI generated or otherwise automatically generated may be rejected without further discussion.
I declare that I fully understand all code I pushed into this PR, and wrote all this code myself and own the rights to this code.

lionkor added 30 commits April 7, 2026 20:25
this happens when, somehow, the client disconnects before we get here.
I had this happen when breaking in the debugger and continuing, which
leads to clients timing out (client-side timeouts).
this was exhausting file descriptors with enough concurrent reads, from
what I can tell. Either way, spawning a new OS thread per read is not
the way.
Because this is so critical, I added unit-tests for that behavior.
with this many http connections, we were exhausting all available file
descriptors, leading to a dead server that keeps CLOSE_WAIT tcp sockets.
Because we want to retain the behavior that we keep connections open for
reuse, we instead make a pool of 8 curl instances now, shared between
all the different requests.
the previous IoCtx was never being polled
between the time we check for `is_open` and the actual disconnect, the
socket could already have been disconnected by another thread (TOCTOU).
Furthermore, the disconnects can race, causing a segfault or a similar
issue in asio's internals.
When sol2 does stack::get, it can panic, which causes the stack to
explode, corrupt it, and then any subsequent action crashes the server.
this massively improves thread safety and cleanly serializes accesses
into the lua engine's result objects where accesses before were
extremely unsafe and could access a corrupt/invalid stack.

this fixes various obscure crashes related to accessing results,
without changing any observable behavior.
You're supposed to .lock() instead of TOCTOU checking, of course.
Not sure what I was thinking when I built that. .lock() returns a
default constructed std::shared_ptr on error, which is `false` via
`operator bool`.