Draft
Conversation
this happens when, somehow, the client disconnects before we get here. I had this happen when breaking in the debugger and continuing, which leads to clients timing out (client-side timeouts).
this was exhausting file descriptors with enough concurrent reads, from what I can tell. Either way, spawning a new OS thread per read is not the way. Because this is so critical, I added unit-tests for that behavior.
with this many http connnections, we were exhausting all available file descriptors, leading to a dead server that keeps CLOSE_WAIT tcp sockets. Because we want to retain the behavior that we keep connections open for reuse, we instead make a pool of 8 curl instances now, shared between all the different requests.
… jthread jthread so it's cancellable
the previous IoCtx was never being polled
between the time we check for `is_open` and the actual disconnect, the socket could already have been disconnected by another thread (TOCTOU). Furthermore, the disconnects can race causing a segfault or similar issue in the asio's internals.
When sol2 does stack::get, it can panic, which causes the stack to explode, corrupt it, and then any subsequent action crashes the server.
this massively improves thread safety and cleanly serializes accesses into the lua engine's result objects where accesses before were extremely unsafe and could access a corrupt/invalid stack. this fixes various obscure crashes related to accessing results, without changing any observable behavior.
You're supposed to .lock() instead of TOCTOU checking, of course. Not sure what I was thinking when I built that. .lock() returns a default constructed std::shared_ptr on error, which is `false` via `operator bool`.
GCC 11's C++ stdlib does a weird maneuver here where it needs to know the size of the std::pair<>::second's type. So we wrap it in a ptr.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
joinablewhich is unrelated to doingjoin, leading to the case where the network thread crashes when it reaches the end of scope, due to astd::terminatein the dtor ofstd::thread(throws on destruction when unjoined). Fix: Usingstd::jthreadwhich uses RAII to join on destruction.TNetwork::DisconnectClientis called, butIsDisconnected()will still be false. In this case,.remote_endpoint()can fail, which throws an exception, which is not caught and thusstd::terminates the server. I observed this, even though it was only in extremely contrived scenarios, it still happened and crashed the server. Fix: Wrapping the whole block in a try/catch. Alternate fix would have been to pass anerror_codebut the result is the same. This fix is not quite enough though, a later fix resolves the remaining issue that this causes themClientMapto ignore the disconnection. Also, all paths that doif (c.IsDisconnected())will fail to decrement themClientMap, which isn't always the right behavior afaik. Fixed in another fix though.mClientMap) in special cases. Mutexes are locked and unlocked manually which can (and will) lead to cases where the mutex is locked, an exception is thrown in the subsequent line (address().to_string()can throw, same with.remote_endpoint(), both of which are being called in the locked context without RAII unlocking). Fix: Replace manual map and mutex handling with a new class,TConnectionLimiter, and an associated "Guard" objectTConnectionLimiter::TGuardwhich uses RAII to correctly keep track of IP-and-connection-count associations, the way the previous code was trying to do. This works across exceptions and other weird issues. Each connection's main thread now owns a guard, which, on destruction, decrements the counter. This way both the per-IP limits as well as the global limits are enforced. Also added some stats about this to thestatuscommand to ensure that server owners can observe this in action..address().to_string()can throw and is called even ifacceptfailed, inTNetwork::TCPServerMain's accept loop. I didn't observe this and it didn't cause crashes, but I was touching that part anyway. Fix: Explicitly handle the error case first, then get the IP, etc.ReadWithTimeoutspawned a new async context for each read, and then ran that context's event loop in a new thread. This means that, not only did every one of those reads SPAWN A THREAD(!!!), it also started an io context, which gets an fd, so this made DDoS arguably more effective, not less.Looperand a normal disconnect call can happen at the same time, because they check foris_openwhich can be true, and then change to false right after the check, causing a segfault in asio internals. Fix: Added an atomic compare-and-swap (CAS) mechanic that acts like a lock for the socket disconnect/close, and adjusted other places that checkedis_open.error()crashes the server, due tosol::error's constructor expecting astd::string(lua_tostringor__tostringmeta method), which doesn't exist if the error is, for example,nil. I reported this to sol2, but it might be an issue only in this older version we're using. Fix: Fixed as part of the next issue:TLuaResultwas used/accessed from multiple threads, includingsol::objectaccessed from multiple threads. This lead to each access of aTLuaResult::Resultaccessing the Lua stack of that state (from outside that state's thread, which is unsafe). This consistently lead to issues and sometimes crashes. Fix:TLuaResultnow always marshals results into a detached result variant. This allocates, but this is unlikely to impact the hot paths, as most results will be empty or have primitive types.std::weak_ptr<TClient>. This can happen when we check for.expired(), and then.lock(), which is, of course, another TOCTOU (time of check vs time of use) bug. What youre SUPPOSED to do instead, is locking, which always returns astd::shared_ptr<>, and then checkstd::shared_ptr<>::operator bool. An expiredstd::weak_ptrwill return a default-constructedstd::shared_ptr, which evaluates tofalsewhen converted tobool. Fix: Replaced all uses of.expired()and other such checks with the correct pattern. This was a lot of search and manual replace :D.I used LLMs to help with writing unit-tests, but those do not compile into the final executable anyway. If this is undesired, I'm happy to remove that code.
By creating this pull request, I understand that code that is AI generated or otherwise automatically generated may be rejected without further discussion.
I declare that I fully understand all code I pushed into this PR, and wrote all this code myself and own the rights to this code.