
Conversation

Contributor

redbaron commented Nov 6, 2025

Change the deadline delay for the default DeadlineContextWatcherHandler to prevent a snowballing effect where connections are established in excess of the configured connection pool maximum size in the pathological case of many context cancellations.

Previously, with the default deadline of 0, on context cancellation pgx immediately marked the connection as closed, even though pgConn.asyncClose still needs time to complete and PG needs time to act on the cancellation request. The connection pooler (pgx pool or stdlib sql) checks `IsClosed()` on Release, recognizes the connection as closed, and immediately removes it from the pool, allowing a new connection to be established.

If establishing a new connection takes less time than a full end-to-end query cancellation, including the time the server needs to act on the cancellation request, and many context cancellation events happen per second, then the number of connections seen by the PostgreSQL server starts to accumulate, exceeding the configured total pool size.

It eventually reaches the server's `max_connections` limit, with the majority of connections being cancelled ones. No new connection can be established, bringing the app to a halt.

The new default deadline delay of 50ms is chosen to be larger than a pessimistic estimate of the time for a full query cancellation to complete. With the new deadline, pgx keeps the connection in the pool for up to 50ms after cancellation, effectively throttling how quickly cancelled queries can be replaced with new connections. Because new connection creation is now throttled, it can't outrun query cancellations, preventing the snowball effect.

It also has the nice side effect of increasing the chance of connection reuse if the query that would otherwise be cancelled completes in less than 50ms.
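
For reference, a minimal sketch of setting this delay explicitly when building a pool, assuming the pgx v5 `BuildContextWatcherHandler` hook and `DeadlineContextWatcherHandler` type; the DSN and the 50ms value are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgconn/ctxwatch"
	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool builds a pool whose connections, on context cancellation, set a
// deadline 50ms in the future on the underlying net.Conn instead of tearing
// the connection down immediately.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.ConnConfig.BuildContextWatcherHandler = func(conn *pgconn.PgConn) ctxwatch.Handler {
		return &pgconn.DeadlineContextWatcherHandler{
			Conn:          conn.Conn(),
			DeadlineDelay: 50 * time.Millisecond,
		}
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}

func main() {
	// Placeholder DSN for illustration only.
	pool, err := newPool(context.Background(), "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}
```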

Owner

jackc commented Nov 9, 2025

I built DeadlineContextWatcherHandler to support this behavior, but I'm not sure I want it as the default.

My approach to date has been that by default context cancellation should cause the canceled operation to return as quickly as possible.

Regardless, it seems that the underlying issue is the pool creating more connections than it has capacity for. I have some recollection of that being an issue in the past, but I thought it was resolved.

Contributor Author

redbaron commented Nov 9, 2025

> My approach to date has been that by default context cancellation should cause the canceled operation to return as quickly as possible.

Yes, but it effectively misinforms the application layer: the connection is still very much alive, but the app thinks it is gone for good. This creates the conditions for opening more connections than the app intends to.

> underlying issue is the pool creating more connections than it has capacity for.

The pool is doing the best it can: if pgConn.IsClosed() is true, it must free up the slot and allow a new connection to be created.

Ideally IsClosed() should be accurate and faithfully wait for the in-flight query to be truly aborted by the PG server. That can currently be achieved by setting DeadlineContextWatcherHandler.DeadlineDelay to infinity, but async cancellation can fail for various reasons, so there must be a time limit beyond which we just close the TCP connection as a last resort. 50ms is a reasonable limit for the async close to fully complete.

Owner

jackc commented Nov 9, 2025

> Ideally IsClosed() should be accurate and faithfully wait for the in-flight query to be truly aborted by the PG server.

This role is filled by reading from PgConn.CleanupDone(). That's what I thought was already preventing a new connection from being created while connections that are being cleaned up are still active. If that's not working then that is an underlying bug that should be fixed, rather than masking the bug with what is essentially a sleep.
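
For illustration only, waiting on that channel with a cap might look like the fragment below, where `conn` is a `*pgconn.PgConn` and the 5-second cap is an arbitrary assumption:

```go
// Wait for pgx to finish asynchronous cleanup of a cancelled connection
// before treating its pool slot as free. CleanupDone() is closed once
// asyncClose has finished; the 5s cap is an arbitrary safety valve.
select {
case <-conn.CleanupDone():
case <-time.After(5 * time.Second):
	// Give up waiting; the connection is already closed on the client side.
}
```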

Where / how are you measuring the connection count? Is it possible that it is including the temporary connections made to issue the "CancelQuery"? Those aren't full connections and aren't considered by the pool, but may be impacting this situation.

Contributor Author

redbaron commented Nov 9, 2025

> This role is filled by reading from PgConn.CleanupDone().

The cleanup channel is closed in a goroutine started by pgConn.asyncClose immediately after the query cancellation request is acknowledged by the PG server. As I understand it, that doesn't guarantee that the connection being closed has actually been closed by the server.

> Where / how are you measuring the connection count? Is it possible that it is including the temporary connections made to issue the "CancelQuery"? Those aren't full connections and aren't considered by the pool, but may be impacting this situation.

Connection count is measured on the PG side using pg_stat_activity. All the excess connections have the application_name we configure for the pgx pools. On the pgx side we don't use a custom context cancellation handler, and pgx connects directly to PG without any intermediaries. As I understand it, the auxiliary connections pgx opens to issue the CancelQuery command are not counted towards the real connection count by the PG server: they don't even authenticate the way normal connections do.

The scenario that maxes out server connections beyond the pgx pool max size (in our case max_connections was 6x the pool size) is the following (a code sketch follows the list):

  • The number of active pool.Acquire goroutines >> pool size (200-300 times more in our case). Acquires should never dry up, always applying pressure. Acquire should use a cancellable context; ours had a 2s timeout.
  • SQL queries on acquired connections should take less time than the context timeout.
  • The majority of Acquires will time out without success, but a few will be lucky enough to run a query.
  • Some of those that do run a query will have it cancelled mid-query, triggering pgConn.asyncClose.
  • If the PG server is overloaded with queries, cancellation processing gets delayed, building up connections unaccounted for by pgxpool.
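
A rough reproduction sketch of the scenario above, assuming a pool built roughly as in the earlier snippet; the goroutine count, 2s timeout, and pg_sleep duration are illustrative, not the actual production values:

```go
import (
	"context"
	"sync"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// hammer keeps far more Acquire goroutines in flight than the pool has
// capacity for. Each attempt shares a single 2s budget for Acquire plus the
// query, so queries that start late in that budget are cancelled mid-flight.
func hammer(ctx context.Context, pool *pgxpool.Pool) {
	var wg sync.WaitGroup
	for i := 0; i < 5000; i++ { // >> pool max size
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
				conn, err := pool.Acquire(attemptCtx)
				if err == nil {
					// The query is shorter than the timeout, but may still be
					// cancelled if Acquire consumed most of the 2s budget.
					_, _ = conn.Exec(attemptCtx, "select pg_sleep(0.5)")
					conn.Release()
				}
				cancel()
			}
		}()
	}
	wg.Wait()
}
```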

Owner

jackc commented Nov 28, 2025

> The cleanup channel is closed in a goroutine started by pgConn.asyncClose immediately after the query cancellation request is acknowledged by the PG server. As I understand it, that doesn't guarantee that the connection being closed has actually been closed by the server.

I looked at all the locations where the cleanup channel is closed. In every case the net.Conn is closed first. So at least on the client side the connection is entirely terminated. But I suppose that on the server side the connection could still be open for a little while. 🤔

Looking at the PG docs it doesn't seem there is a good way for the client to know when the server has actually finished cleaning up the connection. From https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-TERMINATION

> The normal, graceful termination procedure is that the frontend sends a Terminate message and immediately closes the connection. On receipt of this message, the backend closes the connection and terminates.

We are already doing that.

But a simple change would be to add the sleep / delay to asyncClose. That would preserve the behavior of the caller of an interrupted query getting control back as quickly as possible while delaying the closing of the cleanup channel for a while.

It also might be possible to use CloseWrite and Read to wait a little while for the server. With that we could wait up to a few seconds if the server never closed the connection, but also close the cleanup channel immediately if the server did close it. Might need to review #637 though.
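
A purely illustrative fragment of that CloseWrite / Read idea; the function name, buffer size, and 3-second read deadline are assumptions, not an actual pgconn change:

```go
import (
	"net"
	"time"
)

// waitForServerClose half-closes our side of the connection, then reads
// until the server closes its end or the read deadline expires, and finally
// closes the socket. The 3s deadline is an arbitrary assumption.
func waitForServerClose(conn net.Conn) {
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		_ = tcpConn.CloseWrite() // signal we are done sending
	}
	_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
	buf := make([]byte, 1024)
	for {
		if _, err := conn.Read(buf); err != nil {
			// io.EOF means the server closed its side; a timeout means we give up.
			break
		}
	}
	_ = conn.Close()
}
```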
