-
-
Notifications
You must be signed in to change notification settings - Fork 579
Description
simply enabling DEBUG logging causes WebSocket connections to fail when sending large messages containing multi-byte UTF-8 characters (e.g., Japanese, Chinese, emoji).
Summary
The websockets library itself fragments large messages at byte boundaries, and then fails to log those same frames it created. This is a bug in the library's Frame.__str__() method, not a user error.
When DEBUG logging is enabled, Frame.__str__() raises UnicodeDecodeError for OP_TEXT frames that contain incomplete UTF-8 byte sequences.
This happens when:
- A large text message is sent (e.g.,
websocket.send(large_json_with_japanese)) - The websockets library fragments the message into multiple frames (at byte boundaries, not UTF-8 character boundaries)
- DEBUG logging attempts to log the sent frame with
logger.debug("> %s", frame) Frame.__str__()tries to decode the frame data as UTF-8, which fails if the frame ends in the middle of a multi-byte character
This causes the WebSocket connection to close with code 1007 (INVALID_DATA) even though the message data itself is valid - it's just split across multiple frames.
Environment
- websockets version: 16.0
- Python version: 3.11
- OS: Linux (Azure App Service)
Reproduction
Minimal example (direct Frame creation)
from websockets.frames import OP_TEXT, Frame
# Create a large Japanese text (UTF-8 multi-byte characters)
text = "あいうえおかきくけこ" * 700
data = text.encode("utf-8")
# Truncate at position 16373 (middle of a multi-byte character)
# This simulates what happens when a large message is fragmented
truncated = data[:16373]
# OP_TEXT frame with incomplete UTF-8 (simulating first fragment)
frame = Frame(opcode=OP_TEXT, data=truncated, fin=False)
# This raises UnicodeDecodeError
print(str(frame))Output:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 16371-16372: unexpected end of data
Realistic example (WebSocket server and client)
This example demonstrates the actual scenario where a user simply sends a large message and the library fails:
import asyncio
import logging
import websockets
# Enable DEBUG logging - this is all it takes to trigger the bug
logging.basicConfig(level=logging.DEBUG)
async def echo_handler(websocket):
"""Simple echo server - receives and echoes back messages."""
async for message in websocket:
await websocket.send(message)
async def test_large_unicode_message():
"""Test that sending a large Unicode message fails with DEBUG logging."""
# Start a WebSocket server
async with websockets.serve(echo_handler, "localhost", 8765):
# Connect as a client
async with websockets.connect("ws://localhost:8765") as websocket:
# Create a large message with Japanese text (multi-byte UTF-8)
# This is a realistic scenario - e.g., sending a large JSON with Japanese content
large_message = "こんにちは世界" * 3000 # ~63KB of Japanese text
try:
# This will fail when the library tries to log the fragmented frame
await websocket.send(large_message)
response = await websocket.recv()
print(f"Success: received {len(response)} characters")
except websockets.exceptions.ConnectionClosedError as e:
print(f"Connection closed unexpectedly: {e}")
# Expected output with DEBUG logging:
# ConnectionClosedError: sent 1007 (invalid frame payload data)
# unexpected end of data at position XXXXX
if __name__ == "__main__":
asyncio.run(test_large_unicode_message())Expected behavior: Message is sent and echoed back successfully.
Actual behavior with DEBUG logging:
ConnectionClosedError: sent 1007 (invalid frame payload data)
unexpected end of data at position 16373
Note: The same code works perfectly when logging level is INFO or higher. The bug only manifests when DEBUG logging is enabled.
Expected behavior
Frame.__str__() should gracefully handle incomplete UTF-8 sequences, similar to how it handles OP_CONT frames.
Actual behavior
Frame.__str__() raises UnicodeDecodeError for OP_TEXT frames, which can propagate through the logging system and cause the connection to be closed.
Root cause
In frames.py, the __str__ method has inconsistent exception handling:
For OP_TEXT (line 159-162) - NO exception handling:
if self.opcode is OP_TEXT:
# Decode the entire payload then elide later if necessary.
data = repr(bytes(self.data).decode()) # <-- No try-except!For OP_CONT/PING/PONG (line 179-182) - HAS exception handling:
try:
data = repr(bytes(self.data).decode())
coding = "text"
except (UnicodeDecodeError, AttributeError):
# fallback to binary representationImpact
Users are not doing anything wrong - they're simply:
- Enabling DEBUG logging (a standard practice for debugging)
- Sending a normal text message via
websocket.send()
The library then:
- Fragments the message into multiple frames (internal behavior)
- Attempts to log those frames
- Fails to decode its own fragmented frames
- Closes the connection with an error
When DEBUG logging is enabled globally (e.g., logging.getLogger().setLevel(logging.DEBUG)), the following happens:
Protocol.debugbecomesTrue(line 108 in protocol.py)- When sending a large message, the library fragments it into multiple frames
- For each frame being sent,
self.logger.debug("> %s", frame)is called Frame.__str__()is invoked- For fragmented OP_TEXT frames with incomplete UTF-8,
UnicodeDecodeErroris raised - The exception propagates and the connection is closed with code 1007
This is particularly problematic for applications that:
- Use DEBUG logging in development/staging environments
- Send large text messages (which are fragmented by the library)
- Process non-ASCII text (like Japanese, which uses 3-byte UTF-8 sequences)
Suggested fix
Add the same exception handling to OP_TEXT frames:
if self.opcode is OP_TEXT:
try:
data = repr(bytes(self.data).decode())
except (UnicodeDecodeError, AttributeError):
# Fallback to binary representation for incomplete UTF-8
binary = self.data
if len(binary) > self.MAX_LOG_SIZE // 3:
cut = (self.MAX_LOG_SIZE // 3 - 1) // 3
binary = b"".join([binary[: 2 * cut], b"\x00\x00", binary[-cut:]])
data = " ".join(f"{byte:02x}" for byte in binary)
coding = "binary (incomplete UTF-8)"Workaround
As a workaround, set the websockets logger level to INFO or higher:
import logging
logging.getLogger("websockets").setLevel(logging.INFO)This prevents Frame.__str__() from being called since Protocol.debug will be False.
Related
- This issue is similar to how the library handles UTF-8 decode errors in the actual message processing (which correctly uses incremental UTF-8 decoding in the Assembler class)
- The
__str__method is only used for debugging/logging purposes, so it should be fault-tolerant