When we upgraded the Braintree Python SDK to 4.42.0,
we started getting RemoteDisconnected errors in our Lambda functions.
Why?
First ingredient: urllib3 race condition
Before sending a request on a pooled connection,
urllib3 checks whether that connection has been dropped, using
def is_connection_dropped(conn: BaseHTTPConnection) -> bool:  # Platform-specific
    """
    Returns True if the connection is dropped and should be closed.
    :param conn: :class:`urllib3.connection.HTTPConnection` object.
    """
    return not conn.is_connected
which calls
@property
def is_connected(self) -> bool:
    if self.sock is None:
        return False
    return not wait_for_read(self.sock, timeout=0.0)
I think the idea is:
- if the socket is readable when no request is in progress, what it has to read must be a FIN,
- so the connection is dropped, and urllib3 gets another.
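The "readable while idle means dropped" heuristic can be seen with plain sockets. This is a minimal sketch, assuming a local socket pair stands in for the HTTP connection; `looks_dropped` is a hypothetical helper that mimics the zero-timeout poll using `select`, not urllib3's actual implementation.

```python
import select
import socket


def looks_dropped(sock: socket.socket) -> bool:
    """Hypothetical helper: poll for readability without blocking (timeout 0)."""
    readable, _, _ = select.select([sock], [], [], 0.0)
    return bool(readable)


# A connected pair of sockets stands in for client and server.
client, server = socket.socketpair()

# Idle and open: nothing to read, so the connection looks alive.
before = looks_dropped(client)

# The peer closes its end, as a server does when a keep-alive timeout expires.
server.close()

# Now the idle socket is readable, and what there is to read is end-of-stream.
after = looks_dropped(client)
data = client.recv(1)  # an empty read signals that the peer closed

client.close()
print(before, after, data)
```

The check can only report what has already arrived. A close that happens after the poll returns is invisible to it.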
There’s a race condition:
- if the connection is dropped after the wait_for_read,
- but before we send our request,
- then the request will fail with RemoteDisconnected.
The race is inevitable. There must be a window between check and send, during which the connection might be dropped.
Let’s prove it.
First, we’ll set up a keep-alive HTTP server which drops idle connections after TIMEOUT seconds:
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# TIMEOUT, SERVER_HOST and SERVER_PORT are constants defined elsewhere in the script.

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive
    timeout = TIMEOUT  # socket timeout: idle connections are closed after this long

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format, *args):  # noqa: A002
        pass  # suppress per-request and timeout logs

def start_server() -> ThreadingHTTPServer:
    server = ThreadingHTTPServer((SERVER_HOST, SERVER_PORT), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
Next, a client which disables the is_connection_dropped check, GETs, sleeps past TIMEOUT seconds, then GETs again.
This client gets RemoteDisconnected for sure, proving the race in principle.
import requests
import urllib3
from requests.exceptions import RequestException
from time import sleep

def run_patched_client():
    print("=== PATCHED CLIENT ===")
    original_util = urllib3.util.connection.is_connection_dropped
    original_pool = urllib3.connectionpool.is_connection_dropped
    # Disable the check in both places it can be looked up.
    urllib3.util.connection.is_connection_dropped = lambda _: False
    urllib3.connectionpool.is_connection_dropped = lambda _: False
    try:
        session = requests.Session()
        print("Making first request...", end=" ")
        response = session.get(SERVER_URL)
        print(response.status_code)
        print("Sleeping past the keep-alive timeout so the connection is dropped")
        sleep(TIMEOUT + 1)
        try:
            print("Making second request...", end=" ")
            response = session.get(SERVER_URL)
            print(response.status_code)
        except RequestException as e:
            if "RemoteDisconnected" in str(e):
                print("RemoteDisconnected")
            else:
                raise  # anything else is unexpected; don't swallow it
    finally:
        # Restore the real check for the clients that run afterwards.
        urllib3.util.connection.is_connection_dropped = original_util
        urllib3.connectionpool.is_connection_dropped = original_pool
Finally, a pool of normal clients.
These clients get RemoteDisconnected once in a while, proving the race in practice.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.exceptions import RequestException
from time import sleep

def run_normal_clients():
    print("=== NORMAL CLIENTS ===")

    def worker():
        session = requests.Session()
        session.get(SERVER_URL)
        sleep(TIMEOUT)  # wake up right around the moment the server drops the connection
        try:
            session.get(SERVER_URL)
            return "ok"
        except RequestException as e:
            return str(e)

    num_ok = 0
    num_remote_disconnected = 0
    with ThreadPoolExecutor(max_workers=NUM_NORMAL_CLIENTS) as pool:
        futures = [pool.submit(worker) for _ in range(NUM_NORMAL_CLIENTS)]
        for f in as_completed(futures):
            result = f.result()
            if result == "ok":
                num_ok += 1
            elif "RemoteDisconnected" in result:
                num_remote_disconnected += 1
    print(f"RemoteDisconnected: {num_remote_disconnected}/{NUM_NORMAL_CLIENTS}")
    print(f"OK: {num_ok}/{NUM_NORMAL_CLIENTS}")
Orchestrate (full script):
if __name__ == "__main__":
    server = start_server()
    try:
        run_patched_client()
        run_normal_clients()
    finally:
        server.shutdown()
And run:
% uv run remote_disconnected_demo.py
=== PATCHED CLIENT ===
Making first request... 200
Sleeping past the keep-alive timeout so the connection is dropped
Making second request... RemoteDisconnected
=== NORMAL CLIENTS ===
RemoteDisconnected: 4/50
OK: 46/50
Second ingredient: session per gateway
braintree depends on requests, which depends on urllib3.
Version 4.42 changed how braintree manages requests.Session objects.
Before, a BraintreeGateway created a session per POST.
In 4.42, a BraintreeGateway stores an Http internally, as self.config._http_strategy.
The first time you make a request through the gateway, the Http creates a requests.Session and stores it as self._thread_local.session.
A session per gateway (per thread), not per POST.
A sensible idea: faster and less wasteful.
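Why reuse is faster can be sketched by counting how many TCP connections a server sees. This is a rough analogy rather than braintree's or requests' code: it uses the standard library's http.client in place of a requests.Session, and `CountingHandler` and `get` are hypothetical names made up for the illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

connections_opened = 0
lock = threading.Lock()


class CountingHandler(BaseHTTPRequestHandler):
    """Hypothetical keep-alive handler that counts accepted connections."""

    protocol_version = "HTTP/1.1"

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        global connections_opened
        super().setup()
        with lock:
            connections_opened += 1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format, *args):
        pass  # keep the output quiet


server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)  # port 0: any free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()


def get(conn: http.client.HTTPConnection) -> None:
    conn.request("GET", "/")
    conn.getresponse().read()


# One connection per request, analogous to a session per POST.
for _ in range(3):
    conn = http.client.HTTPConnection("127.0.0.1", port)
    get(conn)
    conn.close()
per_request = connections_opened

# One long-lived connection, analogous to a session per gateway.
shared = http.client.HTTPConnection("127.0.0.1", port)
for _ in range(3):
    get(shared)
shared.close()
reused = connections_opened - per_request

server.shutdown()
server.server_close()
print(per_request, reused)
```

The long-lived variant skips the connection setup on every request after the first. The cost is that the same connection can now sit idle between requests.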
Third ingredient: module-level gateway
It’s standard, recommended practice in AWS Lambda to set up some resources (configuration, connections, etc.) at module-level, so they’re created once at init time, and then re-used across warm invocations. A sensible idea: faster and less wasteful.
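The module-level pattern can be sketched as follows. `make_gateway` is a hypothetical stand-in for constructing an expensive client (a real function would build its SDK gateway there), and the two direct calls to `handler` simulate two warm invocations in the same execution environment.

```python
import itertools

_created = itertools.count(1)


def make_gateway() -> dict:
    """Hypothetical stand-in for an expensive client; records which creation it was."""
    return {"instance": next(_created)}


# Module level: runs once, when the execution environment initialises.
GATEWAY = make_gateway()


def handler(event, context):
    # Runs on every invocation; warm invocations see the same GATEWAY object.
    return {"gateway_instance": GATEWAY["instance"]}


# Simulate two warm invocations of the same environment.
first = handler({}, None)
second = handler({}, None)
print(first, second)
```

Both invocations see the same object. Anything that object holds, such as an open session, also persists between invocations.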
Mixing the ingredients
You initialise the BraintreeGateway at module-level.
The first time your function is invoked, the gateway creates a requests.Session, which opens a keep-alive connection.
Your function is invoked again a while later, in the same warm environment,
just as that idle connection reaches the remote's keep-alive timeout,
and the remote disconnects between urllib3's check and send.
So RemoteDisconnected.
This will happen at a low rate, only when you get unlucky with the timing. But at scale a low rate is a high number.