
I'm running a Windows server on AWS that serves data to IoT devices, but after a while the server stops responding to requests because it hangs on the s.accept() call. I've determined that this happens because the server has too many TCP connections open, so the OS won't allocate any more. That makes sense, but what doesn't make sense to me is why those connections are still open, because they should all have been closed. Here is an example from my code with parts omitted for safety:


import socket
import ssl
import select
from datetime import datetime
from threading import Thread


def connection(conn, addr):
    conn.settimeout(10)
    data = None
    connection_time = datetime.now()
    n_items = 0
    try:
        print(connection_time.strftime("[%d/%m/%Y, %H:%M:%S] "), "new connection started:", addr)
        data = get_info(conn)
        print(addr, data)
        
        # serve client here, protocol omitted

    except Exception as e:
        print(f"{addr} connection error: {e}")
    if data is not None:
        add_connection_info(addr, data, connection_time)
    try:
        conn.close()
        print(connection_time.strftime("[%d/%m/%Y, %H:%M:%S] "), "connection ended:", addr)
    except Exception as e:
        print(f"close failed: {addr} ; {e}")


if __name__ == '__main__':
    ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ssl_context.load_cert_chain(cert, key, password=*omitted*)
    s = socket.socket()
    s = ssl_context.wrap_socket(s, server_side=True)
    host = "0.0.0.0" 
    port = 12345  # not the actual port

    print('Server started:', host, port)

    s.bind((host, port))  # Bind to the port
    s.listen()  # Now wait for client connection.
    s.setblocking(False)
    # Join completed threads and check connection status
    threads = []
    while True:
        for thread in threads:
            thread.join(0)
        threads = [t for t in threads if t.is_alive()]
        print(f"{len(threads)} active connections")
        try:
            # Use select to wait for a connection or timeout
            rlist, _, _ = select.select([s], [], [], 100)  # 100 seconds timeout
            if s in rlist:
                s.settimeout(10)
                # TODO: timeout here
                c, addr = s.accept()
                print(f"Accepted connection from {addr}")
                thread = Thread(target=connection, args=(c, addr))
                #thread.daemon = True
                thread.start()
                threads.append(thread)
                print("thread started")
            else:
                print("No connection within 100 second period")

        except BlockingIOError:
            print("No connection ready")
        except Exception as e:
            print("error", str(e))
            
            try:
                c.close()
                print(f"Connection from {addr} closed due to error.")
            except Exception as e_close:
                print(f"Failed to close connection after error: {str(e_close)}")

I'm logging the output of the server, and when I checked after seeing it freeze, every "new connection started" log line has a matching "connection ended" line, so from what I can tell there should be no open connections; print(f"{len(threads)} active connections") also reports 0 active threads. But when I open the Windows Resource Monitor there are 50+ TCP connections held open by Python, even for (ip, port) pairs that were logged as "ended" by the server hours ago, and I don't understand why they are still there.

Update: I was a little mistaken. After adding some logging to my code, it appears the open connections in the Resource Monitor were never accepted by my server. They are definitely coming from my devices, but I'm unsure how to close or free them if I never know they exist in the first place. With `netstat -an` I can see they are all stuck in the CLOSE_WAIT state. Is there a way to force Windows to clean up connections that have been stuck in this state for more than 5 minutes?
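One likely mechanism, sketched below under the assumption that the CLOSE_WAIT sockets are stalled TLS handshakes: when the listening socket itself is wrapped with wrap_socket, accept() performs the TLS handshake, so a client that opens a TCP connection but never completes the handshake produces a connection the application code never sees. A common alternative pattern is to accept on a plain listening socket and wrap each accepted connection individually, with a timeout bounding the handshake (the names `handle` and `serve` are illustrative, not from the question):

```python
import socket
import ssl
from threading import Thread

def handle(tls_conn, addr):
    # Close unconditionally so a failed or hung client can't leak a socket.
    try:
        tls_conn.settimeout(10)
        # ... per-connection protocol here ...
    finally:
        tls_conn.close()

def serve(listener, ssl_context):
    while True:
        plain_conn, addr = listener.accept()
        plain_conn.settimeout(10)  # bounds the TLS handshake below
        try:
            tls_conn = ssl_context.wrap_socket(plain_conn, server_side=True)
        except (ssl.SSLError, OSError):  # socket.timeout is an OSError subclass
            plain_conn.close()  # handshake stalled or failed: close explicitly
            continue
        Thread(target=handle, args=(tls_conn, addr), daemon=True).start()
```

With this shape, a connection that never finishes its handshake is closed by the server after 10 seconds rather than sitting invisibly in CLOSE_WAIT.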

  • Just a suggestion: check to see if it's due to a simple port scan. Commented Sep 27, 2024 at 16:45
  • Hi, I've updated the question a bit but I believe it is not due to a portscan because the ip addresses match my devices and previously successful connections. Commented Oct 1, 2024 at 10:39
  • I didn't say it was someone else's connections. Normal browsers send 2 TCP SYNs each, so I suggest you scan the ports yourself to see if the number of open connections increases. Commented Oct 1, 2024 at 16:51
  • What is the purpose of joining the threads in the list with timeout of 0? Isn't that a noop? Commented Oct 2, 2024 at 4:49
  • @viilpe but they are not using a browser to connect. I did a port scan using nmap and didn't see any change in the number of hung connections; the connection is closed via the last try block. Commented Oct 2, 2024 at 9:23

3 Answers


Could not reproduce this but here are some hints:

  1. The way this server is implemented, it is not guaranteed that connections get closed under all circumstances in the connection function.

     Close connections in a finally block to make sure they are closed under any circumstances. Alternatively, consider using the with statement.

  2. Implementing a robust and secure SSL server is tricky. If possible, consider using an existing server such as gunicorn.

1 Comment

Hi, yes, I should definitely end the function with a finally block; I'll put that in, but I doubt it will solve the issue. Unfortunately I cannot switch to an existing server for now, but I might look into that later.

When a TCP connection is closed, it goes into the TIME_WAIT state to ensure all packets have been properly received. If you have many connections in this state, it can exhaust the available ports. You can reduce the TIME_WAIT timeout period or increase the number of available ephemeral ports. Also check whether NAT gateways and load balancers are contributing to the problem; network appliances sometimes have their own timeout settings that can affect connections.

You can also adjust the maximum number of TCP connections allowed by modifying registry settings on your Windows server, e.g. parameters like MaxUserPort and TcpTimedWaitDelay.

But I recommend you implement TCP keep-alive to ensure that idle connections are closed. This can help in identifying and closing stale connections; see "Implementing long-running TCP Connections within VPC networking".
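The keep-alive suggestion can be sketched in Python as follows. SO_KEEPALIVE itself is portable, but the fine-tuning constants (TCP_KEEPIDLE and friends) are platform-specific, hence the hasattr guards; on Windows the idle/interval values are instead controlled via registry settings or WSAIoctl, so treat the tuning part as a Linux-flavoured assumption:

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    # Turn on keep-alive probes so a dead peer is eventually detected
    # and the connection is torn down instead of lingering forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Fine-tuning options: available on Linux, absent or different elsewhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

# Usage after accept():
#   c, addr = s.accept()
#   enable_keepalive(c)
```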



I've managed to fix the issue. The fix seems to have been to use socket.setdefaulttimeout(10); I'm not sure why this works when s.settimeout(10) does not, but the server has now been running for 6 days without issues (it used to run for about 8-12 hours before halting), and there are now 0 connections stuck in the CLOSE_WAIT state.
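A plausible explanation, offered as an assumption rather than something verified in the thread: settimeout() on the listening socket does not carry over to the socket objects accept() creates, whereas socket.setdefaulttimeout() applies to every newly created socket. Because the listener here is SSL-wrapped, accept() also runs the TLS handshake on that new socket, so a global default timeout bounds the handshake instead of letting it hang:

```python
import socket

# Every socket created from now on starts with a 10 s timeout,
# including the sockets that accept() returns.
socket.setdefaulttimeout(10)

s = socket.socket()  # inherits the default timeout automatically
# With ssl_context.wrap_socket(s, server_side=True) and accept() as in the
# question, a client that stalls mid-handshake now raises a timeout during
# accept() instead of hanging the server and leaving CLOSE_WAIT sockets.
```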

