lslb - ExtConn timed out while connecting.

Discussion in 'General' started by Clockwork, Aug 11, 2010.

  1. Clockwork

    Clockwork Member

    My lslb error.log is getting flooded by the following notice:

    the site is working fine, but what does this message mean?

    edit:

    debug stuff:

    edit:

    sometimes I'm getting the following warning:

    The problem started after we moved our database server's connection to a gigabit port, but this change didn't affect the load balancer or web servers; it just improved the page load speed.

    Btw, we occasionally saw the "ExtConn timed out while connecting." notice before, but not nearly as often as now.

    edit:

    nginx seems to load-balance without any problems, so this looks like an lslb problem.
    Last edited: Aug 11, 2010
  2. mistwang

    mistwang LiteSpeed Staff

    Please run "telnet 192.168.0.4 80" from the command line several times and see whether connecting to the target server is sometimes slow.
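Editor's note: where telnet is not available, the same repeated-connect check can be approximated with a short script. This is only a minimal sketch; the function names `time_connect`, `probe` and `summarize` are made up for illustration, and the 10-second default timeout is an arbitrary choice for the sketch.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections and timeouts
        return None


def probe(host, port, attempts=20, timeout=10.0, pause=0.0):
    """Connect repeatedly and collect one timing (or None) per attempt."""
    results = []
    for _ in range(attempts):
        results.append(time_connect(host, port, timeout))
        time.sleep(pause)
    return results


def summarize(results):
    """Count failures and report the slowest successful connect."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failed": len(results) - len(ok),
        "slowest": max(ok) if ok else None,
    }
```

Run from the load balancer, `summarize(probe("192.168.0.4", 80, pause=0.5))` would show whether any attempt failed or took unusually long to connect.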
  3. GaryT

    GaryT New Member

    edit: wrong section
    Last edited: Aug 11, 2010
  4. Clockwork

    Clockwork Member

    I don't have telnet installed, but I tried it with nmap and nc; no problems so far.
  5. Clockwork

    Clockwork Member

    Oops, could someone move this topic to the load balancer forum? My mistake.

    I've switched to nginx until there is a solution, since lslb isn't running stably at the moment. I hope you guys can help us fix this problem; lslb is our DDoS protection and performs way better than nginx.
  6. mistwang

    mistwang LiteSpeed Staff

    No problem, moved.
    Did you specify the source IP when you configured each node?
    It looks like lslb has problems connecting to all backend servers. Could it be a problem with a NIC port or switch port? If you use a dedicated connection to communicate with the backend servers, you can check the packet loss on that specific NIC.
    LSLB uses persistent connections while nginx does not, so there could be more ESTABLISHED connections with LSLB. Is there a firewall between LSLB and the web servers?

    If you do think it is an LSLB bug, could you strace lslbd while the problem is happening to help analyze the cause?
  7. Clockwork

    Clockwork Member

    clusterHTTP config:
    <nodeAddresses>(s1)127.0.0.1->192.168.0.3, (s2)127.0.0.1->192.168.0.4, (s4)127.0.0.1->192.168.0.5</nodeAddresses>

    clusterStatic config:
    <nodeAddresses>(s3)127.0.0.1->192.168.0.1:81</nodeAddresses>

    I'll ask my provider if he could check the ports.

    --- 192.168.0.3 ping statistics ---
    272 packets transmitted, 263 received, 3% packet loss, time 271515ms
    rtt min/avg/max/mdev = 0.110/1.030/10.585/1.856 ms

    I've disabled persistent connections in both clusters that I use.

    Nope, there is no firewall between LSLB and the web servers.

    I'll do the strace, but first I need to read some strace how-tos :p
  8. mistwang

    mistwang LiteSpeed Staff

    I think the problem is the source IP: you should use a 192.168.0.x IP assigned to that server, or not set a source IP at all.
  9. Clockwork

    Clockwork Member

    I've tried both, same problem.
  10. mistwang

    mistwang LiteSpeed Staff

    3% packet loss for a LAN environment is extremely high.
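Editor's note: ping prints the loss as a whole percentage, so the exact figure can be recomputed from the packet counts in the summary line quoted earlier in the thread. A minimal sketch, assuming the Linux iputils summary format shown there; `packet_loss_percent` is a hypothetical helper name.

```python
import re

# Matches e.g. "272 packets transmitted, 263 received, ..."
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(summary_line):
    """Compute the loss percentage from the counts in a ping summary line."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets transmitted")
    return 100.0 * (sent - received) / sent
```

For the 192.168.0.3 statistics above (272 transmitted, 263 received), this works out to 9 lost packets, or roughly 3.3%.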
  11. Clockwork

    Clockwork Member

    There only seems to be packet loss when the backend servers are busy. Currently there is no packet loss, but the problem still occurs.

    edit:

    I've tried httping on some backend servers from the loadbalancer, sometimes there are slow replies:

    could that be the problem?

    httping sends HEAD requests to the target.
    Last edited: Aug 13, 2010
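Editor's note: a check along the lines of httping, timing one HEAD request per call, can be sketched with the Python standard library. This is not how httping itself is implemented; `time_head` and `slow_replies` are hypothetical names, and the 3-second threshold is only an example value.

```python
import http.client
import time


def time_head(host, port=80, path="/", timeout=10.0):
    """Send one HEAD request; return (status, seconds), status None on error."""
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        status = None
    finally:
        conn.close()
    return status, time.monotonic() - start


def slow_replies(samples, threshold=3.0):
    """Keep only the (status, seconds) samples that failed or were slow."""
    return [s for s in samples if s[0] is None or s[1] >= threshold]
```

Collecting a few dozen samples per backend from the load balancer and passing them to `slow_replies` would show how often a reply is slow and by how much.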
  12. mistwang

    mistwang LiteSpeed Staff

    Yes, that will cause problems. LSLB won't wait more than 10 seconds when trying to establish a connection with a backend.

    Maybe you should use a dedicated NIC for each task: one for MySQL, one for communication with LSLB. The same goes for LSLB: one NIC for backend communication, one NIC for frontend communication.
  13. Clockwork

    Clockwork Member

    I'm not sure this is the problem: we see no 10-second delay on the website, and httping never takes more than 3 seconds.

    Also, lslb was fine for months until we connected the database server to a 1 Gbit port, but there must be something else going on, because we haven't touched our load balancer or any of the backend web servers.

    We use 3 webservers, 1 loadbalancer and 1 database server.
  14. mistwang

    mistwang LiteSpeed Staff

    Since it gets worse when the servers are busy, my guess is that after the Gb port upgrade for the DB server, the burst bandwidth usage is higher and causes some packets to be dropped.
    Maybe there is some kind of bandwidth cap on the switch port, if the servers are interconnected via a VLAN instead of a dedicated switch.
  15. Clockwork

    Clockwork Member

    They've analyzed the NICs and switches and haven't found any problem. They don't use any VLAN; everything is connected directly to the same switch.

    I've sent the strace log to LiteSpeed's bug email.
    Last edited: Aug 21, 2010
  16. mistwang

    mistwang LiteSpeed Staff

    Please regenerate the strace output with the options "-tt -T"; we need timestamps to figure this out.
    The strace output shows some errors, but we need timestamps in the output to analyze it further.

    Are you still getting packet loss? It is a big problem if you get packet loss over a LAN connection.
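Editor's note: with "-tt", strace prefixes each line with a wall-clock time including microseconds, and with "-T" it appends the time spent in the call in angle brackets. A small filter can then pull out the slow calls. This is a rough sketch; the regular expression encodes an assumed line layout and `slow_calls` is a hypothetical helper, so it may need adjusting to the actual log.

```python
import re

# Assumed layout of a `strace -tt -T` line, optionally with a pid prefix:
#   HH:MM:SS.uuuuuu syscall(args...) = result <seconds>
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"          # optional pid prefix
    r"(?P<clock>\d{2}:\d{2}:\d{2}\.\d+)\s+"   # wall-clock time from -tt
    r"(?P<syscall>\w+)\(.*"                   # syscall name and arguments
    r"<(?P<elapsed>\d+\.\d+)>\s*$"            # time in call from -T
)


def slow_calls(lines, threshold=1.0, syscalls=None):
    """Return (clock, syscall, seconds) for calls that took >= threshold."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # signals, resumed calls and other non-matching lines
        name = match.group("syscall")
        if syscalls is not None and name not in syscalls:
            continue
        seconds = float(match.group("elapsed"))
        if seconds >= threshold:
            found.append((match.group("clock"), name, seconds))
    return found
```

Passing an open log file and something like `syscalls={"connect", "poll"}` narrows the output to calls worth a closer look; long waits in polling calls are normal for an idle server, so the results still need reading in context.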
  17. Clockwork

    Clockwork Member

    It seems there is still occasional packet loss; I've contacted my provider again.

    I've sent another strace to the bug email.
  18. mistwang

    mistwang LiteSpeed Staff

    Are you using the 32-bit binary on a 64-bit OS? I wonder if it is an issue with the 32-bit/64-bit compatibility layer.
    Last edited: Aug 23, 2010
  19. Clockwork

    Clockwork Member

    It's the 64-bit one. My provider is going to replace the switch we use with a new one tomorrow; let's hope that fixes the packet loss.
  20. mistwang

    mistwang LiteSpeed Staff

    Please download the 1.7 release and update again; I am not sure whether something is wrong with how the getsockopt() results are interpreted.
    Also, strace needs to be updated to the latest release to show the correct result value for getsockopt() on 64-bit Linux.
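Editor's note: for readers unfamiliar with why getsockopt() matters here, a non-blocking connect is commonly completed by waiting for the socket to become writable and then reading SO_ERROR. The sketch below shows that general POSIX pattern on Linux; it is not LSLB's code, and `nonblocking_connect` is a hypothetical name.

```python
import errno
import select
import socket


def nonblocking_connect(host, port, timeout=10.0):
    """Start a non-blocking connect and return the final errno (0 = success)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc  # failed immediately
        # The connect attempt has finished once the socket is writable.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT  # our own deadline expired
        # SO_ERROR holds the outcome: 0 on success, otherwise an errno value.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

If the value read from SO_ERROR were misread, a successful connect could be treated as a failure or the other way round, which is the kind of issue the post above is trying to rule out.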
