lslb - ExtConn timed out while connecting.

Clockwork · Aug 11, 2010

My lslb error.log is getting flooded by the following notice:

the site is working fine, but what does this message mean?

edit:

debug stuff:

edit:

sometimes I'm getting the following warning:

The problem started after we moved our database server's connection to a gigabit port, but this change didn't affect the load balancer or the web servers; it just improved the page load speed.

Btw, we occasionally had the "ExtConn timed out while connecting." notice before, but not nearly as often as now.

edit:

nginx seems to load balance without any problems, so this seems to be an lslb problem.

mistwang · Aug 11, 2010

Please try the command "telnet 192.168.0.4 80" from the command line multiple times and see if you sometimes get a long delay connecting to the target server.
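If telnet is not available, the same repeated-connect check can be scripted. The following is only a minimal sketch, assuming Python 3 on the load balancer; the helper names `time_connect`, `probe` and `summarize` are made up for illustration and are not part of any LiteSpeed tool.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe(host, port, attempts=20, timeout=10.0):
    """Connect repeatedly and return one timing (or None) per attempt."""
    return [time_connect(host, port, timeout) for _ in range(attempts)]


def summarize(timings):
    """Count failed attempts and report the slowest successful connect."""
    ok = [t for t in timings if t is not None]
    return {
        "attempts": len(timings),
        "failed": len(timings) - len(ok),
        "slowest": max(ok) if ok else None,
    }

# Example (address taken from the post above):
# print(summarize(probe("192.168.0.4", 80)))
```

A "slowest" value of several seconds, or any failed attempts, on a LAN would point at the network rather than at the load balancer.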

GaryT · Aug 11, 2010

edit: wrong section

Clockwork · Aug 11, 2010

I've no telnet installed, but I've tried it with nmap and nc, no problems so far.

Clockwork · Aug 12, 2010

Oops, could someone move this topic to the load balancer forum? My mistake.

I've switched to nginx until there is a solution; lslb isn't running stably at the moment. I hope you guys can help us fix this problem. lslb is our DDoS protection and performs way better than nginx.

mistwang · Aug 12, 2010

No problem. Moved.
Did you specify the source IP when you configured each node?
It looks like lslb has problems connecting to all backend servers. Could it be a problem with a NIC port or a switch port? If you use a dedicated connection to communicate with the backend servers, you can check the packet loss of that specific NIC.
LSLB uses persistent connections while nginx does not, so there could be more ESTABLISHED connections with LSLB. Is there a firewall between LSLB and the web servers?

If you do think it is an LSLB bug, could you strace lslbd while the problem is happening to help analyze the cause?

Clockwork · Aug 12, 2010

clusterHTTP config:
<nodeAddresses>(s1)127.0.0.1->192.168.0.3, (s2)127.0.0.1->192.168.0.4, (s4)127.0.0.1->192.168.0.5</nodeAddresses>

clusterStatic config:
<nodeAddresses>(s3)127.0.0.1->192.168.0.1:81</nodeAddresses>

I'll ask my provider if he could check the ports.

--- 192.168.0.3 ping statistics ---
272 packets transmitted, 263 received, 3% packet loss, time 271515ms
rtt min/avg/max/mdev = 0.110/1.030/10.585/1.856 ms

I've disabled persistent connections in both clusters that I use.

nope

I will, but first I need to read some strace howtos.

mistwang · Aug 12, 2010

I think the problem is the source IP. You should use a 192.168.0.x IP assigned to that server, or not use a source IP at all.
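One way to check whether a given source IP can actually reach a backend is to bind the outgoing socket to that address before connecting. This is a rough sketch assuming Python 3; `connect_from` is a hypothetical helper and only mimics what a source-IP setting generally does, not how lslb implements it.

```python
import socket


def connect_from(source_ip, host, port, timeout=10.0):
    """Connect to (host, port) from source_ip and return (local, peer) addresses.

    Raises OSError if the source address cannot be bound or the
    destination cannot be reached from it.
    """
    with socket.create_connection(
        (host, port),
        timeout=timeout,
        source_address=(source_ip, 0),  # port 0 lets the OS pick a free port
    ) as sock:
        return sock.getsockname(), sock.getpeername()

# Example (hypothetical source IP, backend address from the config above):
# print(connect_from("192.168.0.2", "192.168.0.3", 80))
```

If the call raises an error for one source address but succeeds for another, the source IP in the node configuration is a likely suspect.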

Clockwork · Aug 12, 2010

I've tried both, same problem.

mistwang · Aug 13, 2010

3% packet loss for a LAN environment is extremely high.
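For reference, the loss figure can be recomputed from the ping summary quoted earlier. The sketch below assumes Python 3 and the summary-line format shown in that post; `packet_loss` is an illustrative helper name.

```python
import re


def packet_loss(summary_line):
    """Return (lost_packets, loss_percent) from a ping summary line."""
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    lost = sent - received
    return lost, 100.0 * lost / sent


line = "272 packets transmitted, 263 received, 3% packet loss, time 271515ms"
lost, percent = packet_loss(line)
print(f"{lost} lost, {percent:.1f}% loss")
```

For the numbers in the thread that is 9 lost packets out of 272, roughly 3.3%.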

Clockwork · Aug 13, 2010

There only seems to be packet loss when the backend servers are busy. Currently there is no packet loss, but the problem still occurs.

edit:

I've tried httping on some backend servers from the loadbalancer, sometimes there are slow replies:

could that be the problem?

httping sends HEAD requests to the target.

mistwang · Aug 13, 2010

Yes, that will cause problems. LSLB won't wait more than 10 seconds when trying to establish a connection with a backend.

Maybe you should use a dedicated NIC for each task: one for MySQL, one for communication with LSLB. The same goes for LSLB: one NIC for backend communication, one NIC for frontend communication.

Clockwork · Aug 13, 2010

I'm not sure if this is the problem, we have no 10 second delay on the website, and it's not more than 3 seconds with httping.

Also, lslb was fine for months until we connected the database server to a 1 Gbit port, but there must be something else, because we haven't touched our load balancer or any of the backend web servers.

We use 3 webservers, 1 loadbalancer and 1 database server.

mistwang · Aug 13, 2010

Since it gets worse when the server is busy, my guess is that after the Gb port upgrade for the DB server, the burst bandwidth usage is higher and causes some packets to be dropped.
Maybe there is some kind of bandwidth cap on the switch port if the servers are interconnected via a VLAN instead of a dedicated switch.

Clockwork · Aug 21, 2010

They've analyzed the NICs and switches and haven't found any problem. They don't use any VLAN; everything is connected directly to the same switch.

I've sent the strace log to LiteSpeed's bug email.

mistwang · Aug 22, 2010

Please regenerate the strace output with the options "-tt -T"; we need timestamps to figure something out.
The strace output shows some errors. We need timestamps in the output to analyze them further.

Are you still getting packet loss? It is a big problem if you get packet loss over a LAN connection.
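With "-tt -T", each strace line starts with a wall-clock timestamp and ends with the time spent in the call in angle brackets. A small script can pick out the slow calls from such a log. This is only a sketch assuming Python 3 and the usual strace line layout (optional pid, timestamp, call, result, elapsed time); `parse_line` and `slow_calls` are made-up names, and unfinished or resumed lines are simply skipped.

```python
import re

# Optional "[pid N]" or bare pid prefix, then HH:MM:SS.usec, the call,
# its result, and the elapsed time added by -T.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+"
    r"(?P<result>.*?)\s+<(?P<elapsed>\d+\.\d+)>\s*$"
)


def parse_line(line):
    """Return a dict for one strace line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["elapsed"] = float(record["elapsed"])
    return record


def slow_calls(lines, threshold=1.0):
    """Yield parsed records whose elapsed time is at least threshold seconds."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["elapsed"] >= threshold:
            yield record

# Example (the log file name is an assumption):
# with open("lslbd.strace") as log:
#     for rec in slow_calls(log, threshold=1.0):
#         print(rec["time"], rec["syscall"], rec["result"], rec["elapsed"])
```

Note that a non-blocking connect() returns immediately, so a slow backend usually shows up as a long gap between timestamps or a long poll/epoll wait rather than as a slow connect() call itself.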

Clockwork · Aug 23, 2010

it seems there is still some packet loss sometimes, I've contacted my provider again.

I've sent another strace to the bug email.

mistwang · Aug 23, 2010

Are you using the 32-bit binary on a 64-bit OS? I wonder if it is an issue with the 32-bit/64-bit compatibility layer.

Clockwork · Aug 23, 2010

it's the 64bit one, my provider is going to replace the switch we use with a new one tomorrow, let's hope it will fix the packet loss.

mistwang · Aug 23, 2010

Please download the 1.7 release and update again. I am not sure whether something is wrong with interpreting the getsockopt() results.
Also, strace needs to be updated to the latest release to get the correct result value for getsockopt() on 64-bit Linux.
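For context on where getsockopt() comes in: a common pattern for connecting with a timeout is a non-blocking connect(), a wait for the socket to become writable, and then a getsockopt(SO_ERROR) call to read the outcome. The sketch below shows that general pattern in Python 3 on Linux. It is not lslb's code; `connect_with_timeout` is a hypothetical helper, and the 10-second default only mirrors the limit mentioned earlier in the thread.

```python
import errno
import select
import socket


def connect_with_timeout(ip, port, timeout=10.0):
    """Return 0 on success, otherwise an errno value (ETIMEDOUT on timeout)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((ip, port))
        if rc == 0:
            return 0  # connected immediately
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc  # immediate failure
        # Wait until the connect attempt finishes or the timeout expires.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # SO_ERROR holds the final result of the connect: 0 means success.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()

# Example (backend address from the thread):
# print(errno.errorcode.get(connect_with_timeout("192.168.0.4", 80), "OK"))
```

If the value read from SO_ERROR is misinterpreted, a successful connect can be reported as a failure or the other way around, which is why the getsockopt() result matters here.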
