lslb - ExtConn timed out while connecting.

Clockwork

Well-Known Member
#1
My lslb error.log is getting flooded by the following notice:

2010-08-11 15:10:42.000 NOTICE [xxx] ExtConn timed out while connecting.
The site is working fine, but what does this message mean?

edit:

debug stuff:

2010-08-11 18:02:55.000 NOTICE [ip:52299-21#sitename:loadbalancer] ExtConn timed out while connecting.
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] connection to [192.168.0.3:80] on request #1, error: Connection timed out!
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] [ExtConn] close()
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] HttpExtConnector::tryRecover()...
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] trying to recover from connection problem, attempt: #1!
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] Get SESSION_ID from COOKIE: [hash].
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] Found worker [clusterHTTP_s2] by strategy [0].
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] [LB] retry worker: [clusterHTTP_s2]
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] trying to recover from connection problem, attempt: #1!
2010-08-11 18:02:55.000 DEBUG [192.168.0.4:80] connection available!
2010-08-11 18:02:55.000 DEBUG [192.168.0.4:80] request [ip:52299-21#sitename:loadbalancer] is assigned with connection!
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] [ExtConn] reconnect()
2010-08-11 18:02:55.000 DEBUG [ip:52299-21#sitename:loadbalancer] [ExtConn] connecting to [192.168.0.4:80]...
edit:

sometimes I'm getting the following warning:

2010-08-11 19:09:05.000 NOTICE [clusterHTTP_s4] PingConn timed out while connecting.
2010-08-11 19:09:05.000 WARN [192.168.0.5:80] Failure detected: Connection Failure, 110:Connection timed out
2010-08-11 19:09:05.000 NOTICE [clusterHTTP_s2] PingConn timed out while connecting.
2010-08-11 19:09:05.000 WARN [192.168.0.4:80] Failure detected: Connection Failure, 110:Connection timed out
2010-08-11 19:09:05.899 INFO [192.168.0.4:80] Fail all outstanding requests!
2010-08-11 19:09:05.899 INFO [192.168.0.4:80] Fail all outstanding requests!
2010-08-11 19:09:06.000 NOTICE [ip:1612-0#sitename] ExtConn timed out while connecting.
2010-08-11 19:09:06.000 INFO [192.168.0.5:80] Fail all outstanding requests!
The problem started after we moved the connection to our database server to a gigabit port, but this change didn't affect the loadbalancer or the webservers; it just improved the page load speed.

Btw, we sometimes had the "ExtConn timed out while connecting." notice before, but not as often as now.
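
To see whether the notices really became more frequent, the log can be tallied per minute. The following is only a rough sketch: it assumes every line follows the `date time LEVEL [tag] message` layout of the excerpts quoted above, and `count_timeouts` is a made-up helper name, not part of lslb.

```python
import re
from collections import Counter

# Layout assumed from the quoted excerpts:
# "2010-08-11 19:09:05.000 NOTICE [clusterHTTP_s4] PingConn timed out while connecting."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} (\w+) \[([^\]]+)\] (.*)$"
)


def count_timeouts(lines):
    """Count 'timed out while connecting' notices per minute and per kind."""
    per_minute = Counter()
    per_kind = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        minute, _level, _tag, message = match.groups()
        if "timed out while connecting" in message:
            per_minute[minute] += 1
            # First word of the message is the connection kind (ExtConn, PingConn).
            per_kind[message.split()[0]] += 1
    return per_minute, per_kind
```

Feeding it an open log file, e.g. `count_timeouts(open(path))` with `path` pointing at wherever error.log lives on your system, gives two counters that can be compared for the days before and after the port change.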

edit:

nginx seems to load-balance without any problems, so this seems to be an lslb problem.
 

mistwang

LiteSpeed Staff
#2
Please run the command "telnet 192.168.0.4 80" from the command line several times and see whether you sometimes get a long delay when connecting to the target server.
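
If repeating telnet by hand gets tedious, the same check can be scripted. This is a minimal sketch, not an official tool: `time_connect` and `probe` are names made up here, and the 10 second default timeout is just a generous ceiling for the test.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def probe(host, port, attempts=20, pause=0.5, slow=1.0):
    """Connect repeatedly and print every attempt that failed or was slow."""
    results = []
    for attempt in range(1, attempts + 1):
        elapsed = time_connect(host, port)
        results.append(elapsed)
        if elapsed is None:
            print(f"attempt {attempt}: connect failed")
        elif elapsed >= slow:
            print(f"attempt {attempt}: slow connect, {elapsed:.3f} s")
        time.sleep(pause)
    return results
```

Run from the loadbalancer, `probe("192.168.0.4", 80)` should return only small values on a healthy LAN; any `None` or multi-second entry points at the same problem lslb is logging.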
 

Clockwork

Well-Known Member
#5
Oops, could someone move this topic to the loadbalancer forum? My mistake.

I've switched to nginx until there is a solution, since lslb doesn't run stably at the moment. I hope you guys can help us fix this problem: lslb is our DDoS protection and performs way better than nginx.
 

mistwang

LiteSpeed Staff
#6
No problem, moved.
Did you specify the source IP when you configured each node?
It looks like lslb has problems connecting to all backend servers. Could it be a problem with a NIC port or a switch port? If you use a dedicated connection to communicate with the backend servers, you can check the packet loss of that specific NIC.
LSLB uses persistent connections while nginx does not, so there could be more ESTABLISHED connections with LSLB. Is there a firewall between LSLB and the web servers?

If you do think it is a LSLB bug, could you strace lslbd while the problem is happening to help analyze the cause of the problem?
 

Clockwork

Well-Known Member
#7
clusterHTTP config:
<nodeAddresses>(s1)127.0.0.1->192.168.0.3, (s2)127.0.0.1->192.168.0.4, (s4)127.0.0.1->192.168.0.5</nodeAddresses>

clusterStatic config:
<nodeAddresses>(s3)127.0.0.1->192.168.0.1:81</nodeAddresses>

could it be a problem with NIC port, switch port?
I'll ask my provider if he could check the ports.

you can check the packet loss of that specific NIC
--- 192.168.0.3 ping statistics ---
272 packets transmitted, 263 received, 3% packet loss, time 271515ms
rtt min/avg/max/mdev = 0.110/1.030/10.585/1.856 ms

LSLB uses persistent connections
I've disabled persistent connections in both clusters that I use.

Is there a firewall between LSLB and web servers?
nope

If you do think it is a LSLB bug, could you strace lslbd while the problem is happening to help analyze the cause of the problem?
I will, but first I need to read some strace howtos :p
 

mistwang

LiteSpeed Staff
#10
--- 192.168.0.3 ping statistics ---
272 packets transmitted, 263 received, 3% packet loss, time 271515ms
rtt min/avg/max/mdev = 0.110/1.030/10.585/1.856 ms
3% packet loss for a LAN environment is extremely high.
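
For reference, the figure comes from 9 lost probes out of 272, which is about 3.3%. A small sketch for pulling that out of a ping summary line, assuming the Linux ping wording shown above (`packet_loss` is a name made up for this example):

```python
import re

# Matches the summary line printed by Linux ping, as quoted above.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def packet_loss(summary):
    """Return (sent, received, loss_percent) parsed from a ping summary."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    lost = sent - received
    # Guard against a summary that reports zero packets sent.
    percent = (100.0 * lost / sent) if sent else 0.0
    return sent, received, percent
```

Collecting this for each backend over a busy period would show whether the loss is tied to one server or to the whole switch.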
 

Clockwork

Well-Known Member
#11
There only seems to be packet loss when the backend servers are busy. Currently there is no packet loss, but the problem still occurs.

edit:

I've tried httping on some backend servers from the loadbalancer, sometimes there are slow replies:

connected to 192.168.0.5:80, seq=31 time=2999.83 ms
Could that be the problem?

(httping sends HEAD requests to the target.)
 

mistwang

LiteSpeed Staff
#12
Yes, that will cause problems. LSLB won't wait more than 10 seconds when trying to establish a connection with a backend.

Maybe you should use a dedicated NIC for each task on the web servers: one for MySQL, one for communication with LSLB. The same goes for LSLB: one NIC for backend communication, one NIC for frontend communication.
 

Clockwork

Well-Known Member
#13
I'm not sure if this is the problem, we have no 10 second delay on the website, and it's not more than 3 seconds with httping.

Also, lslb was fine for months until we connected the database server to a 1 Gbit port, but there must be something else, because we haven't touched our loadbalancer or any of the backend webservers.

We use 3 webservers, 1 loadbalancer and 1 database server.
 

mistwang

LiteSpeed Staff
#14
Since it gets worse when the servers are busy, my guess is that after the Gb port upgrade for the DB server, the burst bandwidth usage is higher and causes some packets to be dropped.
Maybe there is some kind of bandwidth cap on the switch port, if the servers are interconnected via a VLAN instead of a dedicated switch.
 

Clockwork

Well-Known Member
#15
They've analyzed the NICs and switches and haven't found any problem. They don't use any VLAN; everything is connected directly to the same switch.

I've sent the strace log to LiteSpeed's bug email.
 

mistwang

LiteSpeed Staff
#16
Please regenerate the strace output with the options "-tt -T". The current output shows some errors, but timestamps are needed to analyze them further.

Are you still getting packet loss? It is a big problem if you get packet loss over a LAN connection.
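
Once the output carries timestamps, slow system calls can be picked out with a short script. This is only a sketch: it assumes the usual one-call-per-line layout that strace prints with "-tt -T" (wall-clock time first, duration in angle brackets last, optionally preceded by a PID), and `slow_calls` is a helper name invented here.

```python
import re

# Assumed layout of "strace -tt -T" lines, for example:
# 15:10:42.000200 poll([{fd=12, events=POLLOUT}], 1, 1000) = 0 (Timeout) <1.001012>
CALL_RE = re.compile(
    r"^(?:\d+\s+)?(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\(.*\)\s+=\s+(\S+).*<(\d+\.\d+)>\s*$"
)


def slow_calls(lines, threshold=1.0):
    """Return (timestamp, syscall, result, seconds) for calls at or over threshold."""
    found = []
    for line in lines:
        match = CALL_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # unfinished/resumed calls and signal lines are skipped
        timestamp, name, result, seconds = match.groups()
        if float(seconds) >= threshold:
            found.append((timestamp, name, result, float(seconds)))
    return found
```

Calls that sit in poll or connect for a second or more around the time of an "ExtConn timed out" notice would be the first place to look.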
 

mistwang

LiteSpeed Staff
#20
Please download the 1.7 release and update again. I am not sure whether something is wrong with how the getsockopt() results are interpreted.
Also, strace needs to be updated to the latest release to report the correct result value for getsockopt() on 64-bit Linux.
 