Unusually high CPU usage by lshttpd

#1
I've been using LiteSpeed for almost a year now on a Rails/Mongrel based site, but in the last few weeks I've been noticing CPU load spikes. Up until today I thought it was due to changes in the Ruby app, but watching 'top' today I noticed that the lshttpd process occasionally spikes up to 40% of the CPU for a few minutes at a time. I can't tell what it's doing (the access/error/Rails/debugging logs show little-to-no activity).

Today (a Sunday, which should be quiet) the load average is spiking up to 3.5 for a few minutes at a time, then dropping back down to its usual 0.40.

Has anybody else experienced anything like this? If I siege-test the home page, LiteSpeed only takes 15-20% of the CPU, so those ~40% spikes on a quiet day seem very strange.

Thanks, Zach
 

mistwang

LiteSpeed Staff
#2
Have you upgraded to the latest release? Are you using epoll or poll?
We need more information to look into this issue; maybe you can strace the lshttpd process the next time you catch it happening.
 
#3
I'm using 2.2.6 at the moment. A friend had a rough upgrade to 3.0 (the upgrade failed and he scrubbed the directories and installed fresh), so I'm waiting a month, as we're switching hosting providers then anyway.

I'm using poll, though if it's recommended I can switch to epoll - I assume that's preferable (I'm on Fedora 4).

Thanks for the response.
 
#4
Attached is some strace output from lshttpd (also here http://www.arenaflowers.com:8080/strace_output.txt in case the attachment didn't work). Interestingly, while the trace is running it squishes lshttpd back down to about 3%. (It was at 35-40% before I started the trace, and would go back up again if I cancelled it.)

Here's the top output without the strace running:

top - 01:03:06 up 28 days, 19:45, 2 users, load average: 1.75, 1.49, 1.19
Tasks: 111 total, 5 running, 105 sleeping, 0 stopped, 1 zombie
Cpu(s): 48.0% us, 27.1% sy, 0.0% ni, 13.1% id, 0.0% wa, 0.0% hi, 11.8% si
Mem: 2074912k total, 1924328k used, 150584k free, 39180k buffers
Swap: 2031608k total, 92k used, 2031516k free, 1052560k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19968 rails 16 0 97316 87m 4692 R 60.3 4.3 216:54.26 ruby /usr/lib/ruby/gems/1.8/gems/mongrel-1.0.1/bin/mongrel_rails start -d -e pro
19970 rails 16 0 67968 58m 4668 R 58.3 2.9 186:58.65 ruby /usr/lib/ruby/gems/1.8/gems/mongrel-1.0.1/bin/mongrel_rails start -d -e pro
27995 rails 15 0 6368 3760 1516 R 51.4 0.2 19:58.76 lshttpd
12954 postfix 15 0 7024 2432 2028 S 0.7 0.1 0:08.28 smtpd -n arenaflowers.com:25 -t inet -u -o smtpd_sasl_auth_enable yes
31287 root 15 0 178m 166m 2276 S 0.3 8.2 3:09.93 backgroundrb



I'm still not convinced it's a LiteSpeed problem, so I'll try tracing the mongrels as well and see if I spot anything. I noticed that when I restarted the mongrel everything dropped back to normal, though I should do that a few more times to make sure it isn't just a coincidence.
 
#5
What seems to be happening is some sort of loop between Mongrel and LiteSpeed. Using strace I can see repetitions of the same request over and over, for example

writev(4, [{"GET /flower_delivery/florists_in"..., 441}, {"X-Forwarded-For: 74.124.192.171\r"..., 35}], 2) = 476

in LiteSpeed,

and a similar line showing the read in Mongrel:

read(218, "GET /flower_delivery/florists_in"..., 16384) = 476

The repetition goes on much longer than it would take to transfer the file (minutes), and if I gracefully restart LiteSpeed (or Mongrel) everything returns to normal. There's nothing wrong in the Ruby app (i.e. clicking on the same link to test doesn't cause a problem), and it happens on links in various parts of the site, all of which use caches_page.

Is the bad file descriptor the issue? See below:


Full LiteSpeed loop block:

gettimeofday({1178219476, 926735}, NULL) = 0
select(216, [3 215], [], [], {0, 619176}) = 1 (in [3], left {0, 619176})
accept(3, {sa_family=AF_INET, sin_port=htons(36453), sin_addr=inet_addr("127.0.0.1")}, [16]) = 218
fcntl64(218, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(218, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3c000
_llseek(218, 0, 0xbfe86bbc, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl64(218, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(218, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3b000
_llseek(218, 0, 0xbfe86bbc, SEEK_CUR) = -1 ESPIPE (Illegal seek)
setsockopt(218, SOL_TCP, TCP_CORK, [1], 4) = 0
gettimeofday({1178219476, 930155}, NULL) = 0
select(219, [215 218], [], [], {0, 0}) = 1 (in [218], left {0, 0})
gettimeofday({1178219476, 930767}, NULL) = 0
gettimeofday({1178219476, 931024}, NULL) = 0
select(216, [3 215], [], [], {0, 0}) = 0 (Timeout)
read(218, "GET /flower_delivery/florists_in"..., 16384) = 476
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(218) = 0
munmap(0xb7f3b000, 4096) = 0
close(218) = -1 EBADF (Bad file descriptor)
munmap(0xb7f3c000, 4096) = 0


Full Mongrel block:

connect(4, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=4, u64=4}}) = 0
time(NULL) = 1178219495
gettimeofday({1178219495, 538716}, NULL) = 0
epoll_wait(6, {{EPOLLOUT, {u32=4, u64=4}}}, 341, 100) = 1
getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(4, [{"GET /flower_delivery/florists_in"..., 441}, {"X-Forwarded-For: 74.124.192.171\r"..., 35}], 2) = 476
epoll_ctl(6, EPOLL_CTL_MOD, 4, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=4, u64=4}}) = 0
time(NULL) = 1178219495
gettimeofday({1178219495, 540226}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=4, u64=4}}}, 341, 100) = 1
read(4, "", 1460) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 4, {0, {u32=4, u64=4}}) = 0
close(4) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
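
For what it's worth, here's a tiny self-contained Ruby sketch (an illustration only, not Mongrel's actual code) of the kind of double-close pattern behind the close(...) = -1 EBADF lines above: two IO objects end up wrapping the same file descriptor, so the second close(2) hits an fd that is already gone.

require 'socket'

# Hypothetical illustration of the "close(fd) = -1 EBADF" seen in the traces.
r, w = IO.pipe
shadow = IO.new(w.fileno, "w")   # a second Ruby IO wrapping the same fd

w.close                          # close(fd) = 0
begin
  shadow.close                   # close(2) on the stale fd returns EBADF,
rescue SystemCallError, IOError => e  # which Ruby may surface as an exception
  puts "second close failed: #{e.class}"
end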
 

mistwang

LiteSpeed Staff
#6
It looks like a Mongrel instance has gone bad: it closes the connection without sending anything back, and LiteSpeed keeps retrying the bad Mongrel, so the end result is that both lshttpd and mongrel use a lot of CPU.
Please try the latest 3.1 release; it should be as stable as 2.2.6, if not better. If the problem is still there, we will investigate more.
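
Roughly speaking (this is an illustration only, not LiteSpeed's actual code), a proxy facing a backend that always closes without replying degenerates into a tight loop that never blocks on I/O:

require 'socket'

# Illustration only: a naive proxy-side forwarder. If the backend always
# closes the connection without writing a response, read returns "" at
# once, the retry happens immediately, and the loop spins on the CPU.
def forward_request(host, port, raw_request)
  loop do
    sock = TCPSocket.new(host, port)
    sock.write(raw_request)
    reply = sock.read                 # "" when the peer closes first
    sock.close
    return reply unless reply.empty?  # empty reply => try again right away
  end
end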
 

andreas

Well-Known Member
#7
I have exactly the same problem with 3.2.2. A few times per day the CPU usage of lsws and mongrel explodes for no apparent reason:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6985 www-data 16 0 33616 30m 2272 R 52.5 1.5 9:25.64 mongrel_rails
18228 www-data 15 0 10196 7772 2152 S 30.9 0.4 11:35.12 lshttpd

lsws is set up as a proxy for mongrel, which is serving static files (only a few per second). There is no excessive traffic to the app when it happens, and I can't reproduce the problem by hammering the server with requests. I think the problem started with 3.1, but I'm not sure.

A strace shows that lsws seems to be sending the same request to mongrel again and again:
writev(14, [{"GET /math/722713bae093b63c9c5fd9"..., 705},...
...
read(12, "", 1460) = 0
close(12) = 0
 
#8
It was 2.2.6 that had the problem for me, and the problem went away with 3.1.1, but this was part of a migration to a completely different setup (two boxes, fresh installs of all packages, etc.), so it could easily have been something else. Sorry I haven't got more to go on :(
 
#9
After 3-4 months on new servers, this problem has returned. I'm pretty sure it has nothing to do with the app's code (it's definitely between Mongrel and LiteSpeed). While the problem is happening I can view the cached page in question fine in a browser, and all the mongrels seem to behave fine when pinged directly (or through LSWS).

I think it might possibly be related to Google crawling and hitting pages at a certain rate (the pages that cause the problem are always SEO-related pages). It happens at any time of day or night, so it doesn't seem resource related.

I'm using LSWS 3.2.4 and Mongrel 1.0.1, so the upgrade hasn't helped. Any ideas on which settings I could tweak (max keep-alives etc.)? I'm using epoll.
 

mistwang

LiteSpeed Staff
#10
Strace output from the process with high CPU utilization should help identify the problem. If LSWS keeps sending the same request to Mongrel without getting any valid response back, it is probably an issue with Mongrel. Have you tried restarting the mongrels when it happens?

The better solution is probably to switch to our Ruby LSAPI. :)
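
For a Rails app that means running the app through a dispatch.lsapi loop instead of proxying to Mongrel; roughly like the sketch below (a sketch only, with file paths assumed - check the ruby-lsapi documentation for the exact setup):

#!/usr/bin/env ruby
# Rough sketch of a dispatch.lsapi for a Rails app of this era; the
# environment path is an assumption about the app's layout.
require File.dirname(__FILE__) + '/../config/environment' unless defined?(RAILS_ROOT)
require 'lsapi'
require 'dispatcher'

# Accept LSAPI requests in a loop and hand each one to Rails' dispatcher.
while LSAPI.accept != nil
  Dispatcher.dispatch
end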
 
#11
Hi mistwang,

There's output in my previous post (#5); it hasn't changed much since the problem occurred in May (but see the attachment and below). I can't tell if Mongrel is sending a valid response back. Restarting Mongrel or LSWS while the problem is occurring stops it, but it'll happen again five minutes later.


Look for 'florists_in_rg21_hampshire_england_uk' or 'florists_in_ellisfield_hampshire_england_uk' in here.

Mongrel is closing the connection with EBADF; LiteSpeed doesn't show any errors that I can see, but it is requesting the file a lot.

LiteSpeed:

[root@web rails]# strace -p 16899 -s 124
Process 16899 attached - interrupt to quit
gettimeofday({1193683853, 953263}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=17, u64=17}}}, 1365, 100) = 1
read(17, "", 1460) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 17, {0, {u32=17, u64=17}}) = 0
close(17) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 17
fcntl64(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(17, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
fcntl64(17, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 956105}, NULL) = 0
epoll_wait(6, {{EPOLLOUT, {u32=17, u64=17}}}, 1365, 100) = 1
getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(17, [{"GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 436}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 471
epoll_ctl(6, EPOLL_CTL_MOD, 17, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 956694}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=5, u64=5}}}, 1365, 100) = 1
read(5, "", 1460) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 5, {0, {u32=5, u64=5}}) = 0
close(5) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 959090}, NULL) = 0
epoll_wait(6, {{EPOLLOUT, {u32=5, u64=5}}}, 1365, 100) = 1
getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(5, [{"GET /flower_delivery/florists_in_ellisfield_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, im"..., 442}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 477
epoll_ctl(6, EPOLL_CTL_MOD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 959682}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=17, u64=17}}}, 1365, 100) = 1
read(17, "", 1460) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 17, {0, {u32=17, u64=17}}) = 0
close(17) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 17
fcntl64(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(17, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
fcntl64(17, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 961588}, NULL) = 0
epoll_wait(6, {{EPOLLOUT, {u32=17, u64=17}}}, 1365, 100) = 1
getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
writev(17, [{"GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 436}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 471
epoll_ctl(6, EPOLL_CTL_MOD, 17, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 962173}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=5, u64=5}}}, 1365, 100) = 1
read(5, "", 1460) = 0
epoll_ctl(6, EPOLL_CTL_DEL, 5, {0, {u32=5, u64=5}}) = 0
close(5) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
time(NULL) = 1193683853
gettimeofday({1193683853, 965793}, NULL) = 0
epoll_wait(6, {{EPOLLOUT, {u32=5, u64=5}}}, 1365, 100) = 1
getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0






Mongrel:

[root@apps rails]# strace -s 124 -p 6090
Process 6090 attached - interrupt to quit
gettimeofday({1193683852, 132459}, NULL) = 0
gettimeofday({1193683852, 132528}, NULL) = 0
select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})
read(5, "GET /flower_delivery/florists_in_ellisfield_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, im"..., 16384) = 477
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(5) = 0
munmap(0xb62dd000, 4096) = 0
close(5) = -1 EBADF (Bad file descriptor)
munmap(0xb62de000, 4096) = 0
gettimeofday({1193683852, 133107}, NULL) = 0
select(34, [33], [], [], {0, 0}) = 0 (Timeout)
accept(3, {sa_family=AF_INET, sin_port=htons(44429), sin_addr=inet_addr("10.245.11.68")}, [16]) = 5
fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62de000
_llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62dd000
_llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
setsockopt(5, SOL_TCP, TCP_CORK, [1], 4) = 0
gettimeofday({1193683852, 133837}, NULL) = 0
select(34, [5 33], [], [], {0, 0}) = 1 (in [5], left {0, 0})
gettimeofday({1193683852, 133988}, NULL) = 0
gettimeofday({1193683852, 134030}, NULL) = 0
select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})
read(5, "GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 16384) = 471
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(5) = 0
munmap(0xb62dd000, 4096) = 0
close(5) = -1 EBADF (Bad file descriptor)
munmap(0xb62de000, 4096) = 0
gettimeofday({1193683852, 134521}, NULL) = 0
select(34, [33], [], [], {0, 0}) = 0 (Timeout)
accept(3, {sa_family=AF_INET, sin_port=htons(44430), sin_addr=inet_addr("10.245.11.68")}, [16]) = 5
fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62de000
_llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62dd000
_llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
setsockopt(5, SOL_TCP, TCP_CORK, [1], 4) = 0
gettimeofday({1193683852, 135199}, NULL) = 0
select(34, [5 33], [], [], {0, 0}) = 1 (in [5], left {0, 0})
gettimeofday({1193683852, 135346}, NULL) = 0
gettimeofday({1193683852, 135388}, NULL) = 0
select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})
 


mistwang

LiteSpeed Staff
#12
The problem is on the Mongrel side. As shown in the strace output, file handle 5 is the reverse-proxy connection from LSWS to Mongrel. Mongrel reads the request, then closes the connection immediately without sending anything back, then tries to close it again and gets EBADF, because the file descriptor has already been closed.

When Mongrel is working properly, it should send the reply back to LSWS before closing the socket.

The root cause of the problem is on the Mongrel side; however, LSWS should fail the request after a few retries. We will implement that in our 3.3 release.
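
For reference, a minimal sketch (assumptions only, not Mongrel's actual code) of what a healthy backend on the proxied port should be doing: read the request, write some HTTP response, then close once. The broken behaviour in the traces above is the same loop with the write step missing.

require 'socket'

# Minimal stand-in backend on the proxied port (8021 in the traces above).
# A healthy backend answers before closing; the failing Mongrel in this
# thread skips the write and just closes.
server = TCPServer.new('127.0.0.1', 8021)
loop do
  client  = server.accept
  request = client.readpartial(16_384)       # mirrors read(fd, "GET ...", 16384)
  body    = "backend alive\n"
  client.write("HTTP/1.1 200 OK\r\n" \
               "Content-Length: #{body.bytesize}\r\n" \
               "Connection: close\r\n\r\n" + body)
  client.close                               # close exactly once, after replying
end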
 

mistwang

LiteSpeed Staff
#14
The 3.3 release package is available on our server; just change the version number to 3.3 in the download link. Please give it a try if you want to.
 