unusually high cpu usage by lshttpd

Discussion in 'Ruby/Rails' started by babelian, Apr 29, 2007.

  1. babelian

    babelian New Member

    i've been using litespeed for almost a year now on a rails/mongrel based site, but in the last few weeks i've been noticing cpu load spikes. up until today i thought it was to do with some changes in the ruby app, but today watching 'top' i noticed that the lshttpd process occasionally spikes up to 40% of the cpu for a few minutes at a time. i can't tell what it's doing (access/error/rails/debugging logs show little-to-no activity).

    today (sunday, which should be quiet) the load average is spiking up to 3.5 for a few minutes at a time, and back down to a regular 0.40.

    has anybody else experienced anything like this? if i siege test the home page litespeed only takes up 15-20% of the cpu, so those ~40% spikes on a quiet day seem very strange.

    thanks, zach
  2. mistwang

    mistwang LiteSpeed Staff

    Have you upgraded to the latest release? Are you using epoll or poll?
    We need more information to look into this issue; maybe you can strace the lshttpd process when you catch it happening.
  3. babelian

    babelian New Member

    i'm using 2.2.6 at the moment. a friend had an unsmooth upgrade to 3.0 (the upgrade failed and he scrubbed the directories and installed fresh), so i'm waiting a month, as we're switching hosting providers then anyway.

    i'm using poll, though if it's recommended i can switch to epoll - i assume that's preferable (i'm on fedora 4).
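
    as an aside, a minimal sketch (Python, not from this thread) of the two readiness APIs being discussed: poll() rescans every registered fd on each call, while Linux's epoll keeps the interest set in the kernel, so it scales better with many idle keep-alive connections. The helper name below is made up for illustration.

    ```python
    import select
    import socket

    def wait_readable_poll(sock, timeout_ms=100):
        """Wait for sock to become readable using the portable poll() API."""
        p = select.poll()
        p.register(sock.fileno(), select.POLLIN)
        return p.poll(timeout_ms)  # list of (fd, eventmask) pairs

    a, b = socket.socketpair()
    b.sendall(b"ping")             # make `a` readable
    events = wait_readable_poll(a) # one (fd, eventmask) pair, POLLIN bit set
    print(events)
    a.close(); b.close()
    ```

    the epoll equivalent (Linux only) is the same shape: `ep = select.epoll(); ep.register(fd, select.EPOLLIN); ep.poll(0.1)` - the difference is that the registration persists across calls instead of being rebuilt each time.
    
    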

    thanks for the response
  4. babelian

    babelian New Member

    attached is some strace output from lshttpd (also here http://www.arenaflowers.com:8080/strace_output.txt if the attachment didn't work). interestingly, while it runs it squishes lshttpd back down to about 3%. (it was at 35-40% before i started the trace, and would go back up again if i cancelled it).

    here's the top output without the strace:

    top - 01:03:06 up 28 days, 19:45, 2 users, load average: 1.75, 1.49, 1.19
    Tasks: 111 total, 5 running, 105 sleeping, 0 stopped, 1 zombie
    Cpu(s): 48.0% us, 27.1% sy, 0.0% ni, 13.1% id, 0.0% wa, 0.0% hi, 11.8% si
    Mem: 2074912k total, 1924328k used, 150584k free, 39180k buffers
    Swap: 2031608k total, 92k used, 2031516k free, 1052560k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    19968 rails 16 0 97316 87m 4692 R 60.3 4.3 216:54.26 ruby /usr/lib/ruby/gems/1.8/gems/mongrel-1.0.1/bin/mongrel_rails start -d -e pro
    19970 rails 16 0 67968 58m 4668 R 58.3 2.9 186:58.65 ruby /usr/lib/ruby/gems/1.8/gems/mongrel-1.0.1/bin/mongrel_rails start -d -e pro
    27995 rails 15 0 6368 3760 1516 R 51.4 0.2 19:58.76 lshttpd
    12954 postfix 15 0 7024 2432 2028 S 0.7 0.1 0:08.28 smtpd -n arenaflowers.com:25 -t inet -u -o smtpd_sasl_auth_enable yes
    31287 root 15 0 178m 166m 2276 S 0.3 8.2 3:09.93 backgroundrb



    i'm still not convinced it's a litespeed problem, so i'll try tracing the mongrels as well and see if i spot anything. i noticed that when i restarted the mongrel everything dropped back to normal, though i should do that a few times to make sure it's not just a coincidence.
  5. babelian

    babelian New Member

    what seems to be happening is some sort of loop between mongrel and litespeed. using strace i can see repetitions of one request, for example

    writev(4, [{"GET /flower_delivery/florists_in"..., 441}, {"X-Forwarded-For: 74.124.192.171\r"..., 35}], 2) = 476

    in litespeed

    and a similar read() line in mongrel:

    read(218, "GET /flower_delivery/florists_in"..., 16384) = 476

    the repetition goes on much longer than it would take to transfer the file (minutes), and if i gracefully restart litespeed (or mongrel) everything returns to normal. there's nothing wrong in the ruby app (ie clicking on the same link to test won't cause a problem, and it happens with links in various parts of the site, all of which use caches_page).

    is the bad file descriptor the issue? see below:


    full litespeed loop block:

    gettimeofday({1178219476, 926735}, NULL) = 0
    select(216, [3 215], [], [], {0, 619176}) = 1 (in [3], left {0, 619176})
    accept(3, {sa_family=AF_INET, sin_port=htons(36453), sin_addr=inet_addr("127.0.0.1")}, [16]) = 218
    fcntl64(218, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(218, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3c000
    _llseek(218, 0, 0xbfe86bbc, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    fcntl64(218, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(218, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3b000
    _llseek(218, 0, 0xbfe86bbc, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    setsockopt(218, SOL_TCP, TCP_CORK, [1], 4) = 0
    gettimeofday({1178219476, 930155}, NULL) = 0
    select(219, [215 218], [], [], {0, 0}) = 1 (in [218], left {0, 0})
    gettimeofday({1178219476, 930767}, NULL) = 0
    gettimeofday({1178219476, 931024}, NULL) = 0
    select(216, [3 215], [], [], {0, 0}) = 0 (Timeout)
    read(218, "GET /flower_delivery/florists_in"..., 16384) = 476
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    close(218) = 0
    munmap(0xb7f3b000, 4096) = 0
    close(218) = -1 EBADF (Bad file descriptor)
    munmap(0xb7f3c000, 4096) = 0


    full mongrel block:

    connect(4, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
    epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=4, u64=4}}) = 0
    time(NULL) = 1178219495
    gettimeofday({1178219495, 538716}, NULL) = 0
    epoll_wait(6, {{EPOLLOUT, {u32=4, u64=4}}}, 341, 100) = 1
    getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(4, [{"GET /flower_delivery/florists_in"..., 441}, {"X-Forwarded-For: 74.124.192.171\r"..., 35}], 2) = 476
    epoll_ctl(6, EPOLL_CTL_MOD, 4, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=4, u64=4}}) = 0
    time(NULL) = 1178219495
    gettimeofday({1178219495, 540226}, NULL) = 0
    epoll_wait(6, {{EPOLLIN, {u32=4, u64=4}}}, 341, 100) = 1
    read(4, "", 1460) = 0
    epoll_ctl(6, EPOLL_CTL_DEL, 4, {0, {u32=4, u64=4}}) = 0
    close(4) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
    fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    Last edited: May 3, 2007
  6. mistwang

    mistwang LiteSpeed Staff

    It looks like a Mongrel instance has gone bad: it closes the connection without sending anything back, and LiteSpeed keeps retrying the bad Mongrel, so the end result is that both lshttpd and mongrel use a lot of CPU.
    Please try the latest 3.1 release; it should be as stable as 2.2.6, if not better. If the problem is still there, we will investigate further.
  7. andreas

    andreas New Member

    I have exactly the same problem with 3.2.2. A few times per day the CPU usage of lsws and mongrel explodes for no apparent reason:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    6985 www-data 16 0 33616 30m 2272 R 52.5 1.5 9:25.64 mongrel_rails
    18228 www-data 15 0 10196 7772 2152 S 30.9 0.4 11:35.12 lshttpd

    lsws is set up as a proxy for mongrel, which is delivering static files (only a few per second). There is no excessive traffic to the app when it happens, and I can't reproduce the problem by hammering the server with requests. I think the problem started with 3.1, but I'm not sure.

    A strace shows that lsws seems to be sending the same request to mongrel again and again:
    writev(14, [{"GET /math/722713bae093b63c9c5fd9"..., 705},...
    ...
    read(12, "", 1460) = 0
    close(12) = 0
  8. babelian

    babelian New Member

    it was 2.2.6 that had the problem for me, and the problem went away with 3.1.1, but this was part of a migration to a completely different setup (two boxes, fresh installs of all packages etc) so it could easily have been something else. sorry i haven't got more to go on :(
  9. babelian

    babelian New Member

    After 3-4 months on new servers, this problem has returned. I'm pretty sure it has nothing to do with the app's code (it's definitely between mongrel and litespeed). While the problem is happening I can view the cached page in question fine in a browser, and all the mongrels seem to behave fine when pinged directly (or through LSWS).

    I think it might possibly be to do with Google crawling and hitting pages at a certain speed (the pages that cause the problem are always SEO-related pages). It happens at any time of day or night, so it doesn't seem resource related.

    I'm using LSWS 3.2.4 and mongrel 1.0.1, so the upgrade hasn't helped. Any ideas on which settings I could tweak (max keep-alives etc)? I'm using epoll.
  10. mistwang

    mistwang LiteSpeed Staff

    Strace output from the process with high CPU utilization should help identify the problem. If LSWS keeps sending the same request to mongrel without getting any valid response back, it is probably an issue with mongrel. Have you tried restarting the mongrels when it happens?

    The better solution would probably be to switch to our ruby lsapi. :)
  11. babelian

    babelian New Member

    Hi mistwang,

    There's strace output in my previous post #5; it hasn't changed at all since the problem occurred in May (but see attached and below). I can't tell if Mongrel is sending a valid response back. Restarting mongrel or lsws while it is occurring stops the problem, but it'll happen again five minutes later.


    Look at 'florists_in_rg21_hampshire_england_uk' or 'florists_in_ellisfield_hampshire_england_uk' in the traces below.

    Mongrel is closing with EBADF; litespeed doesn't seem to log any errors that I can see, but it is requesting the file a lot.

    LiteSpeed:

    [root@web rails]# strace -p 16899 -s 124
    Process 16899 attached - interrupt to quit
    gettimeofday({1193683853, 953263}, NULL) = 0
    epoll_wait(6, {{EPOLLIN, {u32=17, u64=17}}}, 1365, 100) = 1
    read(17, "", 1460) = 0
    epoll_ctl(6, EPOLL_CTL_DEL, 17, {0, {u32=17, u64=17}}) = 0
    close(17) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 17
    fcntl64(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    connect(17, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
    fcntl64(17, F_SETFD, FD_CLOEXEC) = 0
    epoll_ctl(6, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 956105}, NULL) = 0
    epoll_wait(6, {{EPOLLOUT, {u32=17, u64=17}}}, 1365, 100) = 1
    getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(17, [{"GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 436}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 471
    epoll_ctl(6, EPOLL_CTL_MOD, 17, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 956694}, NULL) = 0
    epoll_wait(6, {{EPOLLIN, {u32=5, u64=5}}}, 1365, 100) = 1
    read(5, "", 1460) = 0
    epoll_ctl(6, EPOLL_CTL_DEL, 5, {0, {u32=5, u64=5}}) = 0
    close(5) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
    fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
    epoll_ctl(6, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 959090}, NULL) = 0
    epoll_wait(6, {{EPOLLOUT, {u32=5, u64=5}}}, 1365, 100) = 1
    getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(5, [{"GET /flower_delivery/florists_in_ellisfield_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, im"..., 442}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 477
    epoll_ctl(6, EPOLL_CTL_MOD, 5, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 959682}, NULL) = 0
    epoll_wait(6, {{EPOLLIN, {u32=17, u64=17}}}, 1365, 100) = 1
    read(17, "", 1460) = 0
    epoll_ctl(6, EPOLL_CTL_DEL, 17, {0, {u32=17, u64=17}}) = 0
    close(17) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 17
    fcntl64(17, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    setsockopt(17, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    connect(17, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
    fcntl64(17, F_SETFD, FD_CLOEXEC) = 0
    epoll_ctl(6, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 961588}, NULL) = 0
    epoll_wait(6, {{EPOLLOUT, {u32=17, u64=17}}}, 1365, 100) = 1
    getsockopt(17, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    writev(17, [{"GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 436}, {"X-Forwarded-For: 74.124.192.139\r\n\r\n", 35}], 2) = 471
    epoll_ctl(6, EPOLL_CTL_MOD, 17, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=17, u64=17}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 962173}, NULL) = 0
    epoll_wait(6, {{EPOLLIN, {u32=5, u64=5}}}, 1365, 100) = 1
    read(5, "", 1460) = 0
    epoll_ctl(6, EPOLL_CTL_DEL, 5, {0, {u32=5, u64=5}}) = 0
    close(5) = 0
    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(8021), sin_addr=inet_addr("10.245.11.69")}, 16) = -1 EINPROGRESS (Operation now in progress)
    fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
    epoll_ctl(6, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=5, u64=5}}) = 0
    time(NULL) = 1193683853
    gettimeofday({1193683853, 965793}, NULL) = 0
    epoll_wait(6, {{EPOLLOUT, {u32=5, u64=5}}}, 1365, 100) = 1
    getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0






    Mongrel:

    [root@apps rails]# strace -s 124 -p 6090
    Process 6090 attached - interrupt to quit
    gettimeofday({1193683852, 132459}, NULL) = 0
    gettimeofday({1193683852, 132528}, NULL) = 0
    select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})
    read(5, "GET /flower_delivery/florists_in_ellisfield_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, im"..., 16384) = 477
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    close(5) = 0
    munmap(0xb62dd000, 4096) = 0
    close(5) = -1 EBADF (Bad file descriptor)
    munmap(0xb62de000, 4096) = 0
    gettimeofday({1193683852, 133107}, NULL) = 0
    select(34, [33], [], [], {0, 0}) = 0 (Timeout)
    accept(3, {sa_family=AF_INET, sin_port=htons(44429), sin_addr=inet_addr("10.245.11.68")}, [16]) = 5
    fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62de000
    _llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62dd000
    _llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    setsockopt(5, SOL_TCP, TCP_CORK, [1], 4) = 0
    gettimeofday({1193683852, 133837}, NULL) = 0
    select(34, [5 33], [], [], {0, 0}) = 1 (in [5], left {0, 0})
    gettimeofday({1193683852, 133988}, NULL) = 0
    gettimeofday({1193683852, 134030}, NULL) = 0
    select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})
    read(5, "GET /flower_delivery/florists_in_rg21_hampshire_england_uk HTTP/1.1\nAccept: image/gif, image/x-xbitmap, image/jpeg, image/pj"..., 16384) = 471
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    close(5) = 0
    munmap(0xb62dd000, 4096) = 0
    close(5) = -1 EBADF (Bad file descriptor)
    munmap(0xb62de000, 4096) = 0
    gettimeofday({1193683852, 134521}, NULL) = 0
    select(34, [33], [], [], {0, 0}) = 0 (Timeout)
    accept(3, {sa_family=AF_INET, sin_port=htons(44430), sin_addr=inet_addr("10.245.11.68")}, [16]) = 5
    fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62de000
    _llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    fcntl64(5, F_GETFL) = 0x2 (flags O_RDWR)
    fstat64(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb62dd000
    _llseek(5, 0, 0xbfe194b4, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    setsockopt(5, SOL_TCP, TCP_CORK, [1], 4) = 0
    gettimeofday({1193683852, 135199}, NULL) = 0
    select(34, [5 33], [], [], {0, 0}) = 1 (in [5], left {0, 0})
    gettimeofday({1193683852, 135346}, NULL) = 0
    gettimeofday({1193683852, 135388}, NULL) = 0
    select(34, [3 33], [], [], {0, 0}) = 1 (in [3], left {0, 0})


  12. mistwang

    mistwang LiteSpeed Staff

    The problem is on the mongrel side. As shown in the strace output, file descriptor 5 is the reverse-proxy connection from LSWS to mongrel. Mongrel reads the request, then closes the connection immediately without sending anything back, then tries to close it again, which fails with EBADF because the file descriptor has already been closed.
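
    The double close that produces EBADF in the trace is easy to reproduce in isolation (a generic sketch, not mongrel's actual code): the first close() succeeds, and a second close() on the same descriptor number fails with EBADF.

    ```python
    # Reproduce the close(fd) = 0, close(fd) = -1 EBADF pattern from the trace.
    import errno
    import os

    r, w = os.pipe()
    os.close(w)          # first close succeeds; the descriptor is now invalid
    try:
        os.close(w)      # second close on the same fd number fails
    except OSError as e:
        assert e.errno == errno.EBADF
        print("second close failed with EBADF, as in the strace output")
    os.close(r)
    ```

    The danger in real servers is worse than a harmless error: if another thread opens a file between the two closes, the second close can silently shut the wrong descriptor.
    
    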

    When mongrel is working properly, it sends the reply back to LSWS before closing the socket.

    The root cause of the problem is on the Mongrel side; however, LSWS should fail the request after a few retries. We will implement that in our 3.3 release.
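
    The fix described here (giving up after a few retries instead of looping) can be sketched generically; the function names and the retry cap below are hypothetical, not LSWS's actual implementation.

    ```python
    # Generic retry-with-cap sketch: if the backend keeps closing the
    # connection without sending a response, fail the request after a few
    # attempts instead of retrying forever (the behavior that pegged the CPU).
    MAX_RETRIES = 3  # hypothetical cap

    def forward_request(send_to_backend, request):
        for attempt in range(MAX_RETRIES):
            response = send_to_backend(request)
            if response:                  # backend sent something back
                return response
            # empty read == backend closed without replying; retry
        return "503 Service Unavailable"  # give up instead of looping

    # A stand-in backend that always closes without replying:
    bad_backend = lambda req: b""
    print(forward_request(bad_backend, b"GET / HTTP/1.1"))  # -> 503 Service Unavailable
    ```

    With a cap like this, a wedged backend costs three quick attempts per request rather than an unbounded connect/write/close loop between the two processes.
    
    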
  13. babelian

    babelian New Member

    thanks mistwang, that's great news.
  14. mistwang

    mistwang LiteSpeed Staff

    The 3.3 release package is available on our server; just change the version number in the download URL to 3.3 to get it. Please give it a try if you want to.
  15. babelian

    babelian New Member

    thanks mistwang. i'll try it out and let you know if the problem continues.
