100% CPU after 4.1.1 upgrade from 4.0.20

#1
Hi, I've been running LiteSpeed for a long time and I don't remember the last time I had any problems with it.

Yesterday I upgraded from 4.0.20 to 4.1.1 (for no special reason) and later I started noticing an oddity: sometimes the litespeed process (lshttpd) used 100% of the CPU.

This happens when users use my service and when they try to upload a file, CPU of "litespeed" process goes to 100% and stays there....

I switched back and forth from 4.0.20 to 4.1.1 and I can reproduce this with 4.1.1 only.

I'm pretty sure it's a bug somewhere... but I haven't isolated the problem yet... this only happens when the user tries to upload a file (I tested with a 4MB file). I know I haven't done anything besides upgrading LiteSpeed.

This web service is a Perl script using CGI::Fast, on a Debian 5 linux running 2.6.26.

I did a "strace" to this process, and when the CPU goes to 100% it loops with messages like these:

Code:
10:45:23.817680 gettimeofday({1309959923, 817690}, NULL) = 0
10:45:23.817732 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.817785 gettimeofday({1309959923, 817795}, NULL) = 0
10:45:23.817836 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.817890 gettimeofday({1309959923, 817900}, NULL) = 0
10:45:23.817941 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.817994 gettimeofday({1309959923, 818004}, NULL) = 0
10:45:23.818046 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818099 gettimeofday({1309959923, 818109}, NULL) = 0
10:45:23.818150 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818204 gettimeofday({1309959923, 818214}, NULL) = 0
10:45:23.818255 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818308 gettimeofday({1309959923, 818318}, NULL) = 0
10:45:23.818360 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818413 gettimeofday({1309959923, 818423}, NULL) = 0
10:45:23.818464 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818518 gettimeofday({1309959923, 818528}, NULL) = 0
10:45:23.818569 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818622 gettimeofday({1309959923, 818632}, NULL) = 0
10:45:23.818674 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818727 gettimeofday({1309959923, 818737}, NULL) = 0
10:45:23.818778 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818832 gettimeofday({1309959923, 818841}, NULL) = 0
10:45:23.818883 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.818939 gettimeofday({1309959923, 818949}, NULL) = 0
10:45:23.818991 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
10:45:23.819044 gettimeofday({1309959923, 819054}, NULL) = 0
10:45:23.819096 epoll_wait(6, {{EPOLLOUT, {u32=24, u64=24}}}, 10, 100) = 1
I switched back to 4.0.20. Let me know if I can help you with any other info!

Regards!
 
#3
William, thanks for your reply, but I didn't understand what you mean. Could you please rewrite in "user-friendly language"?

Like I said, I only upgraded LiteSpeed from 4.0.20 to 4.1.1 from the admin web console. Is there an upgrade requirement that I am missing here?


Thanks
 
#5
hi mistwang, I never used this feature and I have both Outbound Bandwidth and Inbound Bandwidth set to 0.

Let me know if you need any other info.


Thanks
 

mistwang

LiteSpeed Staff
#8
can you please reproduce the bug with debug logging enabled?
You can search related log entries by your IP address, please send the log to bug@litespeedtech...
 

IrPr

Well-Known Member
#11
I'm facing the same issue on 4.1.1 and 4.1.2 both, no anything special found on error log
No any CGI application, no any Output/Inout BW limitation, just simple lsphp5 processes are running
AIO, epoll and noatime (fstab) are used on the box


here is top head output:
top - 00:19:02 up 3 days, 11:21, 5 users, load average: 185.35, 175.61, 102.62
Tasks: 1153 total, 160 running, 993 sleeping, 0 stopped, 0 zombie
Cpu(s): 21.3%us, 77.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 66007536k total, 45632468k used, 20375068k free, 2394380k buffers
Swap: 2096472k total, 0k used, 2096472k free, 30505516k cached

look at sys CPU usage, lshttpd is eating all CPU power

I'll switch to 4.0 and seeing, but any idea on this?
 

mistwang

LiteSpeed Staff
#12
there is a bug in mod_security implementation may cause high CPU usage after lshttpd run for a while. Please download and upgrade to the latest 4.1.2 build.
If it happen again, please try "strace -tt -T -p <pid_of_lshttpd_using_100%_cpu>"
 

IrPr

Well-Known Member
#13
upgraded to latest 4.1.12 and it happened again

strace is very ugly but i found these bellow lines there:

Code:
00:56:46.039465 epoll_ctl(7, EPOLL_CTL_MOD, 3, {EPOLLERR|EPOLLHUP, {u32=0, u64=12884901888}}) = 0 <0.005406>
00:56:46.046294 epoll_ctl(7, EPOLL_CTL_MOD, 3, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=12884901888}}) = 0 <0.000630>
00:56:46.049874 recvfrom(28, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000928>
00:56:46.051698 epoll_ctl(7, EPOLL_CTL_MOD, 28, {EPOLLERR|EPOLLHUP, {u32=0, u64=120259084288}}) = 0 <0.001321>
00:56:46.053730 epoll_ctl(7, EPOLL_CTL_MOD, 28, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=120259084288}}) = 0 <0.003656>
00:56:46.059891 recvfrom(41, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.003208>
00:56:46.063269 epoll_ctl(7, EPOLL_CTL_MOD, 41, {EPOLLERR|EPOLLHUP, {u32=0, u64=176093659136}}) = 0 <0.002374>
00:56:46.067189 epoll_ctl(7, EPOLL_CTL_MOD, 41, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=176093659136}}) = 0 <0.000409>
00:56:46.068408 recvfrom(43, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000439>
00:56:46.070474 epoll_ctl(7, EPOLL_CTL_MOD, 74, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=317827579904}}) = 0 <0.002435>
00:56:46.132436 recvfrom(98, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000715>
00:56:46.133559 epoll_ctl(7, EPOLL_CTL_MOD, 98, {EPOLLERR|EPOLLHUP, {u32=0, u64=420906795008}}) = 0 <0.002848>
00:56:46.136911 epoll_ctl(7, EPOLL_CTL_MOD, 98, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=420906795008}}) = 0 <0.000948>
00:56:46.138793 epoll_ctl(7, EPOLL_CTL_MOD, 101, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=433791696896}}) = 0 <0.000544>
00:56:46.140099 epoll_ctl(7, EPOLL_CTL_MOD, 101, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=433791696896}}) = 0 <0.000976>
00:56:46.141473 epoll_ctl(7, EPOLL_CTL_MOD, 105, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=450971566080}}) = 0 <0.000787>
00:56:46.143161 recvfrom(111, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.002742>
00:56:46.148236 epoll_ctl(7, EPOLL_CTL_MOD, 111, {EPOLLERR|EPOLLHUP, {u32=0, u64=476741369856}}) = 0 <0.001239>
00:56:46.151945 epoll_ctl(7, EPOLL_CTL_MOD, 111, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=476741369856}}) = 0 <0.001868>
00:56:46.185697 recvfrom(112, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.004630>
00:56:46.191481 epoll_ctl(7, EPOLL_CTL_MOD, 112, {EPOLLERR|EPOLLHUP, {u32=0, u64=481036337152}}) = 0 <0.003293>
00:56:46.195478 epoll_ctl(7, EPOLL_CTL_MOD, 112, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=481036337152}}) = 0 <0.000903>
00:56:46.198602 epoll_ctl(7, EPOLL_CTL_MOD, 116, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=498216206336}}) = 0 <0.001762>
00:56:46.201337 epoll_ctl(7, EPOLL_CTL_MOD, 116, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=498216206336}}) = 0 <0.002027>
00:56:46.204415 epoll_ctl(7, EPOLL_CTL_MOD, 117, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=502511173632}}) = 0 <0.000342>
00:56:46.205935 epoll_ctl(7, EPOLL_CTL_MOD, 117, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=502511173632}}) = 0 <0.002864>
00:56:46.209402 recvfrom(118, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000978>
00:56:46.212161 epoll_ctl(7, EPOLL_CTL_MOD, 118, {EPOLLERR|EPOLLHUP, {u32=0, u64=506806140928}}) = 0 <0.000267>
00:56:46.215712 epoll_ctl(7, EPOLL_CTL_MOD, 118, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=506806140928}}) = 0 <0.000114>
00:56:46.216778 recvfrom(122, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000940>
00:56:46.218182 epoll_ctl(7, EPOLL_CTL_MOD, 122, {EPOLLERR|EPOLLHUP, {u32=0, u64=523986010112}}) = 0 <0.006464>
00:56:46.225484 epoll_ctl(7, EPOLL_CTL_MOD, 122, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=523986010112}}) = 0 <0.000672>
00:56:46.251877 recvfrom(126, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000898>
00:56:46.253708 epoll_ctl(7, EPOLL_CTL_MOD, 126, {EPOLLERR|EPOLLHUP, {u32=0, u64=541165879296}}) = 0 <0.003339>
00:56:46.257553 epoll_ctl(7, EPOLL_CTL_MOD, 126, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=541165879296}}) = 0 <0.000904>
00:56:46.258900 recvfrom(143, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000041>
00:56:46.260824 epoll_ctl(7, EPOLL_CTL_MOD, 143, {EPOLLERR|EPOLLHUP, {u32=0, u64=614180323328}}) = 0 <0.003327>
00:56:46.266498 epoll_ctl(7, EPOLL_CTL_MOD, 143, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=614180323328}}) = 0 <0.000084>
00:56:46.270110 recvfrom(146, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.025589>
00:56:46.296781 epoll_ctl(7, EPOLL_CTL_MOD, 146, {EPOLLERR|EPOLLHUP, {u32=0, u64=627065225216}}) = 0 <0.097823>
00:56:46.395563 epoll_ctl(7, EPOLL_CTL_MOD, 146, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=0, u64=627065225216}}) = 0 <0.000030>
00:56:46.395680 epoll_ctl(7, EPOLL_CTL_MOD, 151, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=648540061696}}) = 0 <0.000009>
00:56:46.395724 epoll_ctl(7, EPOLL_CTL_MOD, 151, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=0, u64=648540061696}}) = 0 <0.000005>
00:56:46.395765 recvfrom(155, 0x7fff11b5d05f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
00:56:46.395812 epoll_ctl(7, EPOLL_CTL_MOD, 155, {EPOLLERR|EPOLLHUP, {u32=0, u64=6657<0.000005>
 

mistwang

LiteSpeed Staff
#14
Can also do "lsof -p <pid_of_100%_cpu>", then

kill -USR2 -p <pid_of_100%_cpu>

to turn on debug logging dynamically, Log Level must be set to "DEBUG", and "Debug Level" "NONE".

If it is too big, please send the log file to bug@litespeedtech...
 

IrPr

Well-Known Member
#15
Unfortunately its a production server and i couldn't risk to switch back into 4.1
but there are other servers which may this problem occur in them, I will update with the result of commands if i possible

Hope henpa provide more information at the night to investigate this odd issue
 

mistwang

LiteSpeed Staff
#16
No problem. this type of issue usually used up all available idle CPU cycles on one core, by spinning in the event loop, maybe a little bit sluggish, but should not block a request.
 
#17
Unfortunately its a production server and i couldn't risk to switch back into 4.1
but there are other servers which may this problem occur in them, I will update with the result of commands if i possible

Hope henpa provide more information at the night to investigate this odd issue
Hi mistwang, I'm sorry I didn't get the info that you've asked yet, but it is also a production server and I didn't have the opportunity yet, but I will eventually sometime in the next few days when the usage is low.

Thanks for your help so far and sorry for not writing here again for a few days.
 

andreas

Well-Known Member
#19
Had the 100% CPU problem with the latest 4.1.2 today. Restarting didn't fix it, but after killing and deleting /tmp/lshttpd it seems to work for now.

strace:
Code:
gettimeofday({1312205537, 345598}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=9, u64=9}}}, 10, 100) = 2
gettimeofday({1312205537, 345644}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=9, u64=9}}}, 10, 100) = 2
gettimeofday({1312205537, 345690}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=9, u64=9}}}, 10, 100) = 2
gettimeofday({1312205537, 345736}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=9, u64=9}}}, 10, 100) = 2
gettimeofday({1312205537, 345782}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=9, u64=9}}}, 10, 100) = 2
 
Top