Help with hung processes

GOT

Well-Known Member
#1
How can I find out why, at times, all the processes hang and the site stops responding? This happens a few times a day, and it's not traceable to a big traffic spike. The process list is below. Once this happens I have to stop lsws, kill all the lsphp5 processes, and then start lsws again (the exact commands I use are at the end of this post, after the process list).

Page requests are only around 10 pages/sec. Memory, server load, and I/O are all fine.

The server is a dual-processor hex-core with SSDs.

26620 ? S 0:07 litespeed (lshttpd)
26624 ? S 0:00 \_ httpd (lscgid)
26625 ? Sl 3:17 \_ litespeed (lshttpd)
26054 ? S 0:00 | \_ lsphp5
31269 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31273 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31276 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31294 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31304 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31318 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31335 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31400 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31423 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31429 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31432 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31434 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31435 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31437 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31438 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31447 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31470 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31474 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31475 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31479 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31486 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31493 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31504 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31513 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31519 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31526 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31527 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31536 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31543 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31565 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31580 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31591 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31602 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31612 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31633 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31635 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31690 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31693 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31713 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31731 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31732 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31733 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31736 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31740 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31744 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31745 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
31775 ? S 0:00 | \_ lsphp5:/home/ocfcom/public_html/index.php
26626 ? Sl 1:20 \_ litespeed (lshttpd)
31629 ? S 0:00 \_ lsphp5
31752 ? S 0:00 \_ lsphp5
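
For reference, the recovery sequence I use is roughly this (a sketch; it assumes a default /usr/local/lsws install and that the PHP workers show up in ps as lsphp5):

/usr/local/lsws/bin/lswsctrl stop     # stop LiteSpeed
killall -9 lsphp5                     # kill any leftover PHP workers
/usr/local/lsws/bin/lswsctrl start    # start LiteSpeed again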
 

Pong

Administrator
Staff member
#2
Any information in your error log? Which process consumes the most resources when the problem occurs? What do the real-time stats in the LiteSpeed admin console show?
 

GOT

Well-Known Member
#3
Attached is what the live stats look like. It shows 40 of 40 of the pool in use.

The error logs just say:

2015-02-04 12:37:42.000 [INFO] [CLEANUP] Send signal: 10 to process: 29660
2015-02-04 12:37:42.000 [INFO] [CLEANUP] Send signal: 10 to process: 29659
2015-02-04 12:37:44.000 [INFO] [CLEANUP] Send signal: 10 to process: 29665
2015-02-04 12:37:44.000 [INFO] [CLEANUP] Send signal: 10 to process: 29663
2015-02-04 12:37:44.000 [INFO] [CLEANUP] Send signal: 10 to process: 29662
2015-02-04 12:37:44.000 [INFO] [CLEANUP] Send signal: 10 to process: 29664
2015-02-04 12:37:47.000 [INFO] [CLEANUP] Send signal: 10 to process: 29669
2015-02-04 12:37:47.000 [INFO] [CLEANUP] Send signal: 10 to process: 29667
2015-02-04 12:37:49.000 [INFO] [CLEANUP] Send signal: 10 to process: 29674
2015-02-04 12:37:50.000 [INFO] [CLEANUP] Send signal: 10 to process: 29670
2015-02-04 12:37:52.000 [INFO] [CLEANUP] Send signal: 10 to process: 29649
2015-02-04 12:37:52.000 [INFO] [CLEANUP] Send signal: 10 to process: 29616
2015-02-04 12:37:53.000 [INFO] [CLEANUP] Send signal: 10 to process: 29681
2015-02-04 12:37:55.000 [INFO] [CLEANUP] Send signal: 10 to process: 29683
2015-02-04 12:38:00.000 [INFO] [CLEANUP] Send signal: 10 to process: 29689
2015-02-04 12:38:01.000 [INFO] [CLEANUP] Send signal: 10 to process: 29691


You can see that the requests per second are really very low, less than 3 per second.
 


GOT

Well-Known Member
#5
I've done that before using strace -p <pid>, and the process isn't doing anything; it's just hung.

Are there better command arguments to use?
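
Would something like this capture more useful detail? (Just a guess on my part: -f follows child processes, -tt adds timestamps, -T shows time spent in each syscall, -s lengthens truncated strings.)

strace -f -tt -T -s 200 -p <pid> -o /tmp/lsphp5-trace.txt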
 

NiteWave

Administrator
#6
The issue looks difficult at the moment; we'll have to collect more information gradually.

>Once this happens I have to stop lsws, kill all lsphp5 processes and then start lsws again.
How about killing all the lsphp5 processes only, with no stop/restart of lsws?
That should work too. If not, it's an even stranger issue.
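
For example, something like this (assuming the workers show up as lsphp5 in ps):

killall lsphp5        # or, if they ignore SIGTERM: killall -9 lsphp5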
 

GOT

Well-Known Member
#7
Yeah, I get that. We have two application servers, so we can take the time necessary, but it's baffling and it is causing issues. Last night it happened to both servers at the same time.

I've been dealing with this for a couple of months trying different settings and timeouts.

I need to get it resolved though.

I'll send you the output of the strace as soon as I can catch it again, but I can tell you, it looked less than helpful to me.
 

GOT

Well-Known Member
#8
It happened again this morning. Here is all I got:

[root@app02 ~]# strace -p 29846

Process 29846 attached - interrupt to quit

futex(0x7fbf50490190, FUTEX_WAIT, 2, NULL


and it just sat there until I broke out.
 
#9
Do you use CloudLinux on this server? If so, you may want to check your EP and nPROC limits. I was having a similar problem, and it was fixed by increasing the nPROC limit to 4*EP (if your EP limit is 10, set nPROC to at least 40).
Not sure if we are having the same issue, but I hope it helps.
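
On my server I checked and raised the limit with lvectl; something like this (just what worked for me, the exact flags may differ between CloudLinux versions):

lvectl list                             # show current LVE limits per user ID
lvectl set <user_id> --nproc=40 --save  # raise nPROC for that user (4 * EP in my case)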
 

NiteWave

Administrator
#11
>futex(0x7fbf50490190, FUTEX_WAIT, 2, NULL

This shows a deadlock happened, so all the processes hung.
Is any opcode cache enabled? Is suhosin.so installed? See if you can temporarily remove unnecessary extensions one by one and watch.
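
To see what is currently loaded, something like this should work (the lsphp5 path here assumes a default lsws layout; adjust for your install):

/usr/local/lsws/fcgi-bin/lsphp5 -m                                        # list loaded PHP modules
/usr/local/lsws/fcgi-bin/lsphp5 -m | egrep -i 'apc|xcache|eaccelerator|suhosin'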
 