kernel: Bad page state in process 'lshttpd'

#1
Code:
Message from syslogd@ at Mon Jul 13 07:49:20 2009 ...
host kernel: Bad page state in process 'lshttpd'
Message from syslogd@ at Mon Jul 13 07:49:20 2009 ...
host kernel: page:c33fb540 flags:0xc0080010 mapping:00000000 mapcount:0 count:0 (Not tainted)
Message from syslogd@ at Mon Jul 13 07:49:20 2009 ...
host kernel: Trying to fix it up, but a reboot is needed
I can't explain why I've been plagued by this over the past few months. ;/
 

mistwang

LiteSpeed Staff
#2
Sorry about the problem you experienced.
Looks like lshttpd triggered a kernel bug.

Which Linux distribution and kernel version are you using?
Searching for the error message "Bad page state in process" turns up many discussions on the kernel mailing list.
I think only upgrading or downgrading to a kernel without this bug can fix it.
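
A minimal sketch of how the distribution and kernel version asked about above could be collected. The function name show_kernel_info is made up for illustration, and the /etc/redhat-release check assumes a Red Hat style system; on other distributions that file will simply be absent.

```shell
#!/bin/sh
# Hypothetical helper: print the running kernel release, the machine
# architecture and, where available, the distribution release string.
show_kernel_info() {
    echo "kernel release: $(uname -r)"
    echo "architecture:   $(uname -m)"
    # Only present on Red Hat style systems (assumption about the host).
    if [ -r /etc/redhat-release ]; then
        echo "distribution:   $(cat /etc/redhat-release)"
    fi
}

show_kernel_info
```

Pasting that output into the thread gives everyone the exact kernel build to compare against known bug reports.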
 

mistwang

LiteSpeed Staff
#3
I went over your thread at WebHostingTalk, and it looks like you have been trying different kernels. Many people run the RHEL5 stable kernel on their servers, so if this were a bug in the stable kernel, a lot of users would be reporting the same problem.

Have you swapped any of the hardware: memory, CPU, motherboard?
 
#4
I recently took a trip and was gone for about a week, which led to people leaving my host. The server stayed stable and was up for 5 days, until tonight, about 24 hours after I told everyone I was back; it looks like some clients came back over.

I've run all the kernels, and right now I'm back on the latest stable RHEL5 kernel.

The hardware was also changed, since I changed datacenters.
 

mistwang

LiteSpeed Staff
#5
I think it is a kernel bug. I found this:
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.28.3

x86, mm: fix pte_free()

commit 42ef73fe134732b2e91c0326df5fd568da17c4b2 upstream.

On -rt we were seeing spurious bad page states like:

Bad page state in process 'firefox'
page:c1bc2380 flags:0x40000000 mapping:c1bc2390 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 503, comm: firefox Not tainted 2.6.26.8-rt13 #3
[<c043d0f3>] ? printk+0x14/0x19
[<c0272d4e>] bad_page+0x4e/0x79
[<c0273831>] free_hot_cold_page+0x5b/0x1d3
[<c02739f6>] free_hot_page+0xf/0x11
[<c0273a18>] __free_pages+0x20/0x2b
[<c027d170>] __pte_alloc+0x87/0x91
[<c027d25e>] handle_mm_fault+0xe4/0x733
[<c043f680>] ? rt_mutex_down_read_trylock+0x57/0x63
[<c043f680>] ? rt_mutex_down_read_trylock+0x57/0x63
[<c0218875>] do_page_fault+0x36f/0x88a

This is the case where a concurrent fault already installed the PTE and
we get to free the newly allocated one.

This is due to pgtable_page_ctor() doing the spin_lock_init(&page->ptl)
which is overlaid with the {private, mapping} struct.

union {
	struct {
		unsigned long private;
		struct address_space *mapping;
	};
	spinlock_t ptl;
	struct kmem_cache *slab;
	struct page *first_page;
};

Normally the spinlock is small enough to not stomp on page->mapping, but
PREEMPT_RT=y has huge 'spin'locks.

But lockdep kernels should also be able to trigger this splat, as the
lock tracking code grows the spinlock to cover page->mapping.

The obvious fix is calling pgtable_page_dtor() like the regular pte free
path __pte_free_tlb() does.

It seems all architectures except x86 and nm10300 already do this, and
nm10300 doesn't seem to use pgtable_page_ctor(), which suggests it
doesn't do SMP or simply doesnt do MMU at all or something.

Signed-off-by: Peter Zijlstra <a.p.zijlsta@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If you can, you should provide the kernel backtrace to help identify the problem.
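
A rough sketch of one way to pull such a report out of the logs, assuming the kernel actually wrote a backtrace after the "Bad page state" line. The function name extract_bad_page, the default log path /var/log/messages and the 25-line context are illustrative choices, not values taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print every "Bad page state" report together with
# the lines that follow it, which is where a backtrace would appear.
#   $1  log file to search (assumed default: /var/log/messages)
#   $2  number of trailing lines to include (assumed default: 25)
extract_bad_page() {
    log="${1:-/var/log/messages}"
    context="${2:-25}"
    grep -A "$context" 'Bad page state' "$log"
}
```

Running extract_bad_page with no arguments searches the default log; the same grep can also be fed from the kernel ring buffer with dmesg | grep -A 25 'Bad page state'.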
 

yuho

New Member
#7
Set up an hourly cron job to do the following:

Code:
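#!/bin/sh
# Force-kill every lshttpd process, then start LSWS again.
killall -9 lshttpd
/sbin/service lsws start
# Flush dirty pages to disk, then drop the page cache, dentries and inodes.
sync; echo 3 > /proc/sys/vm/drop_caches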
That will do a hard reset and will probably fix your problem.
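
For the hourly schedule itself, a sketch along these lines might work. The path /usr/local/sbin/restart-lsws.sh is a hypothetical location for the script above, and the install command is left commented out so that pasting the block changes nothing.

```shell
#!/bin/sh
# Hypothetical path: save the restart script there and make it executable.
SCRIPT=/usr/local/sbin/restart-lsws.sh

# Crontab line that runs the script at minute 0 of every hour.
CRON_LINE="0 * * * * $SCRIPT"
echo "$CRON_LINE"

# To install it for root, append the line to the existing crontab:
# ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

Keep in mind that killall -9 drops any requests in flight, so this is a stopgap while the real cause is tracked down.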

----

Could this be an LSAPI issue? You said it was running stable when you had fewer sites.
 

mistwang

LiteSpeed Staff
#8
How do I do a backtrace? And what kernel do you recommend that I upgrade to?
I have never debugged a kernel bug, so I am not sure either, but I think there should be a kernel configuration option that lets the kernel dump a backtrace like the one I posted.

The RHEL5 kernel should be stable; otherwise we would be flooded with bug reports like yours.

Has any kernel parameter or other tuning been applied on top of the default setup?
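
One way to check for such tuning, sketched under the assumption that it would live in /etc/sysctl.conf or on the kernel command line. The function name active_sysctl_settings is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the active settings from a sysctl file,
# skipping blank lines and comment lines (which start with '#' or ';').
active_sysctl_settings() {
    file="${1:-/etc/sysctl.conf}"
    grep -Ev '^[[:space:]]*([#;]|$)' "$file"
}

# Show persistent sysctl settings, if the file exists on this host.
if [ -r /etc/sysctl.conf ]; then
    active_sysctl_settings /etc/sysctl.conf || true
fi

# Show the boot parameters the running kernel was started with.
if [ -r /proc/cmdline ]; then
    cat /proc/cmdline
fi
```

A stock install already ships a few entries in that file, so the output is only meaningful when compared against an untouched copy from the same distribution.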

Another suggestion is to install 64-bit Linux if your server has more than 4GB of memory; otherwise, get rid of the PAE kernel.
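
A small sketch for checking which case applies. The helper names more_than_4gb and is_pae_release are hypothetical, and the PAE test assumes the Red Hat convention of a "PAE" suffix in the kernel release string.

```shell
#!/bin/sh
# Succeeds when the given amount of memory (in kB) exceeds 4GB.
more_than_4gb() {
    [ "$1" -gt 4194304 ]   # 4 * 1024 * 1024 kB
}

# Succeeds when a kernel release string names a PAE build
# (assumed naming convention: release contains "PAE").
is_pae_release() {
    case "$1" in
        *PAE*) return 0 ;;
        *)     return 1 ;;
    esac
}

release=$(uname -r)
echo "kernel release: $release ($(uname -m))"
if is_pae_release "$release"; then
    echo "this looks like a PAE kernel"
fi

# MemTotal is reported in kB and is slightly below the installed RAM.
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "MemTotal: ${mem_kb} kB"
    if more_than_4gb "$mem_kb"; then
        echo "more than 4GB of RAM: a 64-bit install is worth considering"
    fi
fi
```

Whether the CPU can run a 64-bit kernel at all shows up as the lm flag in /proc/cpuinfo.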

You can try restarting LSWS regularly as suggested as well.
 
#9
The script above works. I just went 1 day with uptime, so I'll let it go for another day and see how it goes. If it's still stable, I'm going to try a few things to narrow down the cause. I'll try removing drop_caches first and see what happens.

So far my ideas are:
- bad sector in memory?
- LiteSpeed caching in some way that causes corruption?

What I can't figure out is what I'm doing so differently from all the other cPanel setups that run LS (default config) with probably more active sites than me. It could be a kernel issue, as you pointed out, so I'll try upgrading to the latest and see what happens later this week.
 