By default, the LSCWP built-in crawler will add a URI to the blacklist if either of the following conditions is met:

1. The page is not cacheable by design or by default. In other words, any page that sends the response header x-litespeed-cache-control: no-cache will be added to the blacklist after the initial crawl.

2. The page does not respond with one of the following status lines (a sketch applying both conditions follows this list):

HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201
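
To make the two conditions concrete, here is a minimal PHP sketch (not the plugin's actual implementation; the function name is made up for illustration) that applies them to a raw response-header string:

  // Sketch only: returns true if a response-header string trips either
  // blacklist condition described above.
  function would_be_blacklisted( $headers ) {
      // Condition 1: the page declares itself non-cacheable.
      if ( stripos( $headers, 'x-litespeed-cache-control: no-cache' ) !== false ) {
          return true ;
      }
      // Condition 2: the status line is not HTTP/1.1 or HTTP/2 with 200/201.
      if ( ! preg_match( '#^HTTP/(1\.1|2) 20[01]#', $headers ) ) {
          return true ;
      }
      return false ;
  }

Note that an empty $headers string also fails the status-line check, which will matter later in this case.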

A Real Debugging Case:

A user reported that some pages were always being added to the blacklist after the first crawl. Fetching those pages manually with curl or Chrome always showed the x-litespeed-cache header and a 200 OK status code, yet dozens of URIs kept landing on the blacklist during each crawl.

As described above, we know the conditions that cause blacklisting, so we just need to figure out what triggers the crawler to add these URIs to the blacklist.

Checking the debug log, we found it doesn't record the response headers, so a small code modification is needed.

To log more detail, insert the following line into litespeed-cache/lib/litespeed/litespeed-crawler.class.php at line 273:

 LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; 

This way, we will see the $headers value the crawler receives for each URI.

Now, after a manual crawl, check the debug log: `grep headers /path/to/wordpress/wp-content/debug.log`

And here is the problem: most log entries show the header as `HTTP/1.1 200 OK`, but some headers are empty. That is why those URIs are being added to the blacklist.

But why does the page work normally when fetched manually with curl?

[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS

It returns `200 OK` and `x-litespeed-cache-control: public`, so why were the headers empty in the earlier debugging?

So what's next?

Naturally, we now need to mimic the exact options the crawler's PHP curl request used and see what happens.

Add another debug line to capture the curl options the crawler uses. In litespeed-cache/lib/litespeed/litespeed-crawler.class.php, around line 627, insert it just before `return $options ;` so the code reads:

		$options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
		// Debug: dump the full curl option array as JSON
		LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;
		return $options ;

Now let's run a manual crawl again to capture all the options.

07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
"19913":true,
"42":true,
"10036":"GET",
"52":false,
"10102":"gzip",
"78":10,
"13":10,
"81":0,
"64":false,
"44":false,
"10023":["Cache-Control: max-age=0","Host: example.com"],
"84":2,
"10018":"lscache_runner ",
"10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516","10022":"litespeed_hash=qwert"
}'

These numbers are the integer values of PHP's CURLOPT_* constants. After some googling, 78 and 13 are particularly interesting: they represent CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT, the curl connection timeout and total timeout.
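
Rather than googling each number, you can also map them back to constant names from any PHP CLI; a quick sketch:

  // Build a reverse map from CURLOPT integer values to constant names.
  $map = array() ;
  foreach ( get_defined_constants( true )[ 'curl' ] as $name => $value ) {
      if ( strpos( $name, 'CURLOPT_' ) === 0 ) {
          $map[ $value ] = $name ;
      }
  }
  echo $map[ 78 ] . "\n" ; // CURLOPT_CONNECTTIMEOUT
  echo $map[ 13 ] . "\n" ; // CURLOPT_TIMEOUT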

Let's apply these options to our curl command.

[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received

So this is the root cause: the uncached page takes more than 10 seconds to load.
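
You can also replay the request from PHP exactly the way the crawler does, using the timeout options dumped above (a sketch; the URL is the same example page):

  // Replay the crawler's request with the same 10-second timeouts.
  $ch = curl_init( 'https://example.com/product-name-1' ) ;
  curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ) ;
  curl_setopt( $ch, CURLOPT_HEADER, true ) ;
  curl_setopt( $ch, CURLOPT_ENCODING, 'gzip' ) ;
  curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 10 ) ;
  curl_setopt( $ch, CURLOPT_TIMEOUT, 10 ) ;
  curl_setopt( $ch, CURLOPT_USERAGENT, 'lscache_runner' ) ;
  if ( curl_exec( $ch ) === false ) {
      // On this slow, uncached page: "Operation timed out after ..."
      echo 'curl error: ' . curl_error( $ch ) . "\n" ;
  }
  curl_close( $ch ) ;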

Let's run one more test to confirm it:

[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s

So yes, the uncached page takes more than 16 seconds to load, which exceeds the curl timeout. That is why the debug log shows empty headers.

Next, let's find where the timeout is set, using grep:

[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php:             $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php:          $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_TIMEOUT => 10,

The first two matches are the 15-second timeout used when fetching the sitemap; the last two are the per-page curl timeouts we are after. Open `litespeed-cache/lib/litespeed/litespeed-crawler.class.php` and look around lines 561-572:

			CURLOPT_RETURNTRANSFER => true,
			CURLOPT_HEADER => true,
			CURLOPT_CUSTOMREQUEST => 'GET',
			CURLOPT_FOLLOWLOCATION => false,
			CURLOPT_ENCODING => 'gzip',
			CURLOPT_CONNECTTIMEOUT => 10,
			CURLOPT_TIMEOUT => 10,
			CURLOPT_SSL_VERIFYHOST => 0,
			CURLOPT_SSL_VERIFYPEER => false,
			CURLOPT_NOBODY => false,
			CURLOPT_HTTPHEADER => $this->_curl_headers,

Raise both timeouts from 10 to a higher number, such as 30 seconds.
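
After the edit, the two timeout lines would read, for example:

			CURLOPT_CONNECTTIMEOUT => 30,
			CURLOPT_TIMEOUT => 30,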

Run a manual crawl again: the previously blacklisted URIs are no longer being added to the blacklist.

Manually editing the code to raise the timeout is a temporary workaround. The default timeout will be changed to 30 seconds in LSCWP 2.9.8.4, and will be made configurable in LSCWP 3.0 for users who need an even longer limit.
