LiteSpeed Cache Crawler Blacklist

By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:

  1. The page is not cacheable by design or by default; in other words, it sends the response header x-litespeed-cache-control: no-cache
  2. The page doesn't respond with one of the following status lines:
    HTTP/1.1 200 OK
    HTTP/1.1 201 Created
    HTTP/2 200
    HTTP/2 201

Knowing these conditions can help us to troubleshoot pages that are unexpectedly blacklisted.
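For a quick first check, you can inspect a page's status line and cache-control header with curl (the URL here is a placeholder and the output is illustrative):

[root@test ~]# curl -sI -XGET https://example.com/some-page/ | grep -iE 'HTTP/|x-litespeed-cache-control'
HTTP/1.1 200 OK
X-Litespeed-Cache-Control: public,max-age=604800

If the status line isn't one of the four listed above, or the header says no-cache, the crawler will blacklist the URI.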

Suppose particular pages are being added to the blacklist after the first crawl, but when you check them manually (through the browser or through curl), you see the x-litespeed-cache header and a 200 OK status code. So, why are the URIs ending up in the blacklist?

Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.

Open the following file: litespeed-cache/lib/litespeed/litespeed-crawler.class.php

Add the following code at line 273, to allow us to log more information:

 LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; 

Now, when the crawler processes a URI, the contents of $headers will be written to the debug log. (This assumes LSCWP's debug logging is enabled; otherwise nothing is written.)

Run the crawler manually, and then search the log with grep headers /path/to/wordpress/wp-content/debug.log. You should see something like this:
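The exact entries will vary, but they follow this pattern (timestamps, IPs, and values here are illustrative):

07/11/19 14:18:02.113 [123.123.123.123:37386 1 ZWh] crawler logs headers --- 'HTTP/1.1 200 OK'
07/11/19 14:18:03.447 [123.123.123.123:37386 1 ZWh] crawler logs headers --- 'HTTP/1.1 200 OK'
07/11/19 14:18:05.891 [123.123.123.123:37386 1 ZWh] crawler logs headers --- ''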

So here is the problem: most of the entries show the header HTTP/1.1 200 OK, but a few of them are empty. It's the empty ones that are being added to the blacklist.

But why? If you run curl manually, everything looks fine:

[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS

This URI returns 200 and x-litespeed-cache-control: public, so why was the header empty during the earlier debugging?

To figure it out, we can mimic the exact options the crawler's PHP curl request used, and see what's going on.

To grab the curl options the crawler used, add another debug line to litespeed-cache/lib/litespeed/litespeed-crawler.class.php at line 627, directly before return $options ;, like so:

		$options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
		LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;
		return $options ;

Now, crawl manually again to get all of the options.

07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
"19913":true,
"42":true,
"10036":"GET",
"52":false,
"10102":"gzip",
"78":10,
"13":10,
"81":0,
"64":false,
"44":false,
"10023":["Cache-Control: max-age=0","Host: example.com"],
"84":2,
"10018":"lscache_runner ",
"10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516","10022":"litespeed_hash=qwert"
}'

The numbers you see are the integer values of PHP's CURLOPT_* constants. An internet search reveals that 78 and 13 are particularly interesting: they are CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT, the connection timeout and the total request timeout, respectively, and both are set to 10 seconds.
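Rather than searching for each number individually, you can map them all back to their constant names with a short PHP script (a minimal sketch; it assumes the curl extension is loaded, and a few aliased constants share the same integer value, so the name shown may be an alias):

<?php
// decode-curlopts.php -- map raw curl option numbers to CURLOPT_* names.
// The numbers below are the keys from the "crawler logs headers2" output.
$wanted = [19913, 42, 10036, 52, 10102, 78, 13, 81, 64, 44, 10023, 84, 10018, 10016, 10022];

// Build a value => name lookup table from PHP's defined curl constants.
$names = [];
foreach (get_defined_constants(true)['curl'] as $name => $value) {
    if (strpos($name, 'CURLOPT_') === 0) {
        $names[$value] = $name;
    }
}

foreach ($wanted as $num) {
    printf("%-6d %s\n", $num, $names[$num] ?? '(unknown)');
}

Among other things, it reports 78 as CURLOPT_CONNECTTIMEOUT and 13 as CURLOPT_TIMEOUT.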

Let's apply these options to our curl command.

[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received

So this confirms a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.

Let's do one more test to confirm it:

[root@test ~]# curl -s -o /dev/null -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s

So yes: the uncached page takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header; the 200 status is never received by the crawler, and the URI is blacklisted.

We need to determine where the timeout is set, and increase it. Use grep:

[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php:             $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php:          $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_TIMEOUT => 10,

The last two results show where the curl timeouts are defined. Open litespeed-cache/lib/litespeed/litespeed-crawler.class.php, and somewhere around lines 561-572 you will find the options below. Raise the timeout from 10 to something higher, like 30.

			CURLOPT_RETURNTRANSFER => true,
			CURLOPT_HEADER => true,
			CURLOPT_CUSTOMREQUEST => 'GET',
			CURLOPT_FOLLOWLOCATION => false,
			CURLOPT_ENCODING => 'gzip',
			CURLOPT_CONNECTTIMEOUT => 10,
			CURLOPT_TIMEOUT => 10,
			CURLOPT_SSL_VERIFYHOST => 0,
			CURLOPT_SSL_VERIFYPEER => false,
			CURLOPT_NOBODY => false,
			CURLOPT_HTTPHEADER => $this->_curl_headers,
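After the edit, the two timeout lines would look something like this (CURLOPT_TIMEOUT, the total-request timeout, is the one that caused the blacklisting here; raising CURLOPT_CONNECTTIMEOUT as well is optional):

			CURLOPT_CONNECTTIMEOUT => 10,
			CURLOPT_TIMEOUT => 30,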

Crawl manually again, and you will see that all of the previously blacklisted URIs are no longer being added to the blacklist.
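If you left the first debug line in place, you can verify this in the log: the entries that previously came through empty should now carry a status line (illustrative):

07/11/19 15:02:41.558 [123.123.123.123:37386 1 ZWh] crawler logs headers --- 'HTTP/1.1 200 OK'

Once you're done troubleshooting, remember to remove both temporary debug lines from litespeed-crawler.class.php.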

NOTE: Manually editing the code to raise the timeout is a temporary solution. LiteSpeed Cache's default crawler timeout will be changed to 30 seconds in LSCWP 2.9.8.4, and it will be made configurable in LSCWP 3.0, in case you need an even longer time.
