====== LiteSpeed Cache Crawler Blacklist ======

By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:

  - The page is not cacheable by design or by default; in other words, it sends the response header ''x-litespeed-cache-control: no-cache''.
  - The page does not respond with one of the following status lines:

<code>
HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201
</code>

Knowing these conditions can help us troubleshoot pages that are unexpectedly blacklisted.

===== Problem =====

Particular pages are being added to the blacklist after the first crawl, but when you check them manually (through a browser or through curl), you see the ''x-litespeed-cache'' header and a ''200 OK'' status code. So why are these URIs ending up in the blacklist?

===== Investigation =====

Checking the debug log, we find that the response header was never logged. To find out why, we need to make a small modification to the crawler class. Open the following file:

''litespeed-cache/lib/litespeed/litespeed-crawler.class.php''

Add the following code at line 273, to allow us to log more information:

<code php>
LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ;
</code>

Now, when the crawler processes a URI, ''$headers'' will be written to the debug log. Run the crawler manually, and check the output of ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this:

{{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}}

Here is the problem: most of the entries show the header ''HTTP/1.1 200 OK'', but a few of them are empty, and it is the empty ones that are being added to the blacklist. But why? If you run curl manually, everything looks fine:

<code>
[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: ; rel="https://api.w.org/"
Link: ; rel=preload; as=style,; rel=preload; as=style,; rel=preload; as=script,; rel=preload; as=script,; rel=preload; as=script,; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS
</code>

This URI returns ''200'' and ''x-litespeed-cache-control: public'', so why was the header empty during the earlier debugging? To figure it out, we can mimic the exact options that PHP's curl used and see what's going on. To grab the curl options the crawler used, add another debug call to ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so:

<code php>
$options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;

LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;

return $options ;
</code>

Now, manually crawl again to capture all of the options.
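Because the array keys of ''$options'' are the integer values behind the ''CURLOPT_*'' constants, ''json_encode()'' will log them as plain numbers. If you would rather decode those numbers programmatically than look them up online, here is a minimal standalone sketch (it is not part of LSCWP; run it on any host with the curl extension loaded, and paste in the JSON string from your own ''debug.log''):

<code php>
<?php
// Hypothetical helper, not part of LSCWP: map the numeric keys from the
// "crawler logs headers2" log entry back to curl constant names.
// Replace the sample JSON below with the string from your own debug.log.
$logged = json_decode( '{"78":10,"13":10,"10018":"lscache_runner "}', true );

// All constants defined by the curl extension, as name => value pairs.
$curl_constants = get_defined_constants( true )['curl'];

foreach ( $logged as $number => $value ) {
	// Several curl constants can share one integer value, so list every match.
	$names = array_keys( $curl_constants, (int) $number, true );
	printf( "%-6s %-45s %s\n", $number, implode( ' / ', $names ), json_encode( $value ) );
}
// Among the matches you should see CURLOPT_CONNECTTIMEOUT for 78
// and CURLOPT_TIMEOUT for 13.
</code>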
<code>
07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{ "19913":true, "42":true, "10036":"GET", "52":false, "10102":"gzip", "78":10, "13":10, "81":0, "64":false, "44":false, "10023":["Cache-Control: max-age=0","Host: example.com"], "84":2, "10018":"lscache_runner ", "10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516", "10022":"litespeed_hash=qwert" }'
</code>

The numeric keys are curl option constants. Two of them are particularly interesting: ''78'' (''CURLOPT_CONNECTTIMEOUT'', the connection timeout) and ''13'' (''CURLOPT_TIMEOUT'', the overall request timeout), both set to ten seconds. Let's apply that timeout to our curl command:

<code>
[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received
</code>

This confirms that a timeout is the root cause of the problem: without cache, the page takes more than ten seconds to load. Let's do one more test to confirm it:

<code>
[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s
</code>

So yes: the uncached page takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header: the ''200'' status is never received by the crawler, and the URI is blacklisted.

===== Solution =====

We need to determine where the timeout is set, and increase it. Use ''grep'':

<code>
[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php:    $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php:    $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php:    CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php:    CURLOPT_TIMEOUT => 10,
</code>

The last two lines show that the curl timeouts are defined in ''lib/litespeed/litespeed-crawler.class.php''. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, somewhere around lines 561-572, raise the timeout from ''10'' to something higher, like ''30'':

<code php>
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => true,
CURLOPT_CUSTOMREQUEST => 'GET',
CURLOPT_FOLLOWLOCATION => false,
CURLOPT_ENCODING => 'gzip',
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_NOBODY => false,
CURLOPT_HTTPHEADER => $this->_curl_headers,
</code>

Crawl manually again, and you will see that the previously blacklisted URIs are no longer being added to the blacklist.

**NOTE**: Manually editing the code to raise the timeout is a //temporary solution//. The crawler's default timeout will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and it will be made configurable in LSCWP 3.0, in case you need an even longer time.
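Finally, if you want to confirm that a ''30''-second budget is actually long enough for your slowest uncached pages, a small standalone PHP script can reproduce the crawler's request more closely than the plain curl commands above, since it uses the same curl options and user agent. This is only a diagnostic sketch, not part of LSCWP; the URL is a placeholder for one of your previously blacklisted URIs:

<code php>
<?php
// Diagnostic sketch (not part of LSCWP): fetch a page with roughly the same
// curl options the crawler uses, but with the raised 30-second timeout,
// and report how long the uncached page really takes.
$url = 'https://example.com/product-name-1/';   // placeholder URL

$ch = curl_init( $url );
curl_setopt_array( $ch, array(
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_HEADER         => true,
	CURLOPT_CUSTOMREQUEST  => 'GET',
	CURLOPT_ENCODING       => 'gzip',
	CURLOPT_CONNECTTIMEOUT => 10,               // connection timeout, unchanged
	CURLOPT_TIMEOUT        => 30,               // the raised overall timeout
	CURLOPT_SSL_VERIFYHOST => 0,
	CURLOPT_SSL_VERIFYPEER => false,
	CURLOPT_USERAGENT      => 'lscache_runner', // mirrors the crawler's user agent string
) );

$response = curl_exec( $ch );

if ( false === $response ) {
	// e.g. "Operation timed out after 30001 milliseconds ..." if 30 seconds is still too short
	echo 'curl error: ' . curl_error( $ch ) . PHP_EOL;
} else {
	echo 'Status line: ' . strtok( $response, "\r\n" ) . PHP_EOL;                  // e.g. HTTP/1.1 200 OK
	echo 'Total time : ' . curl_getinfo( $ch, CURLINFO_TOTAL_TIME ) . 's' . PHP_EOL;
}

curl_close( $ch );
</code>

If the status line comes back as ''200 OK'' within the new limit, the crawler should no longer blacklist that URI.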