====== LiteSpeed Cache Crawler Blacklist ======
By default, LSCWP's built-in crawler will add a URI to the blacklist if the following conditions are met:

  - The page is not cacheable by design or by default. In other words, any pages that send the response header ''x-litespeed-cache-control: no-cache'' will be added to the blacklist after the initial crawl.
  - The page doesn't respond with one of the following status lines: <code>HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201</code>

Knowing these conditions can help us to troubleshoot pages that are unexpectedly blacklisted.
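
For illustration only, here is a minimal PHP sketch of those two conditions. This is not the plugin's actual code; the function name and the header parsing are assumptions:

<code><?php
// Hypothetical sketch: decide whether a crawled URI should be
// blacklisted, given the raw response headers as one string.
function should_blacklist( $headers ) {
    // Condition 1: the page declares itself non-cacheable.
    if ( stripos( $headers, 'x-litespeed-cache-control: no-cache' ) !== false ) {
        return true ;
    }
    // Condition 2: the status line is not one of the accepted ones.
    $accepted = array( 'HTTP/1.1 200', 'HTTP/1.1 201', 'HTTP/2 200', 'HTTP/2 201' ) ;
    foreach ( $accepted as $status ) {
        if ( stripos( $headers, $status ) === 0 ) {
            return false ;
        }
    }
    return true ;
}
</code>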
+ | |||
+ | ===== Problem ===== | ||
+ | Particular pages are being added to the blacklist after the first crawling, but when you check manually (through the browser or through curl) you see the ''x-litespeed-cache'' header and ''200 OK'' status code. So, why are the URIs ending up in the blacklist? | ||
+ | |||
+ | ===== Investigation ===== | ||
+ | Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class. | ||
+ | |||
+ | Open the following file: ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' | ||
+ | |||
+ | Add the following code at line 273, to allow us to log more information: | ||
+ | |||
+ | <code> LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; </code> | ||
+ | |||
+ | Now, when the crawler processes a URI, the ''$headers'' will be written to the debug log. | ||
+ | |||
+ | Run the crawler manually, and check ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this: | ||
+ | |||
+ | {{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}} | ||
+ | |||
+ | So here is the problem: most of the logs show the header is ''HTTP/1.1 200 OK'' but a few of them are empty. It's the empty ones that are being added to the blacklist. | ||
+ | |||
+ | But why, if you manually run a curl, it looks fine? | ||
+ | |||
<code>[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS</code>
+ | |||
+ | This URI returns ''200'', and ''x-litespeed-cache-control: public'', so why is the header empty in the previous debugging process? | ||
+ | |||
+ | To figure it out, we can mimic the exact options the PHP curl used, and see what's going on. | ||
+ | |||
+ | To add another debug log code to grab the curl options the crawler used, add following code into ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so: | ||
+ | |||
+ | <code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ; | ||
+ | LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ; | ||
+ | return $options ;</code> | ||
+ | |||
+ | Now, manually crawl it again to get the all the options. | ||
+ | |||
<code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
"19913":true,
"42":true,
"10036":"GET",
"52":false,
"10102":"gzip",
"78":10,
"13":10,
"81":0,
"64":false,
"44":false,
"10023":["Cache-Control: max-age=0","Host: example.com"],
"84":2,
"10018":"lscache_runner ",
"10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516",
"10022":"litespeed_hash=qwert"
}'</code>
+ | |||
+ | The numbers you see are PHP curlset reference code. An internet search reveals that the ''78'' and ''13'' are particularly interesting. They represent ''curl connection timeout'' and ''curl timeout''. | ||
+ | |||

Let's apply these options to our curl command:

<code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code>

So this confirms that a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.

Let's do one more test to confirm it:

<code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s</code>

So yes: the page without cache takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header, the ''200'' status is never received by the crawler, and the URI is blacklisted.
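
To see why a timeout leaves the header empty, consider a minimal sketch. This illustrates standard PHP curl behavior and is not the plugin's actual code: when ''CURLOPT_TIMEOUT'' is exceeded, ''curl_exec()'' returns ''false'', so no headers are ever captured.

<code><?php
// Illustration only: a request that exceeds CURLOPT_TIMEOUT makes
// curl_exec() return false, leaving the logged $headers empty.
$ch = curl_init( 'https://example.com/product-name-1/' ) ;
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ) ;
curl_setopt( $ch, CURLOPT_HEADER, true ) ;
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 ) ;
$result = curl_exec( $ch ) ;
if ( $result === false ) {
    // e.g. "Operation timed out after 10001 milliseconds ..."
    echo curl_error( $ch ) ;
}
curl_close( $ch ) ;
</code>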

===== Solution =====
We need to determine where the timeout is set, and increase it. Use ''grep'':

<code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php: CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php: CURLOPT_TIMEOUT => 10,</code>

The last two results show where the crawler's curl timeouts are defined. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, somewhere around lines 561-572, raise the timeout from ''10'' to something higher, like ''30''.
<code>CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => true,
CURLOPT_CUSTOMREQUEST => 'GET',
CURLOPT_FOLLOWLOCATION => false,
CURLOPT_ENCODING => 'gzip',
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_NOBODY => false,
CURLOPT_HTTPHEADER => $this->_curl_headers,</code>
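
For example, the edited ''CURLOPT_TIMEOUT'' line would read as follows. Here ''30'' is just an example value; the connection timeout can stay at ''10'', since the timing test above showed the connection itself is fast:

<code>CURLOPT_TIMEOUT => 30,</code>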

Crawl manually again, and you will see that the previously blacklisted URIs are no longer being added to the blacklist.

**NOTE**: Manually editing the code to raise the timeout is a //temporary solution//. LiteSpeed Cache's default crawler timeout will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and will be made configurable in LSCWP 3.0, in case you need an even longer time.