====== LiteSpeed Cache Crawler Blacklist ======
By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:

  - The page is not cacheable by design or by default. In other words, any page that sends the response header ''x-litespeed-cache-control: no-cache'' will be added to the blacklist after the initial crawl.
  - The page doesn't respond with one of the following status lines: <code>HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201</code>
  
Knowing these conditions can help us troubleshoot pages that are unexpectedly blacklisted. As a quick self-check, you can test a single URI against these conditions yourself, as in the sketch below.
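
To illustrate, here is a minimal sketch in plain PHP (not LSCWP code; the URL and the exact checks are illustrative assumptions) that spot-checks one URI against the two conditions:

<code><?php
// Minimal sketch: spot-check one URI against the two blacklist conditions.
// The URL is hypothetical; replace it with a page from your own site.
$url     = 'https://example.com/some-page/';
$headers = get_headers( $url, 1 ); // associative array of response headers

$status = $headers[0]; // status line, e.g. "HTTP/1.1 200 OK"
$cc     = isset( $headers['X-Litespeed-Cache-Control'] ) ? (array) $headers['X-Litespeed-Cache-Control'] : array();

if ( stripos( implode( ',', $cc ), 'no-cache' ) !== false ) {
    echo "Not cacheable (condition 1): this URI would be blacklisted\n";
} elseif ( ! preg_match( '#^HTTP/[\d.]+ 20[01]\b#', $status ) ) {
    echo "No 200/201 response (condition 2): this URI would be blacklisted\n";
} else {
    echo "This URI passes the crawler's checks\n";
}</code>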

===== Problem =====
Particular pages are being added to the blacklist after the first crawl, but when you check manually (through the browser or through curl) you see the ''x-litespeed-cache'' header and a ''200 OK'' status code. So why are these URIs ending up in the blacklist?

===== Investigation =====
Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a small modification to the crawler class.

Open the following file: ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php''

Add the following code at line 273, to allow us to log more information:

<code>LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ;</code>

Now, when the crawler processes a URI, the ''$headers'' will be written to the debug log.

Run the crawler manually, and check the log with ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this:

{{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}}

So here is the problem: most of the logs show the header as ''HTTP/1.1 200 OK'', but a few of them are empty. It's the empty ones that are being added to the blacklist.

But why does everything look fine when you run curl manually?

<code>[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS</code>

This URI returns ''200'' and ''x-litespeed-cache-control: public'', so why was the header empty in the previous debugging process?

To figure it out, we can mimic the exact options the crawler's PHP curl used, and see what's going on.

To grab the curl options the crawler used, add the following debug code to ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so:

<code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
 LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;
 return $options ;</code>

Now, manually crawl again to get all of the options.

<code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
"19913":true,
"42":true,
"10036":"GET",
"52":false,
"10102":"gzip",
"78":10,
"13":10,
"81":0,
"64":false,
"44":false,
"10023":["Cache-Control: max-age=0","Host: example.com"],
"84":2,
"10018":"lscache_runner ",
"10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516",
"10022":"litespeed_hash=qwert"
}'</code>

The numbers you see are the integer values of PHP's ''CURLOPT_*'' curl constants. An internet search reveals that ''78'' and ''13'' are particularly interesting: they are ''CURLOPT_CONNECTTIMEOUT'' and ''CURLOPT_TIMEOUT'', the connection timeout and the overall request timeout, both set here to ''10'' seconds. You can also map the numbers back to constant names yourself, as in the sketch below.
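
A few lines of plain PHP (not part of LSCWP; the list of keys is taken from the log above) will do the mapping:

<code><?php
// Minimal sketch: map numeric keys from the debug log back to their
// CURLOPT_* constant names, using PHP's table of curl constants.
$logged = array( 78, 13, 42, 52, 64, 44, 81, 84 ); // keys seen in the log

$curl_constants = get_defined_constants( true )['curl'];

foreach ( $logged as $number ) {
    foreach ( $curl_constants as $name => $value ) {
        if ( $value === $number && strpos( $name, 'CURLOPT_' ) === 0 ) {
            echo $number . ' => ' . $name . PHP_EOL;
        }
    }
}</code>

Among other lines, this prints ''78 => CURLOPT_CONNECTTIMEOUT'' and ''13 => CURLOPT_TIMEOUT''.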
 + 
Let's apply these options to our curl command.
  
<code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code>
  
So this confirms that a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.
  
Let's do one more test to confirm it:
  
<code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s</code>
  
So yes: without cache, the page takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header: the ''200'' status is never received by the crawler, and the URI is blacklisted.
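
To see why the debug log ends up with an empty header rather than an error status, consider this simplified sketch of the crawler's request (plain PHP with a hypothetical URL, not the crawler's actual code): once ''CURLOPT_TIMEOUT'' is exceeded, ''curl_exec()'' returns ''false'', so no status line or headers ever reach the caller.

<code><?php
// Minimal sketch: a request that exceeds CURLOPT_TIMEOUT yields no headers
// at all, because curl_exec() returns false instead of a response.
$ch = curl_init( 'https://example.com/product-name-1' ); // hypothetical URL
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_HEADER, true );
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 10 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );

$result = curl_exec( $ch );
if ( $result === false ) {
    // e.g. "Operation timed out after 10001 milliseconds ..."
    echo 'curl error: ' . curl_error( $ch ) . PHP_EOL;
}
curl_close( $ch );</code>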
  
===== Solution =====
  
We need to determine where the timeout is set, and increase it. Use ''grep'':
  
<code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php:        $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php:          $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_TIMEOUT => 10,</code>
  
The last two results show where the curl timeouts are defined. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, somewhere around lines 561-572, raise both timeouts from ''10'' to something higher, like ''30'':
  
<code> CURLOPT_RETURNTRANSFER => true,
 CURLOPT_HEADER => true,
 CURLOPT_CUSTOMREQUEST => 'GET',
 CURLOPT_FOLLOWLOCATION => false,
 CURLOPT_ENCODING => 'gzip',
 CURLOPT_CONNECTTIMEOUT => 10,
 CURLOPT_TIMEOUT => 10,
 CURLOPT_SSL_VERIFYHOST => 0,
 CURLOPT_SSL_VERIFYPEER => false,
 CURLOPT_NOBODY => false,
 CURLOPT_HTTPHEADER => $this->_curl_headers,</code>
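
After the edit, the two timeout lines should look something like this (the surrounding options stay unchanged):

<code> CURLOPT_CONNECTTIMEOUT => 30,
 CURLOPT_TIMEOUT => 30,</code>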
  
Crawl manually again, and you will see that the previously blacklisted URIs are no longer being added to the blacklist.
  
**NOTE**: Manually editing the code to raise the timeout is a //temporary solution//. LiteSpeed Cache's default crawler timeout will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and it will be made configurable in LSCWP 3.0, in case you need an even longer time.
  
  