====== LiteSpeed Cache Crawler Blacklist ======
By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:

  - The page is not cacheable by design or by default. In other words, it sends the response header ''x-litespeed-cache-control: no-cache''.
  - The page does not respond with one of the following status lines: <code>HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201</code>
  
Knowing these conditions can help us troubleshoot pages that are unexpectedly blacklisted.
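
If you want to check a particular URL against these two conditions yourself, a small standalone script can do it. The following is only a sketch (it is not part of LSCWP, and ''https://example.com/some-page'' is a placeholder URL): it requests the page with the crawler's user agent and prints the status line plus any ''x-litespeed-cache-control'' header.

<code><?php
// Standalone sketch (not plugin code): request a URL the way the crawler would,
// then print the two things the blacklist conditions look at.
$url = 'https://example.com/some-page';  // placeholder URL

$ch = curl_init( $url );
curl_setopt_array( $ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => true,            // keep response headers in the output
    CURLOPT_USERAGENT      => 'lscache_runner',
) );
$response    = curl_exec( $ch );
$header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );
curl_close( $ch );

foreach ( explode( "\r\n", substr( (string) $response, 0, $header_size ) ) as $line ) {
    // The status line (HTTP/1.1 200 OK, HTTP/2 200, ...) and the cache-control
    // header decide whether the URI would be blacklisted.
    if ( stripos( $line, 'HTTP/' ) === 0 || stripos( $line, 'x-litespeed-cache-control' ) === 0 ) {
        echo $line . "\n";
    }
}</code>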
  
===== Problem =====
Particular pages are being added to the blacklist after the first crawling, but when you check manually (through the browser or through curl) you see the ''x-litespeed-cache'' header and a ''200 OK'' status code. So, why are these URIs ending up in the blacklist?
  
===== Investigation =====
Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.

Open the following file: ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php''

Add the following code at line 273, to allow us to log more information:

<code> LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; </code>

Now, when the crawler processes a URI, the ''$headers'' will be written to the debug log.

Run the crawler manually, and check the log with ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this:
  
{{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}}
  
So here is the problem: most of the logs show the header is ''HTTP/1.1 200 OK'', but a few of them are empty. It's the empty ones that are being added to the blacklist.
  
But why, when you run curl manually, does it look fine?
  
<code>[root@test ~]# curl -I -XGET https://example.com/product-name-1
CF-RAY: 5f5db4fd1c234c56-AMS</code>
  
This URI returns ''200'' and ''x-litespeed-cache-control: public'', so why was the header empty in the previous debugging process?
  
To figure it out, we can mimic the exact options the PHP curl used, and see what's going on.

To grab the curl options the crawler used, add the following debug logging code to ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so:
  
<code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
 return $options ;</code>
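
The logging line itself is not shown in the snippet above. Judging by the ''crawler logs headers2'' tag that appears in the log output below, it would be something along these lines (an assumed reconstruction, not necessarily the exact line):

<code> // Assumed reconstruction of the extra debug call; the exact line may differ.
 LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;</code>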
  
Now, manually crawl it again to get all the options.
  
<code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
}'</code>
  
The numbers you see in the logged options are the integer values of PHP's ''CURLOPT_*'' constants. An internet search reveals that ''78'' and ''13'' are particularly interesting: they correspond to ''CURLOPT_CONNECTTIMEOUT'' and ''CURLOPT_TIMEOUT'', the curl connection timeout and the curl timeout.
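
If you'd rather confirm the mapping yourself than rely on a search result, the ''CURLOPT_*'' names in PHP are ordinary integer constants, so a quick check will print them (a sketch; run it with any PHP that has the curl extension):

<code><?php
// The numeric keys in the logged options array are the integer values of the
// CURLOPT_* constants. Print the two we care about to confirm the mapping.
echo 'CURLOPT_TIMEOUT        = ' . CURLOPT_TIMEOUT . "\n";         // prints 13
echo 'CURLOPT_CONNECTTIMEOUT = ' . CURLOPT_CONNECTTIMEOUT . "\n";  // prints 78</code>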
  
Let's apply these options to our curl command.
  
<code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code>
  
So this confirms that a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.
  
Let's do one more test to confirm it:
  
<code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Total: 16.462s</code>
  
So yes: the page without cache takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header, the ''200'' status is never received by the crawler, and the URI is blacklisted.
  
===== Solution =====
  
We need to determine where the timeout is set, and increase it. Use ''grep'':
  
<code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_TIMEOUT => 10,</code>
  
The last result shows that the ''curl timeout'' is defined there. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, somewhere around lines 561-572, raise the timeout from ''10'' to something higher, like ''30''.
  
<code> CURLOPT_RETURNTRANSFER => true,
 CURLOPT_HTTPHEADER => $this->_curl_headers,</code>
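
Before re-running the crawler, you can check whether the higher value is actually enough for your slowest page. The snippet below is only a sketch (the URL is a placeholder, and the real crawler sets more options than shown): it repeats the crawler-style request with the raised timeouts and reports the result.

<code><?php
// Standalone sketch (not plugin code): request one slow, uncached page with the
// raised timeouts and report whether curl still gives up.
$url = 'https://example.com/product-name-1/';  // placeholder for your slow URL

$ch = curl_init( $url );
curl_setopt_array( $ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 30,   // connection timeout
    CURLOPT_TIMEOUT        => 30,   // total timeout, raised from the plugin default of 10
    CURLOPT_USERAGENT      => 'lscache_runner',
) );
curl_exec( $ch );

echo 'HTTP code : ' . curl_getinfo( $ch, CURLINFO_HTTP_CODE ) . "\n";
echo 'Total time: ' . curl_getinfo( $ch, CURLINFO_TOTAL_TIME ) . "s\n";
echo 'curl error: ' . ( curl_error( $ch ) ?: 'none' ) . "\n";
curl_close( $ch );</code>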
  
Crawl manually again, and you will see that all of the previously blacklisted URIs are no longer being added to the blacklist.
  
**NOTE**: This is a //temporary solution//, manually editing the code to raise the timeout. LiteSpeed Cache's crawler timeout default will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and it will be made configurable in LSCWP 3.0 in case you need an even longer time.
  
  