This shows you the differences between two versions of the page.

litespeed_wiki:cache:lscwp:troubleshooting:crawler_blacklist [2019/07/11 21:48]
litespeed_wiki:cache:lscwp:troubleshooting:crawler_blacklist [2019/07/15 14:36]
Lisa Clarke Copyediting
Line 1: Line 1:
-By default , LSCWP built-in crawler will add an URI into blacklist if following conditions are met:+====== LiteSpeed Cache Crawler Blacklist ====== 
 +By default, LSCWP's built-in crawler will add a URI to the blacklist if either of the following conditions is met:
-1. the page is not cache by design or default , in other word, any pages that sends response header ''x-litespeed-cache-control: no-cache'' will be added into blacklist after initial crawling. +  - The page is not cacheable by design or by default. In other words, any page that sends the response header ''x-litespeed-cache-control: no-cache'' is added to the blacklist after the initial crawl.
- +  - The page doesn't respond with one of the following status lines: <code>HTTP/1.1 200 OK
-2. If the page is not responding ​the following headers: +
- +
-<​code>​HTTP/​1.1 200 OK+
 HTTP/1.1 201 Created HTTP/1.1 201 Created
 HTTP/2 200 HTTP/2 200
 HTTP/2 201</​code>​ HTTP/2 201</​code>​
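The two conditions above can be sketched as a small shell check. This is only an illustration of the rule as described on this page, not the plugin's actual logic (that lives in the PHP crawler class). The function name ''would_blacklist'' and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LSCWP): decide whether a crawled response
# would be blacklisted under the two conditions described above.
#   $1 = status line of the response (may be empty if nothing was received)
#   $2 = value of the x-litespeed-cache-control header (may be empty)
would_blacklist() {
    # curl prints header lines with a trailing carriage return; strip it.
    local status_line="${1%$'\r'}"
    local cache_control="$2"

    # Condition 1: the page declares itself non-cacheable.
    case "$cache_control" in
        *no-cache*) echo "yes"; return 0 ;;
    esac

    # Condition 2: the status line is not one of the four accepted ones.
    case "$status_line" in
        "HTTP/1.1 200 OK"|"HTTP/1.1 201 Created"|"HTTP/2 200"|"HTTP/2 201")
            echo "no" ;;
        *)
            echo "yes" ;;
    esac
}
```

Note that an empty status line (for example, when no response was received at all) falls through to the second condition and counts as blacklisted.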
-====== One Real Debug Case: ====== +Knowing these conditions can help us to troubleshoot ​pages that are unexpectedly blacklisted.
- +
-===== Problem: ===== +
- +
-a user reports some pages are always being added into blacklist after first crawling, although manually use curl or Chrome browser , it always shows ''​x-litespeed-cache''​ header and ''​200 OK''​ status code, but there are always dozens of URIs being added into blacklist when doing crawl. +
- +
-===== Analyze: ===== +
-So as mentioned above , we know the condition why it is blacklist , so we just need to figure what happened to trigger crawler to add it into blacklist.+===== Problem ===== 
 +Particular pages are being added to the blacklist after the first crawl, but when you check manually (through the browser or with curl) you see the ''x-litespeed-cache'' header and a ''200 OK'' status code. So why are the URIs ending up in the blacklist?
-=====  Investigation===== +=====  Investigation =====  
 +Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.
-Upon the checking debug log , but apparently it didn't log the response header, so we will need a little modification.+Open the following file: ''​litespeed-cache/​lib/​litespeed/​litespeed-crawler.class.php''​
-So we add a line to log more by inserting ​following code into file ''​litespeed-cache/​lib/​litespeed/​litespeed-crawler.class.php'' ​at line 273+Add the following code at line 273, to allow us to log more information:​
 <​code>​ LiteSpeed_Cache_Log::​debug( '​crawler logs headers',​ $headers ) ; </​code>​ <​code>​ LiteSpeed_Cache_Log::​debug( '​crawler logs headers',​ $headers ) ; </​code>​
-This way we will get the ''​$headers'' ​when crawler deals it.+Now, when the crawler processes a URI, the ''​$headers'' ​will be written to the debug log.
-Now after a manual crawling let's check the debug.log by ''grep headers /path/to/wordpress/wp-content/debug.log'' +Run the crawler manually, and check the log with ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this:
 {{:​litespeed_wiki:​cache:​lscwp:​lscwp-crawler-debug1.jpg|}} {{:​litespeed_wiki:​cache:​lscwp:​lscwp-crawler-debug1.jpg|}}
-So here is the problem most of logs shows header is ''HTTP/1.1 200 OK'' but some headers are empty, that's the reason why it is being added into blacklist. +So here is the problem: most of the log entries show the header ''HTTP/1.1 200 OK'', but a few of them are empty. It's the URIs with empty headers that are being added to the blacklist.
-But why if manually run a curl , it just works as normal ? +But why, if you run curl manually, does it look fine?
 <​code>​[root@test ~]# curl -I -XGET https://​example.com/​product-name-1 <​code>​[root@test ~]# curl -I -XGET https://​example.com/​product-name-1
Line 59: Line 53:
 CF-RAY: 5f5db4fd1c234c56-AMS</​code>​ CF-RAY: 5f5db4fd1c234c56-AMS</​code>​
-It returns 200 OK , and `x-litespeed-cache-control:​ public, so why header ​is empty in previous debugging process?+This URI returns ​''​200''​, and ''​x-litespeed-cache-control:​ public''​, so why is the header ​empty in the previous debugging process?
-So what to do next ?+To figure it out, we can mimic the exact options the PHP curl used, and see what's going on.
-Naturally , we will now need to mimic the exact options the PHP curl used and see what's going on. +To log the curl options the crawler used, add the following code to ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so:
- +
-Add another debug log code to grab the curl options crawler used, add following code into ''​litespeed-cache/​lib/​litespeed/​litespeed-crawler.class.php'' ​, in line 627 , before ''​return $options ;''​+
 <​code>​ $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ; <​code>​ $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
Line 71: Line 63:
  return $options ;</​code>​  return $options ;</​code>​
-Now let's manually crawl it again to get the all the options. +Now, manually crawl again to get all the options.
 <​code>​07/​11/​19 14:​20:​15.374 [​37386 1 ZWh] crawler logs headers2 --- '{ <​code>​07/​11/​19 14:​20:​15.374 [​37386 1 ZWh] crawler logs headers2 --- '{
Line 90: Line 82:
 }'</​code>​ }'</​code>​
-These numbers are PHP curlset reference code , after some googling , the ''78'' and ''13'' are particularly interesting , they represent ''curl connection timeout'' and ''curl timeout'' +The numbers you see are the numeric values of PHP's curl option constants. An internet search reveals that ''78'' and ''13'' are particularly interesting: they represent the curl connection timeout (''CURLOPT_CONNECTTIMEOUT'') and the overall curl timeout (''CURLOPT_TIMEOUT'').
-Let's apply these otpions into our curl command.+Let's apply these options to our curl command.
 <​code>​[root@test ~]# curl -I -XGET --max-time 10 https://​example.com/​product-name-1 <​code>​[root@test ~]# curl -I -XGET --max-time 10 https://​example.com/​product-name-1
 curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</​code>​ curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</​code>​
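As a rough sketch of how the numeric options from the debug log map onto command-line flags: in libcurl, option ''13'' is ''CURLOPT_TIMEOUT'' (''--max-time'' on the command line) and option ''78'' is ''CURLOPT_CONNECTTIMEOUT'' (''--connect-timeout''). The helper names below are hypothetical, and only these two options are handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a numeric libcurl option into the
# equivalent curl command-line flag. Only the two timeouts are covered.
curlopt_to_flag() {
    case "$1" in
        13) echo "--max-time" ;;          # CURLOPT_TIMEOUT
        78) echo "--connect-timeout" ;;   # CURLOPT_CONNECTTIMEOUT
        *)  echo "unknown"; return 1 ;;
    esac
}

# Build the timeout portion of a curl command line.
#   $1 = connection timeout in seconds (value of option 78 in the log)
#   $2 = total timeout in seconds      (value of option 13 in the log)
build_timeout_args() {
    echo "$(curlopt_to_flag 78) $1 $(curlopt_to_flag 13) $2"
}
```

For example, assuming the log showed ''10'' for both options, ''curl -I -XGET $(build_timeout_args 10 10) https://example.com/product-name-1'' would reproduce both timeouts rather than only ''--max-time''.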
-So this is the root cause , the page without cache takes more than 10 seconds to load.+So this confirms a timeout ​is the root cause of the problem. Without cache, the page takes more than ten seconds to load.
-Let's do more test to confirm it +Let's do one more test to confirm it:
 <code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/ <code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Line 106: Line 98:
 Total: 16.462s</​code>​ Total: 16.462s</​code>​
-So yes , the page without cache takes more than 16 seconds to load which hit curl timeout, that is the reason why debug log shows empty header. +So yes. The page without cache takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header: the ''200'' status is never received by the crawler, and the URI is blacklisted.
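To check other URIs for the same symptom, a probe along these lines may help. It is a minimal sketch, not the crawler's own logic: the function names are hypothetical, and the default limit of ''10'' seconds is taken from the ''CURLOPT_TIMEOUT'' value discussed on this page. curl exits with code ''28'' when an operation times out, as in the error shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret the result of a single probe.
#   $1 = curl exit code, $2 = HTTP status code reported by curl
classify_probe() {
    local exit_code="$1" http_code="$2"
    if [ "$exit_code" -eq 28 ]; then
        echo "timeout"              # no status line received in time
    elif [ "$exit_code" -ne 0 ]; then
        echo "error"                # some other curl failure
    elif [ "$http_code" = "200" ] || [ "$http_code" = "201" ]; then
        echo "ok"
    else
        echo "unexpected-status"
    fi
}

# Hypothetical helper: request a URI with a total time limit and classify it.
#   $1 = URI, $2 = time limit in seconds (defaults to 10)
probe_uri() {
    local uri="$1" limit="${2:-10}" http_code exit_code
    http_code=$(curl -s -o /dev/null -w '%{http_code}' \
        --max-time "$limit" -A "lscache_runner" "$uri")
    exit_code=$?
    classify_probe "$exit_code" "$http_code"
}
```

Running ''probe_uri https://example.com/product-name-1/ 10'' against an uncached page that takes 16 seconds to load should print ''timeout'', while the same call with a limit of ''30'' should print ''ok''.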
-=====  Solution===== +=====  Solution ===== 
-We now need to figure out where the timeout is set by ''​grep''​+We need to determine ​where the timeout is set, and increase it. Use ''​grep''​:
 <​code>​[root@test litespeed-cache]#​ grep -riF "​timeout"​ --include="​*crawler*.php"​ <​code>​[root@test litespeed-cache]#​ grep -riF "​timeout"​ --include="​*crawler*.php"​
Line 118: Line 110:
 lib/​litespeed/​litespeed-crawler.class.php: ​                     CURLOPT_TIMEOUT => 10,</​code>​ lib/​litespeed/​litespeed-crawler.class.php: ​                     CURLOPT_TIMEOUT => 10,</​code>​
-So the last result , shows ''curl timeout'' is defined there, open file `litespeed-cache/lib/litespeed/litespeed-crawler.class.php` , around line 561-572 +The last result shows where the curl timeout is defined. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, around lines 561-572, raise ''CURLOPT_TIMEOUT'' from ''10'' to something higher, like ''30''. 
Line 132: Line 124:
  CURLOPT_HTTPHEADER => $this->​_curl_headers,</​code>​  CURLOPT_HTTPHEADER => $this->​_curl_headers,</​code>​
-Raise timeout from 10 to higher number , like 30 seconds. +Crawl manually ​again, ​and you will see that all of the previously ​blacklisted URIs are no longer being added to the blacklist.
- +
-Manual crawling ​again , all these previous ​blacklisted URIs are not longer being added into blacklist. +
- +
-This is a temporarily solution by manually edit the code to raise timeout , default timeout will be changed to 30 seconds in LSCWP , and will be made configurable in LSCWP 3.0 in case user may need a longer time. +
- +
 +**NOTE**: This is a //temporary solution//: manually editing the code to raise the timeout. LiteSpeed Cache's default crawler timeout will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|]], and will be made configurable in LSCWP 3.0 in case you need an even longer time.
  • Last modified: 2019/07/15 14:36
  • by Lisa Clarke