By default, the LSCWP built-in crawler will add a URI to the blacklist if either of the following conditions is met (a minimal sketch of this check follows the list):
  
1. The page is not cacheable by design or by default. In other words, any page that sends the response header ''x-litespeed-cache-control: no-cache'' will be added to the blacklist after the initial crawl.
  
2. The page does not respond with one of the accepted success status lines, such as ''HTTP/1.1 200 OK'' or ''HTTP/2 201''.
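To make these conditions concrete, here is a minimal sketch of the decision. It is not the plugin's actual code: ''$headers'' is assumed to hold the raw response headers of one crawled page, and ''$uri'' and ''$blacklist'' are illustrative names.

<code>// Minimal sketch of the blacklist decision, not LSCWP's actual implementation.
// $headers is assumed to hold the raw response headers of one crawled page.
$sent_no_cache = stripos( $headers, 'x-litespeed-cache-control: no-cache' ) !== false ;

// Only success status lines are accepted; two examples are checked here.
$status_ok = stripos( $headers, 'HTTP/1.1 200 OK' ) !== false
	|| stripos( $headers, 'HTTP/2 201' ) !== false ;

if ( $sent_no_cache || ! $status_ok ) {
	$blacklist[] = $uri ; // condition 1 or 2 met: blacklist this URI
}</code>

Note that an empty ''$headers'' string also fails the status check, which is exactly what happens in the case below.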
  
Here is one real debugging case.
  
===== Problem =====
  
A user reports that some pages are always added to the blacklist after the first crawl, although requesting them manually with curl or the Chrome browser always shows the ''x-litespeed-cache'' header and a ''200 OK'' status code. Still, dozens of URIs end up in the blacklist every time the crawler runs.
  
===== Analysis =====
  
As mentioned above, we know the conditions under which a URI is blacklisted, so we just need to figure out what triggers the crawler to add these pages to the blacklist.
  
===== Investigation =====
  
Checking the debug log, we find that it does not record the response headers, so we need a small modification.
  
We add a line to log more detail by inserting the following code into ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 273:
  
<code> LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ;</code>
  
This way, we get the ''$headers'' value when the crawler processes each URI.
  
Now, after a manual crawl, let's check the debug log with ''grep headers /path/to/wordpress/wp-content/debug.log''.

{{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}}

Here is the problem: most entries show the header ''HTTP/1.1 200 OK'', but some headers are empty. That is why those URIs are being added to the blacklist.

But why does it work normally when we run curl manually?
<code>[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
Date: Thu, 11 Jul 2019 20:57:54 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly
Cf-Railgun: direct (starting new WAN connection)
Link: <https://example.com/wp-json/>; rel="https://api.w.org/"
Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly
Vary: Accept-Encoding
X-Litespeed-Cache: miss
X-Litespeed-Cache-Control: public,max-age=604800
X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_
X-Powered-By: PHP/7.3.6
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5f5db4fd1c234c56-AMS</code>
It returns ''200 OK'' and ''x-litespeed-cache-control: public'', so why was the header empty in the previous debugging step?

What should we do next?

Naturally, we now need to mimic the exact options the PHP curl call used and see what is going on.

Add another debug line to capture the curl options the crawler uses: in ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'', at line 627, before ''return $options ;'', add the following code.
<code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;
 LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ;
 return $options ;</code>
Now let's crawl manually again to capture all of the options.
<code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
"19913":true,
"42":true,
"10036":"GET",
"52":false,
"10102":"gzip",
"78":10,
"13":10,
"81":0,
"64":false,
"44":false,
"10023":["Cache-Control: max-age=0","Host: example.com"],
"84":2,
"10018":"lscache_runner ",
"10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516",
"10022":"litespeed_hash=qwert"
}'</code>
These numbers are the integer values of PHP's curl option constants. After some googling, ''78'' and ''13'' are particularly interesting: they represent the curl connection timeout and the curl timeout.
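If you would rather not look the numeric codes up by hand, you can translate them by building a reverse map of the ''CURLOPT_*'' constants. This is a standalone helper snippet, not part of LSCWP:

<code><?php
// Standalone helper (not part of LSCWP): map the numeric keys from the logged
// options array back to their CURLOPT_* constant names.
$names_by_value = array() ;
foreach ( get_defined_constants( true )[ 'curl' ] as $name => $value ) {
	if ( strpos( $name, 'CURLOPT_' ) === 0 ) {
		$names_by_value[ $value ] = $name ;
	}
}

echo $names_by_value[ 78 ] . "\n" ; // CURLOPT_CONNECTTIMEOUT
echo $names_by_value[ 13 ] . "\n" ; // CURLOPT_TIMEOUT</code>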
Let's apply these options to our curl command.

<code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code>

So this is the root cause: the uncached page takes more than 10 seconds to load.

Let's run one more test to confirm it.

<code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
Establish Connection: 0.006s
TTFB: 16.455s
Total: 16.462s</code>

So yes, the uncached page takes more than 16 seconds to load, which exceeds the curl timeout. That is why the debug log shows an empty header.
===== Solution =====
We now need to find out where the timeout is set, using ''grep'':

<code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
includes/litespeed-cache-crawler.class.php:             $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
inc/crawler.class.php:          $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ;
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_CONNECTTIMEOUT => 10,
lib/litespeed/litespeed-crawler.class.php:                      CURLOPT_TIMEOUT => 10,</code>

The last two results show that the curl timeouts are defined in ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php''. Open that file and look around lines 561-572:
<code> CURLOPT_RETURNTRANSFER => true,
 CURLOPT_HEADER => true,
 CURLOPT_CUSTOMREQUEST => 'GET',
 CURLOPT_FOLLOWLOCATION => false,
 CURLOPT_ENCODING => 'gzip',
 CURLOPT_CONNECTTIMEOUT => 10,
 CURLOPT_TIMEOUT => 10,
 CURLOPT_SSL_VERIFYHOST => 0,
 CURLOPT_SSL_VERIFYPEER => false,
 CURLOPT_NOBODY => false,
 CURLOPT_HTTPHEADER => $this->_curl_headers,</code>
Raise the timeout from 10 to a higher value, such as 30 seconds.
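For illustration, the two timeout lines would then read as follows; 30 is just an example value, so use whatever comfortably exceeds your slowest uncached page:

<code> CURLOPT_CONNECTTIMEOUT => 30,
 CURLOPT_TIMEOUT => 30,</code>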
After crawling manually again, the previously blacklisted URIs are no longer added to the blacklist.

This is a temporary solution that manually edits the code to raise the timeout. The default timeout will be changed to 30 seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and it will be made configurable in LSCWP 3.0 in case a user needs a longer time.
  
  
  
  