By default, the LSCWP built-in crawler will add a URI to the blacklist if one of the following conditions is met:
1. The page is not cacheable by design or by default; in other words, any page that sends the response header ''x-litespeed-cache-control: no-cache'' will be added to the blacklist after the initial crawl.
2. The page does not respond with one of the following status headers:

<code>…
HTTP/2 201</code>
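To check the first condition for any page yourself, you can grep the response headers for the ''no-cache'' value. A minimal sketch — the sample headers below are made up for illustration; in practice you would pipe in real output from ''curl -sI'':

```shell
# Simulated response headers from an uncacheable page (sample values, for illustration only).
headers='HTTP/1.1 200 OK
X-Litespeed-Cache-Control: no-cache
Content-Type: text/html'

# A no-cache value here means the crawler would blacklist this URI after the first crawl.
if printf '%s\n' "$headers" | grep -qi '^x-litespeed-cache-control: no-cache'; then
  echo "URI would be blacklisted"
fi
```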
===== Problem: =====
A user reported that some pages were always added to the blacklist after the first crawl, even though fetching them manually with curl or Chrome always showed the ''x-litespeed-cache'' header and a ''200 OK'' status code. Still, dozens of URIs ended up on the blacklist after every crawl.
===== Analysis: =====
As mentioned above, we know the conditions under which a URI gets blacklisted, so we just need to figure out what triggered the crawler to blacklist these pages.
===== Investigation: =====
We checked the debug log, but it did not record the response headers, so a small code modification is needed.

To log more, insert the following line into ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 273:
- | LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; | + | <code> LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; </code> |
This way, ''$headers'' is logged whenever the crawler processes a URI.

Now, after a manual crawl, check debug.log with ''grep headers /path/to/wordpress/wp-content/debug.log''.
+ | |||
+ | {{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}} | ||
+ | |||
+ | So here is the problem , most of logs shows header is ''HTTP/1.1 200 OK'' but some headers are empty, that's the reason why it is being added into blacklist. | ||
+ | |||
+ | But why if manually run a curl , it just works as normal ? | ||
+ | |||
+ | <code>[root@test ~]# curl -I -XGET https://example.com/product-name-1 | ||
+ | HTTP/1.1 200 OK | ||
+ | Date: Thu, 11 Jul 2019 20:57:54 GMT | ||
+ | Content-Type: text/html; charset=UTF-8 | ||
+ | Transfer-Encoding: chunked | ||
+ | Connection: keep-alive | ||
+ | Set-Cookie: __cfduid=some-string-here; expires=Fri, 10-Jul-20 20:57:43 GMT; path=/; domain=.example.com; HttpOnly | ||
+ | Cf-Railgun: direct (starting new WAN connection) | ||
+ | Link: <https://example.com/wp-json/>; rel="https://api.w.org/" | ||
+ | Link: </min/186a9.css>; rel=preload; as=style,</min/f7e97.css>; rel=preload; as=style,</wp-content/plugins/plugin/jquery.min.js>; rel=preload; as=script,</min/7f44e.js>; rel=preload; as=script,</min/a8512.js>; rel=preload; as=script,</wp-content/plugins/litespeed-cache/js/webfontloader.min.js>; rel=preload; as=script | ||
+ | Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly | ||
+ | Set-Cookie: wp_woocommerce_session_string=value; expires=Sat, 13-Jul-2019 20:57:43 GMT; Max-Age=172799; path=/; secure; HttpOnly | ||
+ | Vary: Accept-Encoding | ||
+ | X-Litespeed-Cache: miss | ||
+ | X-Litespeed-Cache-Control: public,max-age=604800 | ||
+ | X-Litespeed-Tag: 98f_WC_T.156,98f_WC_T.494,98f_WC_T.48,98f_product_cat,98f_URL.e3a528ab8c54fd1cf6bf060091288580,98f_T.156,98f_ | ||
+ | X-Powered-By: PHP/7.3.6 | ||
+ | Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct" | ||
+ | Server: cloudflare | ||
+ | CF-RAY: 5f5db4fd1c234c56-AMS</code> | ||
+ | |||
+ | It returns 200 OK , and `x-litespeed-cache-control: public` , so why header is empty in previous debugging process? | ||
+ | |||
+ | So what to do next ? | ||
+ | |||
+ | Naturally , we will now need to mimic the exact options the PHP curl used and see what's going on. | ||
+ | |||
+ | Add another debug log code to grab the curl options crawler used, add following code into ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' , in line 627 , before ''return $options ;'' | ||
+ | |||
+ | <code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ; | ||
+ | LiteSpeed_Cache_Log::debug( 'crawler logs headers2', json_encode( $options ) ) ; | ||
+ | return $options ;</code> | ||
+ | |||
+ | Now let's manually crawl it again to get the all the options. | ||
+ | |||
+ | <code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{ | ||
+ | "19913":true, | ||
+ | "42":true, | ||
+ | "10036":"GET", | ||
+ | "52":false, | ||
+ | "10102":"gzip", | ||
+ | "78":10, | ||
+ | "13":10, | ||
+ | "81":0, | ||
+ | "64":false, | ||
+ | "44":false, | ||
+ | "10023":["Cache-Control: max-age=0","Host: example.com"], | ||
+ | "84":2, | ||
+ | "10018":"lscache_runner ", | ||
+ | "10016":"http:\/\/example.com\/wp-cron.php?doing_wp_cron=1234567890.12345678910111213141516","10022":"litespeed_hash=qwert" | ||
+ | }'</code> | ||
+ | |||
+ | These numbers are PHP curlset reference code , after some googling , the ''78'' and ''13'' are particularly interesting , they represent ''curl connection timeout'' and ''curl timeout'' | ||
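To decode the remaining numeric keys without guessing, you can map them back to constant names. A quick sketch — the table below hard-codes the standard integer values from PHP's curl extension (not taken from this log), so double-check any value against your PHP build:

```shell
# Map numeric keys from the debug log back to PHP CURLOPT_* constant names.
# (Assumption: these are the standard integer values of PHP's curl constants.)
curlopt_name() {
  case "$1" in
    13)    echo "CURLOPT_TIMEOUT" ;;
    42)    echo "CURLOPT_HEADER" ;;
    44)    echo "CURLOPT_NOBODY" ;;
    52)    echo "CURLOPT_FOLLOWLOCATION" ;;
    78)    echo "CURLOPT_CONNECTTIMEOUT" ;;
    10018) echo "CURLOPT_USERAGENT" ;;
    10023) echo "CURLOPT_HTTPHEADER" ;;
    *)     echo "unknown ($1)" ;;
  esac
}

curlopt_name 78   # the 10-second connection timeout from the log
curlopt_name 13   # the 10-second total timeout from the log
```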
+ | |||
+ | Let's apply these otpions into our curl command. | ||
+ | |||
+ | <code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1 | ||
+ | curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code> | ||
+ | |||
+ | So this is the root cause , the page without cache takes more than 10 seconds to load. | ||
+ | |||
+ | Let's do more test to confirm it | ||
+ | |||
+ | <code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner https://example.com/product-name-1/ | ||
+ | Establish Connection: 0.006s | ||
+ | TTFB: 16.455s | ||
+ | Total: 16.462s</code> | ||
+ | |||
+ | So yes , the page without cache takes more than 16 seconds to load which hit curl timeout, that is the reason why debug log shows empty header. | ||
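The arithmetic behind that conclusion can be checked directly. A trivial sketch using the TTFB measured above and the crawler's default 10-second timeout:

```shell
# Compare the measured uncached TTFB against the crawler's curl timeout.
timeout=10       # default CURLOPT_TIMEOUT in the crawler
ttfb=16.455      # TTFB measured in the curl test above
awk -v t="$timeout" -v s="$ttfb" 'BEGIN {
  if (s + 0 > t + 0) print "request aborted: TTFB exceeds crawler timeout";
  else print "within timeout";
}'
```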
+ | |||
+ | ===== Solution: ===== | ||
+ | |||
+ | We now need to figure out where the timeout is set by ''grep'' | ||
+ | |||
+ | <code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php" | ||
+ | includes/litespeed-cache-crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ; | ||
+ | inc/crawler.class.php: $response = wp_remote_get( $sitemap, array( 'timeout' => 15 ) ) ; | ||
+ | lib/litespeed/litespeed-crawler.class.php: CURLOPT_CONNECTTIMEOUT => 10, | ||
+ | lib/litespeed/litespeed-crawler.class.php: CURLOPT_TIMEOUT => 10,</code> | ||
+ | |||
+ | So the last result , shows ''curl timeout'' is defined there, open file `litespeed-cache/lib/litespeed/litespeed-crawler.class.php` , around line 561-572 | ||
+ | |||
+ | <code> CURLOPT_RETURNTRANSFER => true, | ||
+ | CURLOPT_HEADER => true, | ||
+ | CURLOPT_CUSTOMREQUEST => 'GET', | ||
+ | CURLOPT_FOLLOWLOCATION => false, | ||
+ | CURLOPT_ENCODING => 'gzip', | ||
+ | CURLOPT_CONNECTTIMEOUT => 10, | ||
+ | CURLOPT_TIMEOUT => 10, | ||
+ | CURLOPT_SSL_VERIFYHOST => 0, | ||
+ | CURLOPT_SSL_VERIFYPEER => false, | ||
+ | CURLOPT_NOBODY => false, | ||
+ | CURLOPT_HTTPHEADER => $this->_curl_headers,</code> | ||
+ | |||
+ | Raise timeout from 10 to higher number , like 30 seconds. | ||
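If you prefer to script the edit, a sed substitution works. The sketch below runs on a scratch copy of the two timeout lines for illustration; the real file path and the ''=> 10,'' pattern are assumptions based on this LSCWP version, so back up the file first, and remember a plugin update will overwrite the change:

```shell
# Demo of the edit on a scratch copy of the two timeout lines.
# (On a real site, point sed at wp-content/plugins/litespeed-cache/lib/litespeed/litespeed-crawler.class.php
#  after taking a backup.)
cat > /tmp/crawler-timeouts.txt <<'EOF'
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 10,
EOF
sed -i 's/=> 10,/=> 30,/' /tmp/crawler-timeouts.txt
cat /tmp/crawler-timeouts.txt
```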
+ | |||
+ | Manual crawling again , all these previous blacklisted URIs are not longer being added into blacklist. | ||
+ | |||
+ | This is a temporarily solution by manually edit the code to raise timeout , default timeout will be changed to 30 seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]] , and will be made configurable in LSCWP 3.0 in case user may need a longer time. | ||