====== LiteSpeed Cache Crawler Blacklist ======

By default, LSCWP's built-in crawler will add a URI to the blacklist if the following conditions are met:

  - The page is not cacheable by design or by default. In other words, any page that sends the response header ''x-litespeed-cache-control: no-cache'' will be added to the blacklist after the initial crawl.
  - The page doesn't respond with one of the following headers: <code>HTTP/1.1 200 OK
HTTP/1.1 201 Created
HTTP/2 200
HTTP/2 201</code>

Knowing these conditions can help us troubleshoot pages that are unexpectedly blacklisted.
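If you want to test a single URI against both conditions outside of the crawler, here is a minimal PHP sketch (this is not the plugin's actual code, and the URL is a placeholder):

<code><?php
// Minimal sketch of the two blacklist conditions described above.
// Not the plugin's logic; the URL is a placeholder.
$ch = curl_init( 'https://example.com/some-page/' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_HEADER, true ); // keep response headers in the output
$response = curl_exec( $ch );
if ( $response === false ) {
	die( 'Request failed: ' . curl_error( $ch ) . "\n" );
}
$headers = substr( $response, 0, curl_getinfo( $ch, CURLINFO_HEADER_SIZE ) );
curl_close( $ch );

// Condition 1: the page must not declare itself uncacheable.
$no_cache = stripos( $headers, 'x-litespeed-cache-control: no-cache' ) !== false;

// Condition 2: the status line must be one of the four accepted ones.
$status_ok = (bool) preg_match( '#^HTTP/(1\.1|2) 20[01]#', $headers );

if ( $no_cache || ! $status_ok ) {
	echo "This URI would be blacklisted by the crawler.\n";
}</code>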
===== Problem =====

Particular pages are being added to the blacklist after the first crawl, but when you check manually (through the browser or through curl) you see the ''x-litespeed-cache'' header and a ''200 OK'' status code. So, why are the URIs ending up in the blacklist?
===== Investigation =====

Upon checking the debug log, we find that the response header was never logged. To find out why, we need to make a modification to the crawler class.

Open the following file: ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php''

Add the following code at line 273, to allow us to log more information:

<code> LiteSpeed_Cache_Log::debug( 'crawler logs headers', $headers ) ; </code>

Now, when the crawler processes a URI, the ''$headers'' will be written to the debug log.

Run the crawler manually, and check ''grep headers /path/to/wordpress/wp-content/debug.log''. You should see something like this:

{{:litespeed_wiki:cache:lscwp:lscwp-crawler-debug1.jpg|}}

So here is the problem: most of the log entries show the header ''HTTP/1.1 200 OK'', but a few of them are empty. It's the empty ones that are being added to the blacklist.
But why, if you manually run a curl, does it look fine?

<code>[root@test ~]# curl -I -XGET https://example.com/product-name-1
HTTP/1.1 200 OK
...
x-litespeed-cache-control: public
...
CF-RAY: 5f5db4fd1c234c56-AMS</code>
This URI returns ''200'' and ''x-litespeed-cache-control: public'', so why is the header empty in the previous debugging process?

To figure it out, we can mimic the exact options the PHP curl used, and see what's going on.

To grab the curl options the crawler used, add another debug call to ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' at line 627, directly before ''return $options ;'', like so:

<code> $options[ CURLOPT_COOKIE ] = implode( '; ', $cookies ) ;

 LiteSpeed_Cache_Log::debug( 'crawler logs headers2', $options ) ;

 return $options ;</code>
Now, manually crawl again to get all of the options:

<code>07/11/19 14:20:15.374 [123.123.123.123:37386 1 ZWh] crawler logs headers2 --- '{
...
}'</code>
The numbers you see are PHP curl option reference codes. An internet search reveals that ''78'' and ''13'' are particularly interesting: they represent the curl connection timeout and the curl timeout, respectively.
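You can also map these numeric codes back to constant names yourself with a short PHP snippet (''get_defined_constants()'' groups constants by extension):

<code><?php
// Look up which CURLOPT_* constants have the values 78 and 13.
$curl_constants = get_defined_constants( true )['curl'];

foreach ( array( 78, 13 ) as $code ) {
	foreach ( $curl_constants as $name => $value ) {
		if ( $value === $code && strpos( $name, 'CURLOPT_' ) === 0 ) {
			echo "$code => $name\n";
		}
	}
}
// Prints:
// 78 => CURLOPT_CONNECTTIMEOUT
// 13 => CURLOPT_TIMEOUT</code>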
Let's apply these options to our curl command:
<code>[root@test ~]# curl -I -XGET --max-time 10 https://example.com/product-name-1
curl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received</code>
So this confirms that a timeout is the root cause of the problem. Without cache, the page takes more than ten seconds to load.

Let's do one more test to confirm it:
<code>[root@test ~]# curl -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' -XGET -A "lscache_runner" https://example.com/product-name-1/
...
Total: 16.462s</code>
So yes: the page without cache takes more than 16 seconds to load, which results in a curl timeout. That is why the debug log shows an empty header: the ''200'' status is never received by the crawler, and the URI is blacklisted.
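This also explains why the headers were empty rather than showing an error status: when PHP's curl hits its timeout, ''curl_exec()'' simply returns ''false'', so there is no status line or header to record at all. A small sketch of that failure mode (placeholder URL, same 10-second limit the crawler used):

<code><?php
// Reproduce the crawler's failure mode: a slow page plus a 10-second limit.
$ch = curl_init( 'https://example.com/product-name-1/' ); // placeholder URL
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_HEADER, true );
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 ); // CURLOPT_TIMEOUT is option 13

$response = curl_exec( $ch );

if ( $response === false ) {
	// Timed out: no status line, no headers -- exactly the empty
	// entries we saw in debug.log.
	echo 'curl error: ' . curl_error( $ch ) . "\n";
}
curl_close( $ch );</code>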
===== Solution =====

We need to determine where the timeout is set, and increase it. Use ''grep'':
<code>[root@test litespeed-cache]# grep -riF "timeout" --include="*crawler*.php"
...
lib/litespeed/litespeed-crawler.class.php:			CURLOPT_TIMEOUT => 10,</code>
The last result shows where the curl timeout is defined. Open ''litespeed-cache/lib/litespeed/litespeed-crawler.class.php'' and, somewhere around lines 561-572, raise the timeout from ''10'' to something higher, like ''30''. The options array looks like this:
<code> CURLOPT_RETURNTRANSFER => true,
 ...
 CURLOPT_TIMEOUT => 10,
 ...
 CURLOPT_HTTPHEADER => $this->_curl_headers,</code>
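After the edit, the timeout entry should read like this (only a fragment of the array above; the other entries stay unchanged):

<code> CURLOPT_TIMEOUT => 30,</code>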
Crawl manually again, and you will see that the previously blacklisted URIs are no longer being added to the blacklist.
**NOTE**: This is a //temporary solution//: manually editing the code to raise the timeout. The crawler's default timeout will be changed to ''30'' seconds in LSCWP [[https://github.com/litespeedtech/lscache_wp/commit/64e7f2af39e57ed3481cae934270cf24f4695ba8#commitcomment-34272438|2.9.8.4]], and it will be made configurable in LSCWP 3.0, in case you need an even longer time.