Magic curl option for fast recache and small lscache size

serpent_driver

Well-Known Member
#23
Hey, come on! :rolleyes: This is neither magic nor special, it is just the default! If you were ready to learn and listen, you would have real magic stuff. Your "magic" crawler script is a turtle!
 

serpent_driver

Well-Known Member
#29
After OpenCart brought me to the brink of despair, here comes a quick-and-dirty but very fast method to warm up your cache, and not only for OpenCart. This method is for CLI use only. I also have a PHP version, but that one is not free. The default way of making requests is too slow for a really fast warmup: it works serially, meaning one request after the other. curl, however, also supports parallel requests and can run 10,000 of them and more at the same time. For warming the cache we don't need such a high number, because too many requests cause too much server load; 3 to 5 is a good number. For your information: the LSCache plugin for WordPress also works this way, but has a bad configuration that makes the warmup slow again....
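The serial-versus-parallel difference can be sketched with curl itself. The file:// URLs below are stand-ins so the demo runs without a network; a real warmup would list the shop's https:// pages instead (and needs curl 7.66.0 or newer for --parallel):

```shell
# Create two local "pages" so the demo needs no network.
printf 'page one\n' > /tmp/demo1.txt
printf 'page two\n' > /tmp/demo2.txt

# Serial (default): the second transfer starts only after the first finishes.
curl -s file:///tmp/demo1.txt file:///tmp/demo2.txt

# Parallel: both transfers run at the same time (curl 7.66.0+).
curl -s --parallel --parallel-max 2 file:///tmp/demo1.txt file:///tmp/demo2.txt
```

With two tiny local files the timing difference is invisible, but against a few thousand real URLs the parallel form is what makes the warmup fast.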

The problem with this parallel method in the CLI is that it requires curl 7.66.0 or higher, and many hostings don't have that version installed. The alternative is to run curl from your local computer. curl offers a Windows build that is easy to install; it works just like the server version, from the Windows command shell, and uses the same commands.

To get this method working we need a list of URLs in a specific format. We take this list from OpenCart's sitemap.xml function, but that requires an extension that generates sitemaps.

How to Do:

1.) Create a directory anywhere on your server. It does not have to be inside the OpenCart directory, but it must be reachable with a browser.
2.) Create a blank PHP file, place it in this directory and copy the code below into it.

Code:
<?php
header("Content-Type: text/plain");

// Sitemap index of the shop; it links to the individual sub-sitemaps.
$sitemap = 'https://www.priazha-shop.com/sitemap.xml';

$content = file_get_contents($sitemap);
$xml = simplexml_load_string($content);

// Loop over every sub-sitemap listed in the index ...
foreach ($xml->sitemap as $urlsElement) {

    $urls = $urlsElement->loc;
    $sitemaps = file_get_contents($urls);
    $xmls = simplexml_load_string($sitemaps);

    // ... and append every page URL in curl's --config syntax ("url = ...").
    foreach ($xmls->url as $urlElement) {
        $url = $urlElement->loc;
        file_put_contents('sitemap.txt', 'url = ' . $url . "\n", FILE_APPEND);
    }
}
3.) Run this file in the browser.
4.) The script above generates a formatted .txt file with all URLs from the sitemap index.
5.) Download this file.
6.) Run the curl command below from the directory where this .txt file is located. I've named the file sitemap.txt.

Code:
curl --parallel --parallel-immediate --parallel-max 3 --connect-timeout 5 --http1.1 -k -I -s -X GET -H "User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0" -H "Accept-Encoding: gzip, deflate, br" --config sitemap.txt
Parameters:
--parallel-max 3 // 3 parallel requests; do not set this higher than 5, or the load gets too high!
Set the User-Agent to whatever you like.
Add further custom headers, each prefixed with "-H".
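As a side note, the lines in sitemap.txt are simply curl's --config syntax (`url = ...`, one per line). So if you already have a plain list of URLs from some other source, sed alone can produce the config file; the file names and URLs below are illustrative:

```shell
# A plain URL list, one URL per line (illustrative URLs).
printf '%s\n' 'https://www.example.com/' 'https://www.example.com/category' > urls.txt

# Prefix every line with "url = " to get curl's --config format.
sed 's|^|url = |' urls.txt > sitemap.txt

cat sitemap.txt
```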

That's it! Enjoy

##############################################
Additional information:

curl for Windows can be downloaded here:
https://curl.se/windows/

To check your curl version, run this command in the CLI:

Code:
curl -V
Parallel support requires curl 7.66.0 or higher. Otherwise, use the version for Windows.
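If you want to check this from a script, the version number can be parsed out of the first line of `curl -V`. The sample line below is hard-coded so the check is reproducible; replace it with `line=$(curl -V | head -n 1)` for a live check:

```shell
# First line of `curl -V` output (hard-coded sample; see note above).
line='curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f'

# Second whitespace-separated field is the version number.
ver=$(printf '%s\n' "$line" | awk '{print $2}')
major=${ver%%.*}
rest=${ver#*.}
minor=${rest%%.*}

# --parallel was introduced in curl 7.66.0.
if [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 66 ]; }; then
    echo "--parallel supported ($ver)"
else
    echo "--parallel not supported ($ver)"
fi
```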
 

AndreyPopov

Well-Known Member
#36
Why didn't you add this parameter?

Code:
curl_setopt($ch, CURLOPT_NOBODY, true);

A question to @serpent_driver about CURLOPT_NOBODY:

CURLOPT_NOBODY — true to exclude the body from the output. The request method is then set to HEAD. Changing this to false does not change it to GET.

Why do you send method HEAD, but later set it to GET?
Code:
curl --parallel --parallel-immediate --parallel-max 3 --connect-timeout 5 --http1.1 -k -I -s -X GET
 

serpent_driver

Well-Known Member
#37
This is not a HEAD request. HEAD means HTTP headers only and no message body, but the message body is not the HTML body; that is a big difference. A HEAD-only request prevents the page from being cached.
 

serpent_driver

Well-Known Member
#39
Again, the request method has nothing to do with CURLOPT_NOBODY. This curl option affects whether the message body is returned: depending on its value, the output of the request is returned or not. For caching a page you don't need that returned output, and that's why it is better to set CURLOPT_NOBODY to true.

If you still want answers regarding the request method and curl options, please read the curl documentation.
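For the CLI variant discussed in this thread, one way to send a real GET while still throwing the body away on the client side is `-o /dev/null` (instead of `-I`, which switches the method to HEAD). The file:// URL below is a stand-in so the example runs offline; a warmup run would use the page URLs from sitemap.txt:

```shell
# Local stand-in for a page (no network needed).
printf 'full page body\n' > /tmp/warm_page

# GET the page, but discard the body client-side. The server (here, the
# filesystem) still produces the complete body, which is what warms a cache;
# nothing is printed because the body goes to /dev/null.
curl -s -o /dev/null file:///tmp/warm_page
```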
 