Magic curl option for fast recache and small lscache size

AndreyPopov

Well-Known Member
#1
In a long and hard discussion with serpent_driver, while researching the parameters of his "superfast recaching method", I found a really "magic" option for curl in the crawler algorithm.

PHP:
curl_setopt($ch, CURLOPT_ENCODING, "");
must be added in
catalog/controller/extension/module/lscache.php
after this:
PHP:
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
After adding this option:
- recache speed increases up to 3×
- the size of the already generated cache decreases up to 10×
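
For context, a minimal sketch of how the two options sit together in the crawler's curl handle setup; the surrounding lines are assumed for illustration, not copied from the module:
PHP:
// minimal sketch (assumed context, not the full module code)
$url = 'https://example.com/';                                 // placeholder URL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1); // line already present in lscache.php
curl_setopt($ch, CURLOPT_ENCODING, "");                        // the added option: "" advertises all encodings curl supports
curl_exec($ch);
curl_close($ch);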

After the lscache header, the cache file no longer contains the full HTML code of the cached page, like:
HTML:
<!DOCTYPE html> <html dir="ltr" lang="ru" class="desktop win chrome chrome101 webkit oc30 is-guest route-common-home store-0 skin-1 desktop-header-active mobile-sticky layout-1 wf-vollkorn-n7-active wf-vollkorn-n4-active wf-hindmadurai-n7-active wf-hindmadurai-n4-active wf-active flexbox no-touchevents" data-jb="14218c54" data-jv="3.1.8" data-ov="3.0.3.1" style=""><head typeof="og:website"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Пряжа высокого качества Турции, Италии в широчайшем ассортименте - интернет-магазин "Пряжа-shop"</title><base href="https://www.priazha-shop.com/">
..............
<span class="links-text">Copyright © 2019-2022, Priazha-shop, All Rights Reserved</span>
</a></li></ul></div></div></div></div></div></div></div></footer></div><div class="notification-wrapper notification-wrapper-bottom"><div class="module module-notification module-notification-137 notification" data-options="{&quot;position&quot;:null,&quot;title&quot;:&quot;&quot;,&quot;cookie&quot;:&quot;e0d3c1b5&quot;}">
<button class="btn notification-close">OK</button><div class="notification-content"><div><div class="notification-title"></div><div class="notification-text">Этот сайт использует файлы cookies.</div></div></div></div></div> <script src="https://static.priazha-shop.com/catalog/view/theme/journal3/assets/0d6cd2a9a1e51254b9f0317c0826cc05.js?v=14218c54"></script> <div class="scroll-top">
<i class="fa fa-angle-up"></i></div></body></html>
Now:
[attached screenshot: lscache_encoding.jpg]

but lscache works!!!



P.S. The developers have added this option to the plugin!!!!
https://github.com/litespeedtech/lscache-opencart/commit/d7a085e56132308eec522d9b9d332e163163b9b2
 

serpent_driver

Well-Known Member
#2
Code:
curl_setopt($ch, CURLOPT_ENCODING, "");
This means that neither the traffic nor the cache will be compressed, and if there is no compression, much more disk space is needed => a logical result. A further result is that LSWS or OLS creates compressed copies of the cache files the next time a user requests an already cached URL, because every browser supports Accept-Encoding with gzip, deflate or br. An empty CURLOPT_ENCODING curl parameter means that the encoding is "identity", so it is not a good idea to use this code. This code doesn't come from me.

To make curl requests really faster without blowing up disk space, other parameters have to be used! This code is anything but magic, and just nonsense again.

Code:
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
This parameter is obsolete. curl always uses HTTP/1.1 by design if no other protocol is set. There is neither an advantage nor a disadvantage to setting this protocol parameter.
 

AndreyPopov

Well-Known Member
#3
An empty CURLOPT_ENCODING curl parameter means that the encoding is "identity", so it is not a good idea to use this code. This code doesn't come from me.
PHP: curl_setopt - Manual
If an empty string, "", is set, a header containing all supported encoding types is sent.

This code doesn't come from me.
it was your idea to use ENCODING as a curl request parameter :)


This parameter is obsolete. curl always uses HTTP/1.1 by design
Say that to the developer. I quoted the original line from the Git repo.


If you had read my post carefully, you would know that I changed it to CURL_HTTP_VERSION_2.
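
Anyone can verify what curl really sends; a minimal sketch (example.com is just a placeholder URL):
PHP:
// CURLOPT_VERBOSE prints the outgoing request headers – including the
// generated Accept-Encoding line – to STDERR
$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "");  // empty string: curl builds the header itself
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_exec($ch);
curl_close($ch);
// the verbose output contains a line like "Accept-Encoding: deflate, gzip, br",
// depending on what the local curl build supports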
 

serpent_driver

Well-Known Member
#4
Nonsense! If the client doesn't send any information about whether it supports compression, how should the server know if it can compress the traffic? Why are the cache file sizes so big? Right, because the server hasn't compressed them, due to the unknown Accept-Encoding parameter.

HTTP/2 is only used by curl if the connection is unsecured, without HTTPS. If it is secured by HTTPS, curl always uses HTTP/1.1:
https://everything.curl.dev/http/versions
 

serpent_driver

Well-Known Member
#5
@AndreyPopov
As long as you don't respect people who have far more experience and knowledge about curl and other LiteSpeed related stuff, you will never really improve anything. Your currently published code for improving crawling and caching is bullshit. You do a lot, but when it works you don't know why it works, and as long as you think in only one dimension you will never learn how to really improve crawling and warm up the cache.
 

AndreyPopov

Well-Known Member
#6
Nonsense! If the client doesn't send any information
curl is the client, and it adds the encodings it supports.
For example, there is no sense in adding brotli (br) to the header if the curl version is lower than 7.57,
and that's why the empty string "" is the best way to advertise the encodings supported by the current curl version.
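
Whether the current build supports brotli is easy to check at runtime; a small sketch:
PHP:
// check the libcurl version and brotli support of the current build
$v = curl_version();
echo 'libcurl ', $v['version'], PHP_EOL;
// CURL_VERSION_BROTLI is defined since PHP 7.3 (requires libcurl >= 7.57)
if (defined('CURL_VERSION_BROTLI') && ($v['features'] & CURL_VERSION_BROTLI)) {
    echo 'brotli (br) is supported', PHP_EOL;
}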

open your eyes and free your mind ;)
 

Lauren

LiteSpeed Staff
Staff member
#8
Andrey,
1. The crawler should only use HTTP/1.1; it's faster than HTTP/2 here, since a single request involves no multiplexing of multiple URLs, and the extra HTTP/2 processing overhead for just one request is a waste of resources and slower.
2. For the accepted encoding gzip, we'll add it to those of our plugins that don't have it yet. Our WP plugin and the LiteMage/PrestaShop crawler scripts already have it. We should warm up the gzip copy in LScache.

Thanks for your suggestions.
Lauren
 

serpent_driver

Well-Known Member
#10
If the client (curl) sends an "Accept-Encoding: gzip,deflate,br" request header, LScache will compress the cache files, but this compression is different from the compression used for data transfer, so it doesn't matter whether you use gzip, deflate or br. Cache file compression is always the same, but if you leave the Accept-Encoding header empty or set it to "identity", the cache files and the data transfer are always uncompressed. That blows up disk space, and more data has to be transferred because it is not compressed. So use Accept-Encoding: gzip,deflate,br and nothing goes wrong.
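
In curl terms that recommendation would look like this sketch, using the CURLOPT_ENCODING name from earlier in the thread (it assumes the local curl build can decode all three encodings):
PHP:
// advertise gzip, deflate and br explicitly instead of the empty-string form
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate,br");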
 

AndreyPopov

Well-Known Member
#11
but if you leave the Accept-Encoding header empty
You only ever hear yourself!!!! :(

Again and again:
for CURLOPT_ENCODING:
"If an empty string, "", is set, a header containing all supported encoding types is sent."

You do NOT want to understand that curl sends the header "Accept-Encoding: <list of all encoding types supported by the current curl version>".


but this compression is different from the compression used for data transfer, so it doesn't matter whether you use gzip, deflate or br.
It does matter, because a client requesting a page from the cache gets it in gzip encoding (as happens for me, after using CURLOPT_ENCODING for the crawler),
while without lscache the page comes with br encoding.
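
Which encoding is actually served can be read from the response headers; a minimal sketch with a placeholder URL:
PHP:
// capture the response headers to inspect the Content-Encoding the server used;
// the body is still decoded automatically because CURLOPT_ENCODING is set
$headers = [];
$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$headers) {
    $headers[] = trim($line);
    return strlen($line);  // curl expects the number of bytes handled
});
curl_exec($ch);
curl_close($ch);
print_r(preg_grep('/^Content-Encoding:/i', $headers));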
 

serpent_driver

Well-Known Member
#12
So, Mr. Super Warmup Cache Crawler Master,

below you will see what is wrong with your code and your super deluxe recache warmup script.

The first code is the default code taken from the lscache module for OpenCart, but modified by Mr. Super Warmup Cache Crawler Master with the Accept-Encoding header. $url is based on your crawler script, which generates particular URLs depending on the "what" query. Paste this code into a test PHP file so you can verify on your own what is wrong. Important to know: all generated URLs are "non-SEO URLs"!!!!

Code:
<?php

header("Content-Type: text/plain");

$url = 'https://www.priazha-shop.com/index.php?route=product/product&product_id=6273';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
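// empty string: send an Accept-Encoding header listing every encoding this curl build supports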
curl_setopt($ch, CURLOPT_ACCEPT_ENCODING, "");
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);

curl_exec($ch);

$sizeDownload = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
$sizeDownload = 'Download Size: ' . round($sizeDownload / 1024) . ' kb' . "\n";
$totalTime = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
$totalTime = 'Total Time: ' . $totalTime . ' seconds + time for redirection. Total transaction time in seconds for last transfer';

echo $sizeDownload;
echo $totalTime;

curl_close($ch);

The next code was modified by me. It is nothing special and is also used in some other LScache plugins.

Code:
<?php

header("Content-Type: text/plain");

$url = 'https://www.priazha-shop.com/index.php?route=product/product&product_id=6273';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
curl_setopt($ch, CURLOPT_ACCEPT_ENCODING, "");
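// NOBODY: make a HEAD-style request – only headers, no response body is downloaded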
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);

curl_exec($ch);

$sizeDownload = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
$sizeDownload = 'Download Size: ' . round($sizeDownload / 1024) . ' kb' . "\n";
$totalTime = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
$totalTime = 'Total Time: ' . $totalTime . ' seconds + time for redirection. Total transaction time in seconds for last transfer';

echo $sizeDownload;
echo $totalTime;

curl_close($ch);
The last code really shows what is wrong with your crawler script. All URLs you generate are non-SEO URLs, so every curl request must first follow a redirect to the SEO URL, and that costs a huge amount of time. If you run the code below you will see the difference in total transfer time and in the size of the downloaded document.

Code:
<?php

header("Content-Type: text/plain");

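// the SEO URL is requested directly, so no redirect from the non-SEO route is needed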
$url = 'https://www.priazha-shop.com/alize-angora-gold-batik-6273';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
curl_setopt($ch, CURLOPT_ACCEPT_ENCODING, "");
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);

curl_exec($ch);

$sizeDownload = curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD);
$sizeDownload = 'Download Size: ' . round($sizeDownload / 1024) . ' kb. WOW! Nothing to Download!' . "\n";
$totalTime = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
$totalTime = 'Total Time: ' . $totalTime . ' seconds. NO wasting time for redirection. Total transaction time in seconds for last transfer';

echo $sizeDownload;
echo $totalTime;

curl_close($ch);
The result of this test is that you can put your code in the trash. Forget about splitting off the shop URLs that you want to recache. There is no "recache", because you can't check the cache status without changing that status. You can't tell the server: "Hi server, don't cache my request, I only want to check the cache status!". A request is always a request and nothing else! Use the SEO URLs from sitemap.xml, or log the URLs that users actually request, or find out how OpenCart generates the SEO URLs for sitemap.xml.

AND FYI: the final booster is missing from all the code examples. With this booster you could speed up crawling to 100,000 URLs per hour, even on shared hosting!
 

AndreyPopov

Well-Known Member
#14
Important to know: all generated URLs are "non-SEO URLs"!!!!
All URLs you generate are non-SEO URLs
Are you sure?!?!?!?!?!?!

Open your eyes and free your mind!!!!!!!!! Again, again and again!

All links are converted to SEO format beforehand by the function
$this->url->link
before being sent to $this->crawlUrls as the $urls array.
 

serpent_driver

Well-Known Member
#15
I can't be sure. I have only seen your code. Maybe I am wrong, or I overlooked the SEO part. Nevertheless, if there are SEO URLs, add the single curl parameter CURLOPT_NOBODY to reduce request times.
 

AndreyPopov

Well-Known Member
#18
As I wrote above, the links in the $urls array for the crawlUrls function are generated by the function $this->url->link.

If SEO is enabled in the OpenCart settings, the links are produced in SEO format according to the SEO rules.
If SEO is disabled in the OpenCart settings, the links stay in the original OpenCart format.
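
For illustration, a sketch of how $this->url->link is typically called in OpenCart (the exact arguments depend on the route):
PHP:
// the URL helper builds the full link; with SEO URLs enabled, the seo_url
// controller rewrites the route into its SEO keyword form
$url = $this->url->link('product/product', 'product_id=41');
// SEO enabled:  https://<site_name>/<seo_keyword>
// SEO disabled: https://<site_name>/index.php?route=product/product&product_id=41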
 

serpent_driver

Well-Known Member
#19
If SEO is enabled in the OpenCart settings, the links are produced in SEO format according to the SEO rules.
Yes, but I think that only means a request to a non-SEO URL gets redirected to the SEO URL. In your case I would at least check the access_log to verify that there is no redirection.
 

AndreyPopov

Well-Known Member
#20
You still don't understand what happens :(

- the recache function builds the list of links from the DB in native OpenCart format:
product/product&product_id=41
product/product&path=20_27&product_id=41
product/product&manufacturer_id=8&product_id=41

- but before sending them for crawling, it converts each one to a full URL via $this->url->link:
a) https://<site_name>/<seo_path> if SEO enabled
b) https://<site_name>/index.php?route=product/product&product_id=41 if SEO disabled

- it sends the full URL for crawling

If SEO is enabled, the crawler ALWAYS requests (and caches) the full SEO URL.
 