Advanced crawler for recache: some ideas and code

AndreyPopov

#41
LSCache Advanced Crawler for Opencart 3.x ocmod

(for https://github.com/litespeedtech/ls.../package/lscache-opencart3.0-latest.ocmod.zip from 18th June 2021)

!!!ATTENTION!!!!!
does not work as expected if the session driver is file
!!!ATTENTION!!!!!

with the session driver set to file you must:
- click Rebuild Part of LiteSpeed Cache once and wait until the file cache expires (this pass checks for the settings and creates them)
- click Rebuild Part of LiteSpeed Cache again, and again wait until the file cache expires (this pass builds the URL list and saves it to the DB)
- only after that does clicking Rebuild Part of LiteSpeed Cache start the recache process.

on my host I use the Redis driver for both cache and sessions, and everything works perfectly.
 


AndreyPopov

#43
OK, what's a session driver?
You must change it by hand in:
/system/config/default.php
/system/config/admin.php
/system/config/catalog.php

In OpenCart 2.x the default session driver is db. Why OpenCart 3.x defaults to file, I don't know; nor do I understand why it doesn't work correctly with the file session driver.
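
For reference, the relevant line looks like this; a minimal sketch, assuming the stock OpenCart 3.x config layout (the key name may differ between versions):
PHP:
// /system/config/default.php (the same key appears in admin.php and catalog.php)
$_['session_engine'] = 'db'; // default is 'file'; 'db' (or 'redis', where available) works with the crawler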

I added this code:
PHP:
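        // delete the crawler's cached lists so the next run rebuilds them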
        $this->cache->delete('lscache_pages');
        $this->cache->delete('lscache_modules');
        $this->cache->delete('lscache_esi_modules');
        $this->cache->delete('lscache_itemes');
but it doesn't work with the file session driver :(
 

Lee

#44
Is this what you're talking about?
Code:
// Cache
$_['cache_engine'] = 'memcached'; // apc, file, mem or memcached
$_['cache_expire'] = 3600;
 
#48
Hello,
I have a problem with the recache process. I use version 2.3.0.2 with the Journal 3 theme on both e-shops, but the runs can't be completed and they stop, because I have 40,000 products. Also, despite the fact that I have 384 categories, it completes only 84 of them and then moves on to updating the products. Could you make an updated version for 2.3.0.2 that covers all of them and stores the URL list in the database? Right now it runs via PHP with the Web Host Elite package activated, but the sites become slow. Moreover, I had to deactivate another app to make LiteSpeed work.
Thank you.
 

serpent_driver

#50
In reply to #48:

Try this:

Code:
SetEnvIf Request_URI "^\/path\/to\/crawler\.php" noabort noconntimeout

For me, on my test shared-hosting server, this works and is supported by LiteSpeed Web Server: the PHP script is never killed and can run for hours. But handle it with care if you don't have permission to control running processes on your server, because then you can't stop the script! With this directive the script will only stop on session timeout, if the OpenCart crawler is bound to the session.
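
The same effect can also be approximated from inside the script itself; a minimal sketch, assuming you control the crawler's entry point:
PHP:
// keep the crawler alive even if the HTTP client disconnects
ignore_user_abort(true); // don't stop when the browser or curl aborts
set_time_limit(0);       // lift PHP's execution time limit (the host may still enforce its own)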
 

AndreyPopov

#51
temporarily posting the whole plugin (not extension mode)

latest added features:

- for CLI:
1. &startnum= and &endnum=
can be used separately or combined (a sketch of the slicing logic follows this list)

startnum - the position in the built DB URL list to start from
endnum - the position in the built DB URL list to stop at

for example:
curl -N "https://www.yourdomain.com/index.ph...e&from=cli&what=1&startnum=38760&endnum=40800"


2. &mode=restart
same as pressing Save in the GUI: clears the already-recached parameters and builds a new URL list

3. some others, like &botsua= and &renew=, to be described later

- if SEO is enabled, the algorithm checks for identical URLs and excludes duplicates
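
A minimal sketch of how the &startnum=/&endnum= slicing could work, assuming the URL list has already been loaded from the DB into an array; the variable names are illustrative, not the plugin's actual code:
PHP:
        // slice the stored URL list by &startnum= / &endnum= (hypothetical names)
        $start = isset($this->request->get['startnum']) ? (int)$this->request->get['startnum'] : 0;
        $end   = isset($this->request->get['endnum'])   ? (int)$this->request->get['endnum']   : count($urls);
        // recache only entries [$start, $end) of the full list
        $urls = array_slice($urls, $start, max(0, $end - $start));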
 



AndreyPopov

#52
- use more _lscache_vary cookies
- use curl_multi calls
- mod for Journal3 webp detection

rewrite rules for .htaccess
Code:
### LITESPEED_CACHE_START - Do not remove this line
<IfModule LiteSpeed>
CacheLookup on
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "bot|compatible|images|cfnetwork|favicon|facebook|crawler|spider|addthis" [NC]
RewriteCond %{HTTP_USER_AGENT} !Chrome [NC]
RewriteCond %{HTTP_USER_AGENT} !Mobile [NC]
RewriteRule .* - [E=Cache-Control:vary=isBot]
RewriteCond %{HTTP_USER_AGENT} Bot [NC]
RewriteCond %{HTTP_USER_AGENT} Mobile [NC]
RewriteCond %{HTTP_USER_AGENT} !Chrome [NC]
RewriteRule .* - [E=Cache-Control:vary=ismobilebot]
RewriteCond %{HTTP_USER_AGENT} "Android|iPhone|iPad" [NC]
RewriteRule .* - [E=Cache-Control:vary=ismobile]
</IfModule>
### LITESPEED_CACHE_END
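
Since the post mentions curl_multi: a minimal sketch of warming a batch of URLs in parallel with plain PHP curl; the function name and parameters are illustrative, not the plugin's actual API:

PHP:
// fetch a batch of URLs in parallel so several pages get cached per pass
function crawl_batch(array $urls, string $userAgent): void {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,       // fetch the page, discard the body
            CURLOPT_USERAGENT      => $userAgent, // e.g. a bot UA to warm the isBot copy
            CURLOPT_FOLLOWLOCATION => true,
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for network activity instead of busy-looping
    } while ($running > 0);
    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}

Each vary value set by the rules above (isBot, ismobilebot, ismobile) gets its own cache copy, so the crawler has to request every URL once per variant, e.g. with a desktop, a mobile, and a bot User-Agent.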
 


serpent_driver

#53
In reply to #52:
Far too error-prone, and therefore not usable.
 

serpent_driver

#55
These rules can't work, and certainly not perfectly. There are hundreds of bots that you can't catch with such primitive rules. Relying on the User-Agent is highly unsafe.
 

serpent_driver

#57
Santa Claus never dies!
Let me be your Santa Claus so you finally believe in me again. :)

You should really consider that detection based exclusively on the User-Agent cannot be a viable solution, and you know it. Leaving aside how insecure the User-Agent is, searching through a long User-Agent string also takes far too long and wastes resources.

Try this. This solution is 99.99% reliable. How do I know? For about three years I have been tracking not only the users of my pages but also bots. That allows me, unlike you, to validate my code.

Code:
RewriteCond %{HTTP_USER_AGENT} "!safari|MAC OS X|android|opera|SamsungBrowser" [NC]
RewriteCond %{HTTP_USER_AGENT} "!chrome" [NC]
RewriteCond %{HTTP:Sec-Fetch-Dest} ^$ [NC]
RewriteCond %{HTTP:Sec-Fetch-Mode} ^$ [NC]
RewriteCond %{HTTP:Sec-Fetch-Site} ^$ [NC]
RewriteCond %{HTTP:Sec-Fetch-User} ^$ [NC]
RewriteRule .* - [E=Cache-Control:vary=isBot]

RewriteCond %{HTTP_USER_AGENT} "Applebot" [NC]
RewriteRule .* - [E=Cache-Control:vary=isBot]
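
For comparison, the same idea expressed in PHP; a hedged sketch of the detection logic, not serpent_driver's actual code:

PHP:
function looks_like_bot(): bool {
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    // Applebot advertises Safari in its UA, so it is matched explicitly
    if (stripos($ua, 'Applebot') !== false) {
        return true;
    }
    // real browsers send Sec-Fetch-* headers on navigation; most bots don't
    foreach (['HTTP_SEC_FETCH_DEST', 'HTTP_SEC_FETCH_MODE',
              'HTTP_SEC_FETCH_SITE', 'HTTP_SEC_FETCH_USER'] as $header) {
        if (!empty($_SERVER[$header])) {
            return false;
        }
    }
    // no Sec-Fetch headers and no mainstream browser token -> treat as a bot
    return !preg_match('/safari|mac os x|android|opera|samsungbrowser|chrome/i', $ua);
}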
 