LSCache internal crawler (recache) and SEO URLs question

AndreyPopov

Well-Known Member
#1
OpenCart has the following route paths for each product:

/index.php?route=product/product&path=24&product_id=28
/index.php?route=product/product&manufacturer_id=5&product_id=28
/index.php?route=product/product&product_id=28

By default, the internal crawler (when doing a recache) calls:
/index.php?route=product/product&path=24&product_id=28 and /index.php?route=product/product&product_id=28
(Why it does not call /index.php?route=product/product&manufacturer_id=5&product_id=28 does not matter here; that is a separate question.)

I enabled SEO URLs and use an SEO extension with rewrite conditions that map every path to one short URL:

/index.php?route=product/product&path=24&product_id=28 => /product_28_seo_name
/index.php?route=product/product&manufacturer_id=5&product_id=28 => /product_28_seo_name
/index.php?route=product/product&product_id=28 => /product_28_seo_name


Question:
Which route path do I need to recache if I use short SEO URLs?

Is it enough to recache only /index.php?route=product/product&product_id=28?

Or do I need to recache all possible paths?
 

serpent_driver

Well-Known Member
#2
Lscrawler uses cURL for cache warmup and has the "-L" parameter set. That means that if there is a redirection, it follows it and caches the redirect target, but not the URL that redirected. Independently of this behaviour, if URLs have more than 1 GET Parameter caching could fail, because this is unsupported in some older cURL versions. Try it out.
 

AndreyPopov

Well-Known Member
#3
Lscrawler uses cURL for cache warmup
I know

if URLs have more than 1 GET Parameter caching could fail
What does "more than 1 GET Parameter" mean? Any example?


-------------------------
The recache crawl algorithm currently builds the URL array like this:
PHP:
        foreach ($this->model_catalog_product->getProducts() as $result) {
            foreach ($this->model_catalog_product->getCategories($result['product_id']) as $category) {
                if(isset( $categoryPath[$category['category_id']] )){
                    $urls[] = $this->url->link('product/product', 'path=' . $categoryPath[$category['category_id']] . '&product_id=' . $result['product_id']);
                }
            }

            $urls[] = $this->url->link('product/product', 'product_id=' . $result['product_id']);
        }
The algorithm adds both URLs:
/index.php?route=product/product&path=24&product_id=28
/index.php?route=product/product&product_id=28
and then crawls every URL in the array (so both URLs are crawled).

In the output I see the same URL twice:
4168/13137 https://site-name/product_28_seo_name : 200
4169/13137 https://site-name/product_28_seo_name : 200

Is it necessary to recache both paths? Or, with SEO URLs, does only one path need to be recached?
 

serpent_driver

Well-Known Member
#4
What does "more than 1 GET Parameter" mean? Any example?
/index.php?route=product/product&path=24&product_id=28

1st GET Parameter --> route=product/product
2nd GET Parameter --> path=24
3rd GET Parameter --> product_id=28
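
The same breakdown can be reproduced with PHP's built-in parse_url() and parse_str(). This is only a minimal sketch; the helper name listGetParameters is made up for illustration and is not part of OpenCart or the LSCache extension.

```php
<?php
// Hypothetical helper for illustration; not part of OpenCart or LSCache.
function listGetParameters(string $url): array
{
    // Extract the query string (everything after "?"), if there is one.
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return [];
    }

    // parse_str() decodes "a=1&b=2" into an associative array.
    $params = [];
    parse_str($query, $params);
    return $params;
}

$params = listGetParameters('/index.php?route=product/product&path=24&product_id=28');
foreach ($params as $name => $value) {
    echo $name, ' = ', $value, PHP_EOL;
}
```

A short SEO URL such as /product_28_seo_name has no query string, so the helper returns an empty array for it.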

I can't tell you whether OpenCart's rewriting behaviour causes the crawler to malfunction. I can only tell you that in some cases cURL fails if a URL has more than 1 GET parameter, so it's up to you. Test it and find out if it works. If re-written URLs have cache hit header after crawling everything is okay.
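
A rough way to automate the cache hit check on response headers, as a sketch only. It assumes the server reports cache status in an "X-LiteSpeed-Cache" response header; the function name isCacheHit and the sample headers are made up for illustration.

```php
<?php
// Hypothetical helper for illustration; not part of the LSCache extension.
// Assumption: cache status is reported in an "X-LiteSpeed-Cache" header.
function isCacheHit(string $rawHeaders): bool
{
    $prefix = 'x-litespeed-cache:';
    $hit = false;
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive, so compare with stripos().
        if (stripos($line, $prefix) === 0) {
            $value = strtolower(trim(substr($line, strlen($prefix))));
            // Keep the last occurrence so the final response wins
            // when the header dump contains a redirect chain.
            $hit = strpos($value, 'hit') === 0;
        }
    }
    return $hit;
}

// Made-up sample of raw response headers.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nX-LiteSpeed-Cache: hit\r\n\r\n";
var_dump(isCacheHit($sample));
```

The raw headers could come from any source, for example the output of a cURL request with headers included.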

Is it necessary to recache both paths? Or, with SEO URLs, does only one path need to be recached?
This depends on where crawler fetches URLs from. If they come from sitemap.xml (I think they do) and sitemap.xml already contains the re-written URLs for better SEO, you don't have to worry: the crawler only requests the "beautified" URLs.
 

AndreyPopov

Well-Known Member
#5
1st GET Parameter --> route=product/product
2nd GET Parameter --> path=24
3rd GET Parameter --> product_id=28
Understood, thanks.


If re-written URLs have cache hit header after crawling everything is okay.
All of them are cached and show a hit in the header.




This depends on where crawler fetches URLs from.
The internal crawler builds its own URL list (array) from product_id, category_id, etc.


I only want to understand: is crawling all possible paths necessary, or is it perhaps better for cache performance?
 

serpent_driver

Well-Known Member
#6
I only want to understand: is crawling all possible paths necessary, or is it perhaps better for cache performance?
Why crawl all URLs (paths)? If a URL redirects properly, there is absolutely no need to cache it. Responses with a proper redirect status code (301 or 302) are not cached, so you can't cache such URLs anyway.
 

serpent_driver

Well-Known Member
#8
This is not an algorithm for the crawler. The crawler has no algorithm. It's only a small cURL function that requests the defined URLs. Check the function crawlUrls.
 

AndreyPopov

Well-Known Member
#9
This is not an algorithm for the crawler. The crawler has no algorithm. It's only a small cURL function that requests the defined URLs. Check the function crawlUrls.
crawlUrls contains:

PHP:
foreach ($urls as $url) {
    // ...

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // ...

    $buffer = curl_exec($ch);
    // ...
}
 

AndreyPopov

Well-Known Member
#11
Yes, this is the crawler function, simple and fast, but absolutely no algorithm..... ;)

Learn cURL and build your own crawler. Only 1 line of PHP code is needed.

https://curl.se/
My question is NOT about cURL and NOT about the crawler.

My question is about the list of URLs that are crawled for recache!


Which route path do I need to recache if I use short SEO URLs?

Is it enough to recache only /index.php?route=product/product&product_id=28?

Or do I need to recache all possible paths?
 

serpent_driver

Well-Known Member
#12
I already answered your questions. If a URL has a redirection like

/index.php?route=product/product&product_id=28 to short URL

crawler follows this redirection and caches the redirected URL (short URL), but not /index.php?route=product/product&product_id=28

It doesn't matter whether the crawler does a recache crawl or not and, again, it doesn't make sense to try to cache a redirecting URL, because such URLs can't be cached.

For your information only: a recache crawl is the same as a standard crawl, except that the crawler doesn't fetch the content body; only headers are fetched. This makes recache crawling faster and is the only difference between the two types of crawling.

If this still doesn't answer your question, try it out and see what happens. If you use 3rd party extensions, you can't expect qualified support for them, and it is up to you to find a solution if such an extension doesn't work with the cache.
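
One way to try it out is a small probe that requests a URL, follows redirects the way "curl -L" does, and reports where the request ended up. This is a minimal sketch using PHP's cURL extension, not the extension's own crawlUrls code; the function name probeUrl, the redirect limit and the timeout are assumptions chosen for illustration.

```php
<?php
// Hypothetical probe for illustration; the extension's crawlUrls may differ.
function probeUrl(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => true,  // same idea as "curl -L"
        CURLOPT_MAXREDIRS      => 5,     // arbitrary safety limit
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
        CURLOPT_HEADER         => true,  // include response headers in the result
        CURLOPT_TIMEOUT        => 10,    // arbitrary timeout in seconds
    ]);
    $response = curl_exec($ch);

    // The handle is released automatically when $ch goes out of scope.
    return [
        'ok'        => $response !== false,
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'redirects' => curl_getinfo($ch, CURLINFO_REDIRECT_COUNT),
    ];
}

// Usage from the command line: php probe.php https://site-name/some-url
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    print_r(probeUrl($argv[1]));
}
```

If final_url differs from the requested URL, the request was redirected, and only the final URL is a candidate for caching.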
 

AndreyPopov

Well-Known Member
#13
I already answered your questions. If a URL has a redirection like

/index.php?route=product/product&product_id=28 to short URL

crawler follows this redirection and caches the redirected URL (short URL), but not /index.php?route=product/product&product_id=28
So with short SEO URLs it is enough to recache only one path.
That is what I thought; I just wanted to confirm it.
Thanks.
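
For reference, since the generated links for one product all resolve to the same short URL, the URL array could be de-duplicated before crawling. This is a minimal sketch under that assumption; uniqueCrawlUrls is a hypothetical helper, not part of the LSCache extension, and product_29_seo_name is made-up example data.

```php
<?php
// Hypothetical helper for illustration; not part of the LSCache extension.
function uniqueCrawlUrls(array $urls): array
{
    // array_unique() keeps the first occurrence of each URL;
    // array_values() re-indexes the result from 0.
    return array_values(array_unique($urls));
}

// The first two entries mirror the duplicated crawler output;
// the third entry is made-up example data.
$urls = [
    'https://site-name/product_28_seo_name',
    'https://site-name/product_28_seo_name',
    'https://site-name/product_29_seo_name',
];

$unique = uniqueCrawlUrls($urls);
printf("%d URLs reduced to %d\n", count($urls), count($unique));
```

In the URL-building loop this would be applied to $urls once, after the loop has finished and before the array is handed to the crawler.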
 