LScache crawler improvements

serpent_driver

Well-Known Member
#1
Since moving to LiteSpeed I have always warmed up the cache from a local PC with PowerShell and cURL for Windows, because it is faster than any other method. For customers I am currently evaluating whether the LScache crawler meets their requirements. While reading the crawler script I found some weak spots that increase server load, reduce crawling speed, and generate too much traffic. Not all of my improvements can be applied, because they depend on the cURL version in use; with version 7.29 I can only add one additional header.

My improvements:
*********************************

search for:

Code:
CURLRESULT=$(curl ${CURL_OPTS} -siLk -b name="${3}" -X GET -H "${1}" ${2} | tac | tac | sed '/Server: /q')
and replace with:

Code:
CURLRESULT=$(curl ${CURL_OPTS} -siLk -b name="${3}" -X GET -H "Accept-Encoding: gzip, deflate, br" -H "${1}" ${2} | tac | tac | sed '/Server: /q')
Adding the Accept-Encoding header makes the crawler request the compressed version of each URL, so that is the version that gets cached. Without it, the crawler caches the uncompressed version and LiteSpeed must generate a second, compressed version in addition to the existing uncompressed one.
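One way to confirm which variant a request actually returned is to look at the Content-Encoding response header. The following is only a minimal sketch: the helper name content_encoding and the "none" fallback are assumptions for illustration and are not part of the crawler script.

```shell
# Print the Content-Encoding of the final response found in raw headers on stdin.
# Header names are compared case-insensitively, since curl prints HTTP/2
# header names in lowercase. Prints "none" when the header is absent.
content_encoding() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "content-encoding" { v = tolower($2) }
        END { print (v == "" ? "none" : v) }
    '
}

# Example call (hypothetical URL), dumping only the headers to stdout:
# curl -s -o /dev/null -D - -H "Accept-Encoding: gzip, deflate, br" "https://example.com/" | content_encoding
```

If the output is "none" even though the Accept-Encoding header was sent, the response was delivered uncompressed.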

*********************************

The crawler tries to use HTTP/2 if the cURL version supports it, but for reasons I can't reproduce this doesn't work all the time, and the crawler then falls back to HTTP/1.1. To force HTTP/2, add the

Code:
--http2-prior-knowledge
parameter. This parameter doesn't work with cURL versions lower than 7.49.0.
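Because the flag depends on the cURL version, the script could add it only when the installed version is new enough. This is a rough sketch, not the crawler's actual code: version_ge, http2_opt_for and MIN_PRIOR_KNOWLEDGE are hypothetical names, and the threshold is kept in a variable so it can be adjusted (the curl man page lists 7.49.0 as the release that added the flag).

```shell
# Return 0 when dotted version $1 is greater than or equal to $2.
# Compares the first three numeric components; missing ones count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Assumption: first curl release that accepts --http2-prior-knowledge.
MIN_PRIOR_KNOWLEDGE="7.49.0"

# Print the extra curl option for the given curl version, or nothing.
http2_opt_for() {
    if version_ge "$1" "$MIN_PRIOR_KNOWLEDGE"; then
        echo "--http2-prior-knowledge"
    fi
}

# Example use inside a script:
# CURL_VERSION=$(curl --version | awk 'NR == 1 { print $2 }')
# CURL_OPTS="${CURL_OPTS} $(http2_opt_for "$CURL_VERSION")"
```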

*********************************

The function for crawling as a mobile device should be removed and completely rebuilt, because it is very insufficient. It is not difficult to write device detection in a few lines of code that is up to 99% accurate and can also differentiate between cell phones and tablets, but the current solution is unusable! If LiteSpeed needs support defining such a feature, feel free to contact me.

Michael
 

serpent_driver

Well-Known Member
#3
There are more improvements that can easily be added. For example, if someone uses cookies and has vary cache rules defined in .htaccess, add

--cookie "cookie_name=" //with double quotes

to the CURLRESULT definition, but take care of the "=" suffix. Without this suffix the cache rule will not work.
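The "=" matters on the curl side as well: when the argument to -b/--cookie contains no "=", curl treats it as the name of a file to read cookies from instead of as cookie data. A small guard like the following could prevent that mistake. It is only a sketch; cookie_arg and the cookie name in the example are made up for illustration.

```shell
# Ensure a cookie argument contains "=" so curl sends it as cookie data
# rather than interpreting it as a cookie file name.
cookie_arg() {
    case "$1" in
        *=*) printf '%s\n' "$1" ;;   # already name=value (value may be empty)
        *)   printf '%s=\n' "$1" ;;  # bare name: append the "=" suffix
    esac
}

# Example use (hypothetical cookie name and URL):
# curl -si --cookie "$(cookie_arg my_vary_cookie)" "https://example.com/"
```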
 

Unique_Eric

Administrator
Staff member
#4
Hi @serpent_driver,

I will adopt the compression part first.

Since curl with HTTP/1.1 works in all my Linux test environments, I am not planning to add the --http2-prior-knowledge part at this moment.

About the mobile part, I'm not sure I follow. Do you mean we should treat cell phones and tablets as two extra cache copies?

About the cookie part, can you help verify whether "-c" helps in your test?
-c, --with-cookie Crawl with site's cookies
Best,
Eric
 

serpent_driver

Well-Known Member
#5
Okay, I'll start from the end.

The -c parameter works for me if I have defined a vary cache rule based on the site's cookie, so it helps for me, if that is what you mean by "help". Please explain if I misunderstood the question. The -c parameter isn't the same as setting an extra cookie for a specific vary cache rule.

The mobile part is a little more complicated. Keeping the current UA name "lscache_crawler iphone" would work if everyone used an iPhone and no tablets existed, but that is not the main problem. A crawler function for mobile devices only makes sense if it follows the same device detection logic as the installed application. If the logic differs, and the current one is very different from the available device detection libraries, LScache will store the wrong version of a URL. Getting the crawler's mobile function to work isn't very complicated, but it should be a little more flexible. You only need to follow the way Google does it. This would make the function almost perfect.

https://developer.chrome.com/multidevice/user-agent
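The convention described on that page is that Chrome on Android phones includes "Mobile" in the User-Agent string, while Chrome on Android tablets omits it. A rough sketch of a classifier built on that convention could look like this. The function name classify_ua and the exact rule order are assumptions, and real-world User-Agent strings have exceptions that this does not cover.

```shell
# Classify a User-Agent string as phone, tablet, or desktop.
# Matching is done on a lowercased copy so "iphone" and "iPhone" behave alike.
classify_ua() {
    ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$ua" in
        *ipad*)            echo tablet ;;   # checked first: iPad UAs also contain "Mobile"
        *iphone*|*ipod*)   echo phone ;;
        *android*mobile*)  echo phone ;;    # Android with "Mobile" token
        *android*)         echo tablet ;;   # Android without "Mobile" token
        *mobile*)          echo phone ;;
        *)                 echo desktop ;;
    esac
}

# Example:
# classify_ua "lscache_crawler iphone"
```

A crawler could then send one User-Agent per device class that the application itself distinguishes, so that the cached copy matches what real visitors of that class receive.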

But there is still a big problem if you only use "iphone" to simulate a cell phone. iPhone means Apple, and all Apple devices have problems with server push, so LiteSpeed doesn't set the ls_smartpush cookie if an Apple device is used. (Ask George if you need details.) To avoid pushing resources on every request, I use the ls_smartpush cookie to push resources only if this cookie doesn't exist. If I warm up the cache with the LScache crawler, my URLs will always be cached without the ls_smartpush cookie, and as a result the resources defined for push will be pushed with every request. So it is better not to use "iphone" in the crawler settings to simulate a cell phone. Better use "mobile", but this will exclude tablets if an application's logic also differentiates between cell phones and tablets.

This is only a very short description, not a complete how-to for making it perfect, but it is better than the current solution. If you want to see how device detection can work together with LiteSpeed and LScache, visit: https://www.speedtemplate.de or https://www.carrando.com.

To me it looks like luck that HTTP/2 works with your curl version (7.29), because this version uses HTTP/2 by default and doesn't offer parameters to change the HTTP version. But that is no guarantee that it works all the time; it depends on the application. If an application sends insufficient request headers, LSWS will answer requests with HTTP/1.1. Maybe you remember such a case from my last tickets on this topic? The HTTP protocol problem with the LScache crawler comes from the way the crawler script checks whether HTTP/2 is available. See the function checkcurlver().

Code:
function checkcurlver(){
    curl --help | grep 'Use HTTP 2' > /dev/null
    if [ ${?} = 0 ]; then
        CURL_OPTS='--http1.1'
    fi
}

Older curl versions (like yours) don't contain the "Use HTTP 2" expression in their help output. Therefore grep finds no match, ${?} is never 0, and CURL_OPTS is never set, which is why the HTTP version handling doesn't work correctly all the time. Either remove this function completely or bind it to the curl version in use. Only newer versions support changing the HTTP version, but check the correct spelling of this parameter. The current one is wrong.
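Instead of grepping the help text, whose wording changes between releases, the check could read the "Features:" line that curl --version prints, which lists HTTP2 when curl was built with HTTP/2 support. The following is a minimal sketch of that idea. has_http2_feature is a hypothetical helper, not a function from the crawler script, and what to put into CURL_OPTS afterwards is left to the script.

```shell
# Return 0 when the given `curl --version` output lists HTTP2 as a feature.
# The output is passed in as an argument so the check can be tested offline.
has_http2_feature() {
    features=$(printf '%s\n' "$1" | sed -n 's/^Features: *//p')
    case " $features " in
        *" HTTP2 "*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example use inside a script:
# if has_http2_feature "$(curl --version 2>/dev/null)"; then
#     CURL_OPTS='--http1.1'   # or an HTTP/2 option, depending on the intent
# fi
```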
 