[SOLVED] [Crawler Script] Curl operation/command failed due to server return code - 404, exit

#1
Hello,

I have an active license but I'd like to try asking for help here first before opening a ticket. Someone else maybe [hopefully] has had this same issue/experience (and this could help others). This is for LiteSpeed Web Server (not OpenLiteSpeed).

I recently converted from CentOS 7 to AlmaLinux 8.7 (I use WHM/cPanel).
On CentOS 7 this [crawler] script worked with no problems. So I'm wondering if there is something I am missing when it comes to configuration.

I'm using Magento [2.4.5-p1], but this topic is more about the crawler script - I'm using the script linked on this page: Crawler Script | Magento 2 | LiteSpeed Documentation (litespeedtech.com). EDIT: I should also add that I have the crawler "enabled" as instructed HERE (the same as it was on CentOS 7).

As I mentioned above, it worked for the past two months with no issues, but now that I have upgraded this is what happens:

When I run the standard command (bash or sh - the error is the same):
Code:
sh M2-crawler.sh -c -r -v https://my-domain.com/sitemap.xml
or
sh M2-crawler.sh https://my-domain.com/sitemap.xml
The following error pops up in the terminal:
Code:
Curl operation/command failed due to server return code - 404, exit
If I run this command (for individual URLs) it runs no problem:
Code:
sh M2-crawler.sh -c -v -d https://my-domain.com/category
If I remove the "-d", it says "Sitemap connection success, but is not a valid xml", which is accurate because it's a plain URL rather than a sitemap - so the script is at least sort of working, just not completely.

Now you might be thinking "well, the .xml sitemap file is missing" - I have checked that by going to the URL, and the sitemap file loads just fine. I also re-ran the chown command on the crawler script file (I checked ls -l and it's the same owner config as it was on CentOS 7), and I tried 777 permissions on both the crawler file and the .xml file. No such luck. I have searched on here and googled for hours and can't seem to find much on this, so I'm thinking it's specific to LSWS and the crawler? There are no errors in the "error_log" file in public_html and no errors in the cPanel > Errors section.

Could anyone please point me in the right direction? I basically need the crawler because my store is very heavy, and without it many pages take around 5-8 seconds to load, which is not good for customers (once a page is cached it loads as intended, but it's that first uncached load that hurts the worst).

Thank you in advance, I really appreciate your time.
 

AndreyPopov

Well-Known Member
#6
As shown in the doc page, sh and bash are the same. sh or bash makes no difference.
sh and bash are almost the same, but bash is an extended version of sh that supports additional syntax, variables, and more programming features.
Yes, on some Linux distributions sh is an alias for bash: when you start sh, the system actually runs bash.
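A quick way to check which shell /bin/sh really points to on a given box (generic commands, nothing specific to the crawler):
Code:
ls -l /bin/sh        # on RHEL-family systems this is usually a symlink to bash
readlink -f /bin/sh  # shows the resolved target, e.g. /usr/bin/bash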
 

serpent_driver

Well-Known Member
#7
Code:
Curl operation/command failed due to server return code - 404, exit
This error is generated by the crawler script, so it has nothing to do with bash or sh.

@QBProducts
Please check the curl version and verify if the crawler function is enabled.
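For example, something like this from the shell (the include path below is the usual EA4 location - adjust if yours differs):
Code:
curl --version | head -1
grep -i crawler /etc/apache2/conf.d/includes/pre_main_global.conf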
 

AndreyPopov

Well-Known Member
#8
Code:
Curl operation/command failed due to server return code - 404, exit
This error is generated by the crawler script, so it has nothing to do with bash or sh.
Maybe, or maybe not.

The error is generated by this code:

Bash:
function validmap(){
    CURL_CMD="curl -IkL -w httpcode=%{http_code}"
    CURL_MAX_CONNECTION_TIMEOUT="-m 100"
    CURL_RETURN_CODE=0
    CURL_OUTPUT=$(${CURL_CMD} ${CURL_MAX_CONNECTION_TIMEOUT} ${SITEMAP} 2> /dev/null) || CURL_RETURN_CODE=$?
    if [ ${CURL_RETURN_CODE} -ne 0 ]; then
        echoR "Curl connection failed with return code - ${CURL_RETURN_CODE}, exit"
        exit 1
    else
        HTTPCODE=$(echo "${CURL_OUTPUT}" | grep 'HTTP'| tail -1 | awk '{print $2}')
        if [ "${HTTPCODE}" != '200' ]; then
            echoR "Curl operation/command failed due to server return code - ${HTTPCODE}, exit"
            exit 1
        fi
        echoG "SiteMap connection success \n"
    fi
}
I'm not sure whether the ${SITEMAP} variable, or the whole construction, is handled correctly by sh:
Bash:
CURL_OUTPUT=$(${CURL_CMD} ${CURL_MAX_CONNECTION_TIMEOUT} ${SITEMAP} 2> /dev/null) || CURL_RETURN_CODE=$?
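One way to see what the script actually receives is to run the same curl call by hand and extract the last HTTP status, the same way validmap() does (the URL below is a placeholder for your real sitemap):
Bash:
# same request validmap() makes, with the same status extraction
curl -IkL -m 100 "https://my-domain.com/sitemap.xml" 2> /dev/null | grep 'HTTP' | tail -1 | awk '{print $2}'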
 

serpent_driver

Well-Known Member
#9
This crawler script runs on hundreds of hosts, so why shouldn't it run now?! If there are any errors with this variable, the error would be different.
 

AndreyPopov

Well-Known Member
#10
This crawler script runs on hundreds of hosts, so why shouldn't it run now?! If there are any errors with this variable, the error would be different.
The error is 404 - NOT FOUND!!!!!

What is not found?

And why do the docs recommend bash?
And why does the script itself use bash?

Bash:
        echow "0. bash M2-crawler.sh -h                 ## help"
        echow "1. bash M2-crawler.sh SITE-MAP-URL       ## When desktop and mobile share same theme"
        echow "2. bash M2-crawler.sh -m SITE-MAP-URL    ## When desktop & mobile have different theme"
        echow "3. bash M2-crawler.sh -g -m SITE-MAP-URL ## Use general user-agent when mobile view not working"
        echow "4. bash M2-crawler.sh -c SITE-MAP-URL    ## For brining cookies case"
        echow "5. bash M2-crawler.sh -b -c SITE-MAP-URL ## For brining cookies case and blacklist check"
 
#11
Yes, it's not the script itself that is the problem, as on CentOS 7 (before my recent upgrade) it worked perfectly fine (sh or bash). It's since I upgraded from CentOS 7 to AlmaLinux 8.7 a few days ago that I'm having issues with the script. This is more environment-related, but I have gone through everything and haven't been able to figure it out, hence why I posted here. When I did the upgrade it seemed to remove all PHP (7.x & 8.x), not sure why, but I had to re-add it, which means all my extensions and modules got changed [unfortunately].

Please check the curl version and verify if the crawler function is enabled.
Thank you for posting, I much appreciate it.

I looked at the PHP info and it says 7.61.1 for the curl version - is that outdated or not the correct version? I will attach a screenshot; maybe something else with it is disabled that shouldn't be? And I should have mentioned in the OP, I do have the crawler enabled in the pre_main_global.conf file (as instructed HERE), just as it was before the upgrade.

[Screenshot: almalinux-87-curl-version.jpg]
 

serpent_driver

Well-Known Member
#12
The curl version is up-to-date and I would have been surprised if curl had been outdated on a modern OS.

Can you make sure the URL to the sitemap.xml is correct? The error that is displayed clearly indicates that the sitemap.xml cannot be found at the specified URL.

Can you try enabling debugging in the crawler script?
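If the script has no dedicated debug switch, running it under bash with tracing enabled should at least show every curl command it builds (a generic bash feature, not something from the crawler docs):
Code:
bash -x M2-crawler.sh -c -r -v https://my-domain.com/sitemap.xml 2>&1 | tee crawler-trace.log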
 
#13
Can you make sure the URL to the sitemap.xml is correct? The error that is displayed clearly indicates that the sitemap.xml cannot be found at the specified URL.
Yup, as mentioned in the OP, I was thinking the exact same thing, but I made sure to check the sitemap location by going to it directly in the browser and it loads fine - the URL is correct. Magento is also not blocking the file, as it would 404 in the browser if it were (I made sure to add the rewrite exception to the .htaccess file - though again, this didn't change from CentOS 7).

Can you try enabling debugging in the crawler script?
I'm not sure how to do this. I looked in the .sh file and only found references to "debugurl", which I assume is related to crawling one URL at a time. At the beginning of the file there is "DEBUGURL=OFF"; I changed that to "DEBUGURL=ON" but it made no difference. In Magento I turned "Enable Debug" on, but nothing pertaining to the crawler showed up in the log files (other lines did show up relating to the caching).

The only thing I can think of is that I must be missing something with the PHP modules/extensions (because, as noted earlier, those got messed with during the upgrade and I had to redo them, so maybe I didn't enable something I should have). Here are my enabled Apache modules and PHP packages:

Code:
config
config-runtime
mod_brotli
mod_buffer
mod_bwlimited
mod_cache
mod_cache_socache
mod_cgid
mod_deflate
mod_env
mod_expires
mod_fcgid
mod_headers
mod_lsapi
mod_mpm_worker
mod_proxy
mod_proxy_fcgi
mod_proxy_http
mod_proxy_wstunnel
mod_security2
mod_socache_redis
mod_ssl
mod_suexec
mod_suphp
mod_unique_id
mod_version
tools

Code:
build
libc-client
pear
php-bcmath
php-bz2
php-calendar
php-cli
php-common
php-curl
php-dba
php-devel
php-enchant
php-exif
php-fileinfo
php-fpm
php-ftp
php-gd
php-gettext
php-gmp
php-iconv
php-imap
php-intl
php-ldap
php-litespeed
php-mbstring
php-memcached
php-mysqlnd
php-odbc
php-opcache
php-pdo
php-pgsql
php-posix
php-process
php-pspell
php-snmp
php-soap
php-sockets
php-sodium
php-tidy
php-xml
php-zip
runtime
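
For reference, the loaded extensions can also be cross-checked from the shell, though the CLI PHP may differ from the PHP the vhost uses under MultiPHP:
Code:
php -m | sort                    # extensions the CLI PHP actually loads
php -i | grep -iE 'curl|libxml'  # confirm curl and xml support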

I much appreciate your time on this.
 

serpent_driver

Well-Known Member
#14
Please try this in console:

Code:
curl -IkL -w "httpcode=%{http_code}" "https://my-domain.com/sitemap.xml"   # replace my-domain.com with your domain
and this

Code:
curl -IkL -w "httpcode=%{http_code}" -X GET "https://my-domain.com/sitemap.xml"   # replace my-domain.com with your domain
and post the results
 

AndreyPopov

Well-Known Member
#15
In the script there is only one place where the SITEMAP variable is read from the input:


Bash:
            SITEMAP=${1}
            storexml ${SITEMAP}

storexml calls validmap.
validmap gives the 404 error.

It looks like the input parameter ${1} is not being stored in SITEMAP.
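A quick way to check whether sh passes ${1} into SITEMAP correctly is a standalone test (my own sketch, not part of the crawler):
Bash:
#!/bin/sh
# save as test.sh, then compare: sh test.sh https://my-domain.com/sitemap.xml
#                           and: bash test.sh https://my-domain.com/sitemap.xml
SITEMAP=${1}
echo "SITEMAP='${SITEMAP}'"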
 
#20
I cannot open this URL in the browser!!

It redirects to the main page.
It is not my real domain, as @serpent_driver noted - I was just using it as an example. My domain isn't "top secret"; I just don't want bots finding it - I've posted my link before (not here, but other places like GitHub) and got bots spamming my site.

Please try this in console:
Soo, I tried those commands and what a rabbit hole I went down. I got page source back, but the title of the page was something like "404 - Page not found" - it had loaded the Magento 404 page. I started thinking maybe there really is something up with the files (I have a few .xml files I use for sitemap crawling, each with different links).

Well, I decided to test the links using my phone and what do you know, they 404'd...0_0

If you'd rather not read the long-drawn-out explanation: it ended up being a dummy-me moment. I forgot that I had changed the DocRoot path to /pub (for Magento best practices) after setting up AlmaLinux, which means any files sitting in the /public_html/ directory wouldn't be found. Once I remembered that and placed the files into the /pub/ folder, they are now found and crawl with no problem with the crawler script :facepalm:. I knew it was something I did, I just couldn't figure out what. In the end, I much appreciate the help and the commands - I wouldn't have figured it out without them.

Here are the steps I went through (for anyone else who may run into this - edge case as it may end up being):
  1. I use a modified hosts file on my computer to bypass Cloudflare (because Magento takes longer than 1m30s to do many tasks like CSV imports, and I hate it when Cloudflare times out at the 1m30s mark).
  2. I never put an SSL cert on my server because I use Cloudflare's certs live - so no real need for a server cert. Oddly enough (no idea why), while I was bypassing Cloudflare the .xml files showed fine in the browser. But when I reverted my hosts file to browse through Cloudflare, bam, the files 404'd on my computer as well.
  3. So now I started thinking it must be Cloudflare doing some shenanigans and blocking them somehow. I looked through the WAF (firewall) and no "block" was showing.
  4. There were no blocks happening in the WHM ModSec either.
  5. I decided to run your commands in the Windows terminal (instead of the command line in cPanel), and the terminal threw errors saying it wouldn't connect because the link wasn't secure (I had put my hosts file back to bypassing Cloudflare because Cloudflare was doing some weird stuff).
  6. So I decided to add a cert to the server. Once this was added, I couldn't get the .xml files to show in the browser even when bypassing Cloudflare. ...like whaat? So maybe it's not Cloudflare either; maybe it's something with SSL and AlmaLinux and single files. ...grrr, getting annoyed now.
  7. After some research I couldn't find anything. So I had a random thought: let's move all the Magento files into a folder except the sitemap.xml file and see what happens (because it was 404ing to a Magento 404 page - not a server/LiteSpeed 404 page - so Magento seemed to be involved somehow).
  8. When browsing to the main URL + .xml file, it would show a LiteSpeed 404 page. It also showed a LiteSpeed 404 page on just the main URL (without the .xml file), so I had to turn on "Indexes" in the Global Config in WHM to see a listing of files. When I did that, it said "creating /public_html/pub directory".
  9. Then I remembered - this was the key [and dummy-me] moment, I guess - I had changed the DocRoot path after setting up AlmaLinux. 0_0 ...like, really? I completely forgot that I had done that. So that means any files in the /public_html/ directory were not going to be found; they had to be under the /public_html/pub directory.
It all stemmed from the fact that, while I was bypassing Cloudflare, the .xml files showed up fine in the browser for some odd reason - so I kept thinking they were there and being found. But when I checked the file on my phone and it 404'd, I started thinking. Then, once I added the server cert and couldn't load the files in the browser even while bypassing Cloudflare, things started changing. What the SSL had to do with it, I have no clue. Very odd.
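For anyone hitting the same thing, the document root a cPanel vhost actually serves can be confirmed from the shell (user/domain/paths below are placeholders for your own setup):
Code:
grep -i documentroot /var/cpanel/userdata/<user>/<domain>   # docroot cPanel has on record
ls -l /home/<user>/public_html/pub/sitemap.xml              # make sure the sitemap actually lives there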

In the end it just came down to the DocRoot path having been changed :shakingmyhead:. Like... gahh, what a crazy process trying to figure this out. But again, I much appreciate the help on this from both of you, @serpent_driver and @AndreyPopov. I apologize for having wasted your time - but I'm glad to have received the help :) :thumbsup:
 