Crawler¶

The crawler travels through your site, refreshing pages that have expired in the cache. This makes it less likely that your visitors will encounter uncached pages.

The crawler must be enabled at the server-level or the virtual host level by a site admin. Please see: Enabling the Crawler at the Server or Virtual Host Level

Learn more about crawling on our blog.

Summary Tab¶

Crawler Cron¶

See the progress of the various crawlers enabled for your site. You can monitor the progress of each crawler via the color-coded rectangles in the Status column.

Note

Crawlers cannot run concurrently. If you have multiple crawlers defined, they will run one at a time.

Use the Reset Position button to start the crawler at the beginning again.

Use the Manually run button to start the crawler without waiting for the cron job.

Status Key¶

Gray, Waiting to be Crawled: the page is in the queue to be crawled
Green, Cache Hit: the page is already cached, so the crawler skipped it
Blue, Cache Miss: the page was not already cached, so the crawler has cached it
Red, Blocklisted: the page cannot be crawled (See the Blocklist section for more information.)

Watch Crawler Status¶

If you've opted to watch the crawler status, your screen will look something like the image below. The messages in the status window will vary from site to site.

Here is an explanation of some of the terms:

Size: The number of URLs in the sitemap. This example has 181.
Crawler: Indicates which crawler number you are watching. It's number 1 in this example. There could be multiple crawlers working, depending on your settings.
Position: The URL number currently being fetched from the sitemap list.
Threads: Indicates the number of threads currently being used to fetch URLs. There may be multiple threads fetching. It is smart and will adjust based on your load settings.
Status: Indicates the current crawler status. In this example, Stopped due to reset meta position means that the site purged or the sitemap changed while it was crawling, and as such, the crawler will restart from the top.

Map Tab¶

Sitemap List¶

This page displays the URIs currently in the crawler map. If you don't see any listed, try pressing the Refresh Crawler Map button.

You can search for a particular URL, which is helpful if your sitemap is large.

You can filter your URL list to only display those with a particular cache status (Cache Hit or Cache Miss) or those that have been added to the blocklist (Blocklisted). Click the corresponding link to do so.

You can manually add URIs to the blocklist via the button next to each entry.

The Crawler Status column uses colored dots to give you the status of each URI. See below the table for the key.

To start from scratch with the crawler map, press the Clean Crawler Map button.

Blocklist Tab¶

Blocklist¶

This page displays the URIs currently in the blocklist.

From here you can manually remove URIs from the blocklist via the button next to each entry.

The Status column uses colored dots to give you the status of each URI. See below the table for the key.

To start from scratch and clear out the blocklist, press the Empty Blocklist button.

General Settings Tab¶

Crawler¶

OFF

Set the to ON to enable crawling for this site.

Delay¶

500

The crawler sends requests to the backend, one page after another, as it traverses your site. This can put a heavy load on your server if there is no pause between requests. Set the Delay (in microseconds) to let LSCache know how often to send a new request to the server. The default is 500 microseconds (or 0.0005 seconds). You can increase this amount to lessen the load on the server, just be aware that will make the entire crawling process take longer.

This setting may be limited at the server level. Learn more about limiting the crawler's impact on the server.

Run Duration¶

400

In order to keep your server from getting bogged-down with behind-the-scenes crawling, you can put limits on the crawling duration. For example, if you leave Run Duration set to the default 400 seconds, then the crawler will run for nearly seven minutes before taking a break. After the break is over, the crawler will start back up exactly where it left off and run for another 400 seconds. This will continue until the entire site has been crawled.

Interval Between Runs¶

600

This setting determines the length of the break mentioned above. By default, the crawler rests for 600 seconds in between every 400-second run.

Crawl Interval¶

302400

How often do you want to re-initiate the crawling process? This depends on how long it takes to crawl your site. The best way to figure this out is to run the crawler a couple of times and keep track of the elapsed time. Once you’ve got that amount, set Crawl Interval to slightly more than that. For example, if your crawler routinely takes 4 hours to complete a run, you could set the interval to 5 hours (or 18000 seconds).

Threads¶

3

When Threads is set to the default of 3, then there are three separate crawling processes happening concurrently. The higher the number, the faster your site is crawled, but also the more load that is put on your server.

Timeout¶

30

The crawler has this many seconds to crawl a page before moving on to the next page. Value can range from 10 to 300 seconds.

Server IP¶

This option has moved to the General section as of v3.3

Server Load Limit¶

1

This setting is a way to keep the crawler from monopolizing system resources. Once the server's load average reaches this limit, the crawler will be terminated, rather than allowing it to compromise server performance.

Example

To limit the crawler so that it stops running when half of your server resources are being consumed, set Server Load Limit to 0.5 for a one-core server, 1 for a two-core server, 2 for a four-core server, and so on.

Note

This setting may be overridden at the server level. Learn more about limiting the crawler's impact on the server.

Simulation Settings Tab¶

Role Simulation¶

Empty list

By default, the crawler runs as a non-logged-in "guest" on your site. As such, the pages that are cached by the crawler are all for non-logged-in users. If you would like to also pre-cache logged-in views, you may do so here.

The crawler simulates a user account when it runs, so you need to specify user id numbers that correspond to the roles you'd like to cache.

Example

To cache pages for users with the "Subscriber" role, choose one user that has the "Subscriber" role, and enter that user's ID in the box.

You may crawl multiple points-of-view by entering multiple user ids in the box, one per line.

Note

Only one crawler may run at a time, so if you have specified one or more user ids in the Role Simulation box, first the "Guest" crawler will run, and then the role-based crawlers will run, one after the other.

To crawl for a particular cookie, enter the cookie name, and the values you wish to crawl for. Values should be one per line, and can include a blank line. There will be one crawler created per cookie value, per simulated role. Press the + button to add additional cookies, but be aware the number of crawlers grows quickly with each new cookie, and can be a drain on system resources.

Example

If you crawl for Guest and Administrator roles, and you add testcookie1 with the values A and B, you have 4 crawlers:

Guest, testcookie1=A
Guest, testcookie1=B
Administrator, testcookie1=A
Administrator, testcookie1=B

Add testcookie2 with the values C, D, and and you suddenly have 12 crawlers.

Guest, testcookie1=A, testcookie2=C
Guest, testcookie1=B, testcookie2=C
Administrator, testcookie1=A, testcookie2=C
Administrator, testcookie1=B, testcookie2=C
Guest, testcookie1=A, testcookie2=D
Guest, testcookie1=B, testcookie2=D
Administrator, testcookie1=A, testcookie2=D
Administrator, testcookie1=B, testcookie2=D
Guest, testcookie1=A, testcookie2=
Guest, testcookie1=B, testcookie2=
Administrator, testcookie1=A, testcookie2=
Administrator, testcookie1=B, testcookie2=

There aren't many situations where you would need to simulate a cookie crawler, but it can be useful for sites that use a cookie to control multiple languages or currencies.

Example

WPML uses the _icl_current_language= cookie to differentiate between visitor languages. An English speaker's cookie would look like _icl_current_language=EN, while a Thai speaker's cookie would look like _icl_current_language=TH. To crawl your site for a particular language, use a Guest user, and the appropriate cookie value.

Sitemap Settings Tab¶

Custom SiteMap¶

Empty string

A sitemap tells the crawler which pages on your site should be crawled. You can use a third party XML sitemap generation tool, and enter the sitemap's full URL in this field.

Multiple sitemaps are allowed, one per line.

Note

The sitemap must be in Google XML Sitemap format.

Drop Domain from Sitemap¶

ON

The crawler will parse the sitemap and save it into the database before crawling. When parsing the sitemap, dropping the domain can save DB storage.

Warning

If you are using multiple domains for one site, and you have multiple domains in the sitemap, please keep this option OFF. Otherwise, the crawler will only crawl one of the domains.

Sitemap Generation¶

As of v3.6, sitemap generation is no longer supported in LSCWP. Please use a third party sitemap generation tool, and enter the URL in the Custom SiteMap field above.

Last update: April 23, 2024

Crawler¶

Summary Tab¶

Crawler Cron¶

Status Key¶

Watch Crawler Status¶

Map Tab¶

Sitemap List¶

Blocklist Tab¶

Blocklist¶

General Settings Tab¶

Crawler¶

Delay¶

Run Duration¶

Interval Between Runs¶

Crawl Interval¶

Threads¶

Timeout¶

Server IP¶

Server Load Limit¶

Simulation Settings Tab¶

Role Simulation¶

Cookie Simulation¶

Sitemap Settings Tab¶

Custom SiteMap¶

Drop Domain from Sitemap¶

Sitemap Generation¶