LSCWP Configuration Settings: Crawler

The crawler must be enabled at the server level or the virtual host level by a site admin. Please see: Enabling the Crawler at the Server or Virtual Host Level

Learn more about crawling on our blog.

500

Set the Delay, in microseconds, to tell LSCache how long to wait before sending each new request to the server. You can increase this value to lessen the load on the server, but be aware that doing so will make the entire crawling process take longer.

This setting may be limited at the server level. Learn more about limiting the crawler's impact on the server.

200

This is how long the crawler runs before taking a break. The default of 200 has the crawler run for 200 seconds, then it temporarily stops. After the break is over, the crawler will start back up exactly where it left off and run for another 200 seconds. This will continue until the entire site has been crawled.

28800

This setting determines the length of the break mentioned above. By default, the crawler rests for 28800 seconds (8 hours) between 200-second runs.

604800

This value determines how long to wait before re-initiating the entire crawling process. The default of 604800 seconds is one week. To keep your site regularly crawled, determine how long the crawler usually takes to run, and set this value to slightly longer than that.
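To get a feel for how long a full crawl might take, the following Python sketch combines the delay, run duration, break length, and thread count into a rough estimate. It is a simplified model, not the plugin's actual scheduler: the function name `estimate_crawl`, the `seconds_per_page` response time, and the assumption that work is split evenly across threads are all illustrative.

```python
import math

def estimate_crawl(url_count, seconds_per_page, delay_us=500,
                   run_duration_s=200, interval_s=28800, threads=3):
    """Rough wall-clock estimate for one full crawl (simplified model)."""
    # Assumption: each request costs its response time plus the delay,
    # and the work is split evenly across the threads.
    per_request_s = seconds_per_page + delay_us / 1_000_000
    active_s = url_count * per_request_s / threads
    # The crawler works in fixed-length runs separated by breaks.
    runs = math.ceil(active_s / run_duration_s)
    breaks = max(runs - 1, 0)
    total_s = active_s + breaks * interval_s
    return runs, total_s

# Hypothetical site: 10,000 URLs that each take about half a second.
runs, total_s = estimate_crawl(url_count=10_000, seconds_per_page=0.5)
print(f"{runs} runs, about {total_s / 86400:.1f} days")
```

With these made-up inputs the estimate comes to nine runs spread over roughly 2.7 days, so a one-week value here would leave comfortable headroom. Almost all of that time is spent in breaks, not in crawling.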

3

This is the number of separate crawling processes happening concurrently. The higher the number, the faster your site is crawled, but also the more load that is put on your server.
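As a rough illustration of what concurrent crawling means, the Python sketch below fetches a list of URLs with a fixed-size pool of worker threads. It is not the plugin's implementation; `crawl_concurrently` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_concurrently(urls, fetch, threads=3):
    """Fetch every URL using a fixed number of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order even though fetches overlap.
        return list(pool.map(fetch, urls))

def fake_fetch(url):
    # Placeholder for a real HTTP request; returns a made-up status code.
    return (url, 200)

results = crawl_concurrently(["/a", "/b", "/c", "/d"], fake_fetch, threads=3)
print(results)
```

Raising `threads` lets more requests be in flight at once, which is exactly why it also raises the load on the server.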

1

This setting keeps the crawler from monopolizing system resources. Once the server load reaches this limit, the crawler is terminated rather than being allowed to compromise server performance. This setting is based on Linux server load. (A completely idle computer has a load average of 0. Each running process either using or waiting for CPU resources adds 1 to the load average.)

This setting may be limited at the server level. Learn more about limiting the crawler's impact on the server.
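The load check can be pictured with a short Python sketch. This is only an illustration of the idea, assuming the one-minute load average is the value being compared; the function name `should_stop_crawler` is hypothetical, and the plugin's real check may differ.

```python
import os

def should_stop_crawler(load_limit=1.0, current_load=None):
    """Return True when the server load has reached the configured limit."""
    if current_load is None:
        # One-minute load average; os.getloadavg() is available on Unix only.
        current_load = os.getloadavg()[0]
    return current_load >= load_limit

# Explicit loads are passed here so the example runs on any platform.
print(should_stop_crawler(load_limit=1.0, current_load=0.2))
print(should_stop_crawler(load_limit=1.0, current_load=1.5))
```

On a Linux server, calling `should_stop_crawler(load_limit=1.0)` with no explicit load would read the current value from the system.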

Empty string

As of v1.1.1, you can enter your site’s IP address to simplify the crawling process and eliminate the overhead involved in DNS and Content Delivery Network (CDN) lookups. To understand why, let’s look at a few scenarios.

This is how it works if you’re using a CDN:

  1. The crawler gets http://yourserver.com/path from the sitemap
  2. The crawler checks with the DNS to find yourserver.com’s IP address
  3. The DNS returns the CDN’s IP address to the crawler
  4. The crawler goes to the CDN to ask for the page
  5. The CDN grabs the page from yourserver.com
  6. The CDN returns the page to the crawler

This is how it works if you’re not using a CDN:

  1. The crawler gets http://yourserver.com/path from the sitemap
  2. The crawler checks with the DNS to find yourserver.com’s IP address
  3. The crawler grabs the page from yourserver.com

In both scenarios, there are lookups that occur, expending time and resources. These lookups can be eliminated by entering your site’s IP in this field.

When the crawler knows your IP, this is how it works:

  1. The crawler gets http://yourserver.com/path from the sitemap
  2. The crawler grabs the page directly from yourserver.com because it already knows the IP address

The middlemen are eliminated, along with all of their overhead.
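One way to picture a direct request is to connect to the IP address while still telling the server which site is wanted through the `Host` header. The Python sketch below only builds such a request and sends nothing. It is an illustration, not how the plugin does it: `direct_request` is a hypothetical helper, `203.0.113.10` is a documentation-only example address, and an `https://` URL would need extra TLS handling that is not shown.

```python
from urllib.parse import urlsplit, urlunsplit

def direct_request(url, site_ip):
    """Return (url_with_ip, headers) that skip the DNS lookup for a URL."""
    parts = urlsplit(url)
    # Keep an explicit port if the original URL had one.
    netloc = site_ip if parts.port is None else f"{site_ip}:{parts.port}"
    target = urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))
    # The Host header tells the server which site the request is for.
    return target, {"Host": parts.hostname}

target, headers = direct_request("http://yourserver.com/path", "203.0.113.10")
print(target, headers)
```

Because the request goes straight to the origin address, neither the DNS server nor the CDN is involved.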

Empty list

By default, the crawler runs as a non-logged-in “guest” on your site. As such, the pages that are cached by the crawler are all for non-logged-in users. If you would like to also pre-cache logged-in views, you may do so here.

The crawler simulates a user account when it runs, so you need to specify user ID numbers that correspond to the roles you'd like to cache. For example, to cache pages for users with the “Subscriber” role, choose one user that has the “Subscriber” role, and enter that user's ID in the box.

You may crawl multiple points of view by entering multiple user IDs in the box, one per line.

NOTE: Only one crawler may run at a time, so if you have specified one or more user IDs in the Role Simulation box, the “Guest” crawler will run first, and then the role-based crawlers will run, one after the other.
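The ordering described in the note can be sketched in a few lines of Python. The function name `crawl_order` and the `guest` / `user:<id>` labels are made up for illustration; only the one-ID-per-line input and the guest-first order come from the text above.

```python
def crawl_order(role_box_text):
    """List the crawler passes in the order they would run."""
    # One user ID per line; blank lines are ignored.
    user_ids = [int(line) for line in role_box_text.splitlines() if line.strip()]
    # The guest pass always comes first, then one pass per user ID.
    return ["guest"] + [f"user:{uid}" for uid in user_ids]

print(crawl_order("12\n34\n"))
```

Each extra user ID adds another complete pass over the sitemap, so the total crawl time grows with the number of IDs entered.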

ON

This setting, when enabled, forces pages to be crawled using the HTTP/2 protocol.

Empty string

A sitemap tells the crawler which pages on your site should be crawled. By default, LSCache for WordPress generates its own sitemap. If, however, you already have a sitemap that you’d like to use, that is an option as of v1.1.1.

Enter the full URL to the sitemap in this field.

Note: the sitemap must be in Google XML Sitemap format.
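For reference, here is a minimal Python sketch of reading page URLs out of an XML sitemap. It assumes the standard sitemaps.org `urlset` schema, which is what XML sitemap generators normally produce; the sample URLs and the function name `sitemap_urls` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://yourserver.com/</loc><lastmod>2018-03-01</lastmod></url>
  <url><loc>http://yourserver.com/about/</loc><lastmod>2018-02-10</lastmod></url>
</urlset>"""

print(sitemap_urls(SAMPLE))
```

A file that opens in a browser but lacks this `urlset` structure would yield no URLs here, which is a quick way to spot a sitemap in the wrong format.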

Use these fields if you don't already have a custom sitemap to use.

Include Posts / Include Pages / Include Categories / Include Tags

on

These four settings determine which content types and taxonomies will be crawled. By default, all of them are.

Exclude Custom Post Types

Empty string

By default, all custom post types are crawled. If you have some that should not be crawled, list them in this field, one per line.

Date, descending

This field determines the order in which the crawler will parse the sitemap. By default, priority is given to the newest content on your site. Set this value so that your most important content is crawled first, in the event the crawler is terminated before it completes the entire sitemap.
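The default ordering amounts to sorting sitemap entries by date, newest first. The Python sketch below shows that idea with made-up entries; `order_for_crawl` is a hypothetical name and the (path, date) tuples are an assumed representation, not the plugin's data structure.

```python
from datetime import date

def order_for_crawl(entries):
    """Sort (path, last_modified) entries so the newest is crawled first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

entries = [
    ("/old-post/", date(2017, 5, 1)),
    ("/new-post/", date(2018, 3, 10)),
    ("/mid-post/", date(2017, 11, 20)),
]

for path, modified in order_for_crawl(entries):
    print(modified, path)
```

If the crawler is cut off partway through, the entries at the front of this list are the ones that will already be cached.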

  • Admin
  • Last modified: 2018/03/15 19:30
  • by Lisa Clarke