
Spider image copyright Gio Diaz. Used under a CC-SA license.

Recently I've seen some of our shared servers getting bogged down when web crawlers start working through some large sites. What these sites have in common is sections that require long database queries to compose a page. For example, one of these sites has a large database for a directory.

From the hosting perspective, a slow shared server is a distressing prospect. It's fine if one site with inefficient queries takes longer to load, but it's not fine when that load affects other sites on the same server. It's pretty common for hosting companies (at least those few who monitor this sort of thing) to react by booting the problematic site, usually recommending a dedicated or virtual server in the process. But this can be impractical or grossly unfair. Why should a web site that has accumulated a large database of information but that has low overall traffic be forced into a much more expensive hosting plan just because of the way web crawlers work?

The quick fix is to exclude the problematic pages from being crawled by disallowing them in robots.txt, but that's hardly a solution. After all, the point of putting all this information online is to have it found, and without being indexed for search, it's never going to be found!
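For what it's worth, the quick fix is only a couple of lines in robots.txt; the /directory/ path here is just a stand-in for whatever section of your site is slow:

# Keep all crawlers out of the slow section
User-agent: *
Disallow: /directory/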

Fortunately there are a few ways to solve this:

Improve Performance

Sometimes sites are slow because they haven't been updated. They use out-of-date extensions that perform worse than the current versions. Some extensions include performance tools (options like "rebuild index" or "recalculate statistics" often improve performance). Sometimes the code is just slow, or the database is poorly built. Try these things first, and if they fail, contact the developer of the extension that's causing problems to see if they have a solution.

But that's not all you can do. If you can't fix the problem, you can still work on the symptoms...

Use a Site Map

Site maps contain information on how frequently a page should be crawled for changes, plus a site map makes it easy for a crawler to discover new pages simply by comparing the current map with the version available the last time the bot visited. Combined, this means that bots need to do a lot less work to keep their index up to date, and less bot work means less server load.
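For a sense of what a crawler actually reads, here is a minimal sitemap.xml with a single entry; the URL and date are made up for illustration:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one entry per URL; lastmod and changefreq tell bots how stale the page might be -->
  <url>
    <loc>http://www.example.com/directory/some-listing</loc>
    <lastmod>2012-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>

A bot that honours lastmod and changefreq can skip pages that haven't changed, which is exactly the reduction in work we're after.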

There are a number of site map extensions for Joomla; my favourite is Xmap. Once you have a site map, most people go to Google Webmaster Tools and tell Google where to find it. That's fine for Google, but our experience is that Googlebot is already pretty good at levelling server load; it's the other crawlers that cause real trouble. So what about them?

The robots.txt file has a number of nonstandard extensions that can come in handy here. The first is the Sitemap directive. The directive is dead simple:

Sitemap: path-to-your-sitemap

The sitemap directive is supported by Bing and Yahoo, and presumably by many other crawlers. It's a good thing to do for any site that cares about Search Engine Optimization.
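For example, a typical robots.txt might look like this; the Disallow line is just Joomla's usual default, and the site map URL is a placeholder for wherever your site map extension publishes the file:

# Joomla's usual default exclusion
User-agent: *
Disallow: /administrator/

# Tell crawlers where the site map lives (placeholder URL)
Sitemap: http://www.example.com/sitemap.xml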

Set a Crawl Delay

So what happens if you get your site map set up and bots are still causing excessive load? There's a more brute force approach, which is another robots exclusion extension:

Crawl-Delay: time-in-seconds

The Crawl Delay directive instructs a bot to wait a certain number of seconds between web requests. If your slowest pages take 10 seconds to generate, setting the crawl delay to 15 seconds means the bot never has more than one request in flight at a time, so the load it puts on the server stays noticeable but manageable.
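You can also scope the delay to just the bots that are causing trouble. The names and the 15-second figure below are only an example:

# Ask these bots to wait 15 seconds between requests
User-agent: bingbot
Crawl-delay: 15

User-agent: Yandex
Crawl-delay: 15

Note that Googlebot ignores Crawl-Delay entirely -- its crawl rate is set in Google Webmaster Tools -- which is another reason to aim the directive at the bots that actually need it.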

Still, a crawler has to "play nice" and respect the robots file. What if it doesn't?

Block the Bot

At this point, all major web crawlers should be behaving and not overloading your server. But there are some bots that won't. When these bots crawl your site, you wind up with the same problem, just less frequently. What to do?

Consider the merits of being crawled by these bots. With the major search engines already under control, the next question is what purpose the problematic bot serves. By checking the User Agent string in your web logs, you can usually determine which bot is causing the problem. Next, verify that the agent string is not a forgery -- some hacker scripts pretend to be popular bots in order to mask their attacks. If the bot is legitimate, check the crawler's web site to see why it's visiting yours. If it serves no useful purpose for you -- for example, it's just evaluating back-links for an SEO service -- then you can write a .htaccess rule to block it completely.
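As a sketch, here is one way to do that with Apache's mod_rewrite in .htaccess; "BadBot" stands in for whatever string the offending crawler puts in its User Agent header:

# Return 403 Forbidden to any request whose User Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]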

Conclusions

If your site has performance problems but low overall traffic, don't let your web host force you into a much more expensive solution just because of web crawlers. Use these steps to first fix the performance issues, and if that fails, reduce server load from web crawlers. Stay on a low cost plan and keep creating great content!