You may remember my previous blog post, which covered customizing your website based on the location of the person visiting it. Itching for some more tech info for your website? Don’t worry; I’ve got you covered.
Let’s talk spambots. The internet is full of these automated scripts that crawl around the web for any number of reasons. Some, like the crawlers from Google and Bing, are there to read content, store relevant data and analyze it for search results. Those are the friendly bots that you want to come visit.
However, there are a lot of other crawlers that you really don’t want on your site. These evil bots may come to find email addresses to put on their spam lists, steal content or look for weaknesses in your code that allow them to cause damage.
The “good” crawlers can be taught. You can instruct these bots how to correctly crawl a site, telling them where they can go and where they can’t, and they actually listen! These instructions can be given in a file named “robots.txt”, or via meta tags at page level. Need details about instructing bots? Check out http://www.robotstxt.org/.
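If you are curious how a well-behaved crawler reads those instructions, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt rules, the `/admin/` and `/private/` paths, the example.com URLs and the `is_allowed` helper are all made up for illustration; your own file will look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site; the paths are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-following crawler may fetch the given URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/admin/login"):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Keep in mind that this only describes what a polite crawler chooses to do. Nothing in robots.txt actually prevents a request.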
The bad crawlers, however, do not listen to instructions, and controlling them requires a much more heavy-handed approach.
Think that you may be bogged down by some evil bots? The first thing you need to do is identify these troublesome crawlers. This can be done by finding the name of the crawler, as everything crawling around out there sends something called a “User Agent”. This is the name or identity of the crawler; for example, Google’s is called “Googlebot”. Most web analytics programs have a way to aggregate a list of the user agents that are visiting your website. When you acquire this list, you can search the web for info on each user agent or visit a resource such as http://www.user-agents.org/ to learn about them.
Here are some resources I have used, with varying levels of effectiveness:
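If your analytics program does not give you that list, you can build a rough one yourself from the server’s access log. The sketch below assumes the common “combined” log format, where the user agent is the last quoted field on each line. The sample lines, the IP addresses and the “EvilScraper” name are invented for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, where the user agent is the last
# double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count how many requests each user agent string made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample log lines; "EvilScraper" is a hypothetical bot name.
SAMPLE_LOG = [
    '198.51.100.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /contact HTTP/1.1" 200 512 "-" '
    '"EvilScraper/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:41 +0000] "GET /about HTTP/1.1" 200 734 "-" '
    '"EvilScraper/1.0"',
]

if __name__ == "__main__":
    # Print the busiest user agents first.
    for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file instead of the sample list, since the function only needs something it can loop over line by line.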
Perishable Press Blacklist:
A great blacklisting script that helps filter out not only known user agents, but also IP addresses and known malicious requests intended to find weaknesses in your server.
Stop Forum Spam:
A community-driven project that holds a database of known spammers on websites that allow user-generated content.
A system that helps you moderate user-generated content and contact forms; it analyzes each submission and accepts or rejects it based on signals that indicate spam.
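To give a feel for the kind of filtering a blacklist performs, here is a minimal sketch that checks a request against blocked user agents, blocked IP ranges and suspicious request paths. It is not how any of the projects above are implemented, and every entry in the lists (the bot names, the IP range and the path fragments) is a placeholder I made up.

```python
import ipaddress

# Hypothetical blacklist entries; replace them with ones from your own research.
BLOCKED_AGENT_SUBSTRINGS = ("evilscraper", "emailharvester")
BLOCKED_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)  # documentation range
SUSPICIOUS_PATH_FRAGMENTS = ("../", "/etc/passwd")


def should_block(ip: str, user_agent: str, path: str) -> bool:
    """Return True if the request matches any of the blacklists."""
    agent = user_agent.lower()
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True

    # ip_address() raises ValueError if the string is not a valid IP address.
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True

    lowered_path = path.lower()
    return any(fragment in lowered_path for fragment in SUSPICIOUS_PATH_FRAGMENTS)


if __name__ == "__main__":
    print(should_block("198.51.100.7", "Googlebot/2.1", "/blog/"))
    print(should_block("198.51.100.7", "EvilScraper/1.0", "/blog/"))
    print(should_block("198.51.100.7", "Mozilla/5.0", "/files/../../etc/passwd"))
```

A bot can claim any user agent it likes, so a name-based check like this only catches the lazy ones. That is one reason the IP address and request checks are worth having alongside it.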
I hope my experience with these bots, both good and evil, will help you determine how to best handle your own crawlers!
What other sites have helped you manage spambots?