Charles J. Brabec has kindly given permission for me to host this. He asks that you not contact him on it, because he's too busy. I have updated the links and added some comments -- Lee



Protect Your Webserver From Spam Harvesters


I was browsing the log files on our webserver the other day and noticed a browser name I hadn't seen before: "EmailSiphon". As you might guess, that's the ID of a program that spammers use to harvest email addresses from your webpages. With a bit of research I found some of the most popular email collectors (judging by their availability on the net) and determined the browser ID that each one uses.

Known Spam Harvesting Programs

Harvester                       Browser ID String
ExtractorPro/WebWeasel          "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1 or
                                "ExtractorPro"
Harvester                       "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1
Web Mole                        "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1
Bull's Eye Gold                 "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2
Maverick II                     "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2
WebCollector                    "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2
Cherry Picker                   "CherryPicker/1.0" or
                                "CherryPickerSE/1.0" or
                                "CherryPickerElite/1.0"
Dynamic Web Wizard              "Microsoft URL Control - 5.01.4511" *3
Email Digger Pro                "Microsoft URL Control - 6.00.8140" *3
Email Collector                 "EmailCollector/1.0"
Email Wolf                      "EmailWolf 1.00"
NICErsPRO                       "NICErsPRO"
Advanced Email Extractor        "Mozilla/4.0 (compatible; Advanced Email Extractor v1.3)"
(www.mailutilities.com)         or any user-defined string *4
Nitro                           "Mozilla/3.Mozilla/2.01 (Win95; I)"
                                and many others! *4
Sonic Email Collector           "EmailSiphon" or
                                others *4
Telesoft (by softcell.net)      "Telesoft/1.29"
WebBandit                       "WebBandit/2.1" or
                                "WebBandit/3.50" or
                                "webbandit/4.00.0"
WebmailExtractor                "WebEMailExtractor/1.0B"
Zeus Internet Marketing Robot   "Zeus 2500 Webster Pro V2.9 Win32"
(by www.cyber-robotics.com)     (the number 2500 will vary)
                                this program is not designed for spamming per se,
                                but it does collect email addresses
List Sorcerer                   "Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95)" *5
Webmole 2000                    "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)" *5
WebSnake                        "Mozilla/3.0 (Win95; I)" *5
Atomic Harvester '98            "" (does not set browser ID) *6
Email Magnet                    "" (does not set browser ID) *6
eMailReaper                     "" (does not set browser ID) *6
Web Miner                       "" (does not set browser ID) *6
WebXtractor                     "" (does not set browser ID) *6

These Look Suspicious but They Are Probably OK

Other Suspicious Browsers

These are all browser IDs for which I have not found the corresponding programs. I have no idea whether they are used to harvest addresses or not.
This link will take you to a list of over 170 known browser IDs.

Thanks to Tom Shaw, Sketchy Albedo, JOWazzoo, Joseph (Anti-Abuse Team), Chris Grossman, Joseph Bridgewater, Stefan-Michael Guenther, Jerry Brenner, Tim Pierce, Kerry Beier, George Theall, David Brierley, MJ Farmer, and Bill Dimm for tips on some of these harvesters.

If your server logs the browser agent, you'll probably find that you get a few hits a week from each of these (well, at least our server seems to).
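If the browser agent is not being logged yet, a configuration along these lines is one common way to turn it on. This is only a sketch: it assumes the standard mod_log_config module is available, and the log file path is a placeholder rather than anything taken from the server described here.

```apache
# Define a log format that records the Referer and User-Agent request
# headers in addition to the usual common-log fields.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

# Write the access log in that format (the path is an assumption).
CustomLog logs/access_log combined
```

With this in place, the browser ID appears as the last quoted field on each log line.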

Now that we've identified the enemy, here's one way to take action against him. We run an Apache 1.3.x server that has the mod_rewrite module installed, which allows the server to transparently redirect requests based upon all kinds of nifty things, including the incoming browser agent. I added the following bit of information to our config file:

What this basically says is: if the incoming request comes from a browser that matches one of the known types, rewrite the request to the /badspammer.html page, no matter what was asked for. As long as the page you send them to has no email addresses on it, they will never get any addresses off of your site. Plus, if their program follows links, it won't add any links to its queue, so it will lower their traffic to your webserver as well.
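A rule set of the kind described above can be sketched roughly as follows. This is an illustration, not the original listing: it assumes the directives sit in the main server configuration with mod_rewrite loaded, it covers only a handful of the ID strings from the table, and the prefix patterns are chosen for the example.

```apache
# Sketch only: a few IDs from the table above; extend the list as needed.
RewriteEngine On

# Each condition tests the User-Agent header. [OR] chains the conditions
# so that a match on any single one is enough.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
# [NC] ignores case, so this also catches "webbandit/4.00.0".
RewriteCond %{HTTP_USER_AGENT} ^WebBandit [NC]

# Whatever was requested, serve the address-free page instead.
RewriteRule ^.*$ /badspammer.html [L]
```

ID strings that contain spaces, such as the Crescent one, would need to be quoted or have their spaces escaped in the pattern. In a per-directory (.htaccess) context, an extra condition excluding /badspammer.html itself would probably be needed to avoid a rewrite loop.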

Now, if your goal is to annoy the spammers, not just keep them off your site, you might look into a program like WPoison. This is a program that generates bogus email addresses and links for the spammers' harvest engines to gather. However, I no longer recommend the use of WPoison or other random email address generators. It is true that these addresses will pollute a spammer's list with undeliverable addresses. But in practice, these undeliverables are usually bounced back to the bogus From address in the message. That just subjects the real owner of that From address to more junk mail in response to a message that they never sent.
Thanks to Bruce Marcotte for pointing this out to me.


Footnotes

*1 - Crescent Internet ToolPak is a library set for MS Visual Basic. It allows VB programmers to write code that interacts directly with network protocols like SMTP, FTP, HTTP, and others. It is not inherently a bad thing, but a fair number of spammer wannabes have written their harvesters using VB and this library. If you block this browser ID, you might cause problems with legitimate programs that use Crescent. Use your own discretion.


*2 - Again, use your discretion when blocking the ID "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)"; it might disable other, legitimate programs.

From J. Peter Mugaas:

That header is generated by an .OCX that shipped with several development environments such as Borland Delphi. I just generated a false positive with a program I had written with that control in Delphi.


*3 - The ID "Microsoft URL Control - x.xx.xxxx" is part of the Microsoft MSINET.OCX that I believe is used by the WinInet API under Visual C++. Just because this one harvester is written with this library does not mean that all hits from this ID are coming from harvesters... other software developers MAY be using this API for more honest applications, like search engine robots.

If you are a member of the Microsoft Site Builder Network, you probably do NOT want to block this ID. Site Builder's search robot was built with this OCX and thus identifies itself with this browser ID. If you block the agent, Site Builder will not be able to properly index your website. Thanks to Christopher Ostmo for pointing this out to me.


*4 - Finally (and sadly), the spam software authors got smart. The ID string for Nitro is set to "Mozilla/3.Mozilla/2.01 (Win95; I)" by default. BUT, Nitro has an option to set the browser ID to any string you want to use (and it provides a long list of known browser IDs to choose from). Obviously we cannot filter all of them, but at least you can filter the default value and try to catch the lusers who don't change it. Sonic and Advanced Email Extractor also allow the user to provide any browser ID that they choose.


*5 - These browser IDs are the same as ones used by versions of common browsers. You can block them if you like, but there's a good chance you'll block out legitimate browser users too.


*6 - As I noted above, several of the spam harvesting programs attempt to avoid detection by not setting their browser ID at all. One can easily write a rewrite rule to redirect unidentified browsers to whatever page you like, but that would cause undue confusion for other people whose browsers do not set their IDs. I did an impromptu survey of our webserver here and discovered that, of the approximately 200,000 hits we received in one week, 90 were from unidentified browsers, or about five-hundredths of one percent (0.045%). To my thinking, that's a small enough percentage that I can get away with banning all access to our server from unidentified browsers. Here's the rewrite that does this for us:

The noID.pl script on our site simply sets the status to "403 Forbidden" as part of the headers and then prints an access-denied message informing the user that we do not accept unidentified browsers. The spam harvesters won't get any email addresses from that page, and the rest of the world will get an explanation for why they couldn't get into our site.
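A rule for unidentified browsers might look like the following sketch. It is not the original listing: the /cgi-bin/ location of noID.pl is an assumption, as is placing the directives in the main server configuration.

```apache
RewriteEngine On

# An empty User-Agent value matches ^$ (the variable also expands to an
# empty string when the header is missing altogether).
RewriteCond %{HTTP_USER_AGENT} ^$

# Hand every such request to the script. [PT] passes the rewritten URL
# on so that a ScriptAlias for /cgi-bin/ (assumed here) still applies.
RewriteRule ^.*$ /cgi-bin/noID.pl [PT,L]
```

If no explanatory page is wanted, the same condition followed by `RewriteRule ^.*$ - [F]` should make mod_rewrite answer with 403 Forbidden on its own, without a script.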


Last update: 03-Apr-2001 (Charles Brabec)

Last update: (Lee Killough)
