Charles J. Brabec has kindly given me permission to host this. He asks that you not contact him about it, because he's too busy. I have updated the links and added some comments -- Lee
I was browsing the log files on our webserver the other day and I noticed a browser name I hadn't seen before: "EmailSiphon". As you might guess, that's the ID of a program that spammers use to harvest email addresses from your webpages. With a bit of research I found some of the most popular email collectors (judged by how widely available they are on the net), and determined the browser ID that each uses.
| Harvester | Browser ID String |
|---|---|
| ExtractorPro/WebWeasel | "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1 or "ExtractorPro" |
| Harvester | "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1 |
| Web Mole | "Crescent Internet ToolPak HTTP OLE Control v.1.0" *1 |
| Bull's Eye Gold | "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2 |
| Maverick II | "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2 |
| WebCollector | "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" *2 |
| Cherry Picker | "CherryPicker/1.0" or "CherryPickerSE/1.0" or "CherryPickerElite/1.0" |
| Dynamic Web Wizard | "Microsoft URL Control - 5.01.4511" *3 |
| Email Digger Pro | "Microsoft URL Control - 6.00.8140" *3 |
| Email Collector | "EmailCollector/1.0" |
| Email Wolf | "EmailWolf 1.00" |
| NICErsPRO | "NICErsPRO" |
| Advanced Email Extractor (www.mailutilities.com) | "Mozilla/4.0 (compatible; Advanced Email Extractor v1.3)" or any user-defined string *4 |
| Nitro | "Mozilla/3.Mozilla/2.01 (Win95; I)" and many others! *4 |
| Sonic Email Collector | "EmailSiphon" or others *4 |
| Telesoft (by softcell.net) | "Telesoft/1.29" |
| WebBandit | "WebBandit/2.1" or "WebBandit/3.50" or "webbandit/4.00.0" |
| WebmailExtractor | "WebEMailExtractor/1.0B" |
| Zeus Internet Marketing Robot (by www.cyber-robotics.com) | "Zeus 2500 Webster Pro V2.9 Win32" (the number 2500 will vary). This program is not designed for spamming per se, but it does collect email addresses. |
| List Sorcerer | "Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95)" *5 |
| Webmole 2000 | "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)" *5 |
| WebSnake | "Mozilla/3.0 (Win95; I)" *5 |
| Atomic Harvester '98 | "" (does not set browser ID) *6 |
| Email Magnet | "" (does not set browser ID) *6 |
| eMailReaper | "" (does not set browser ID) *6 |
| Web Miner | "" (does not set browser ID) *6 |
| WebXtractor | "" (does not set browser ID) *6 |
- Bobby 3.0 and Bobby/3.1 - Bobby is a web-based tool that analyzes web pages for their accessibility to people with disabilities. Bobby is available from www.cast.org as either an online web form or a downloadable Java application.
- CopyRightCheck 1.0 (and related CopyGuard) - Klaus Schallhorn, author of CopyRightCheck, wrote to me and explained that his programs are used to search for possible copyright infringements with pages owned by his clients. He suggests that you visit his service description page or his longer help page for more information. Both pages are written in German, however. (Now here: http://www.kso.co.uk/ -- Lee)
He also mentions that he runs a number of related spiders, with names like linkCheck, severify, mwatch, aa_bannercheck, and webaccel, all of which would be run from the kso.co.uk domain.
- Digimarc WebReader/1.x - From what I've seen this grabs only image files. Since Digimarc is in the digital signatures for images business, I suspect this is merely a program they have to check for signed graphics files on the web.
- DigOut4U - This seems to be a meta-search program designed to allow people to make "natural language" queries which are then reformatted and passed to the major search engines. The results are then ranked and the DigOut program proceeds to analyze the top result pages to make sure they are relevant to the user's query. In short, it's designed to be a search tool, not an address harvester. See http://www.arisem.com/en/index.html for more info.
- Harvest/1.4pl2 - This is the indexing program used by the Harvest search engine system, which is used to index web pages, not to gather email addresses.
- Mata Hari/1.10 - This is another meta-search program. It sends your query to a set of search engines, grabs the results, then downloads the pages for you and searches them locally to give you a better ranking for your search. In short, it does not collect email addresses. See this page for more info and the 30-day demo copy. (It is now called LexiBot -- Lee)
- Mozilla/3.0 (compatible; WebCapture 1.0; Auto; Windows) - This is the agent ID used by Adobe Acrobat 4.0. This version of Acrobat has a built in spider to allow you to capture a web page or site and convert it to a PDF document.
- SpaceBison/0.01 [fu] (Win67; X; ShonenKnife) - This is the default UserAgent string for a Windows proxy program called The Proxomitron. It's not a harvester. It is a handy tool for people who want to filter webpages before they view them, to remove annoying things like advertising banners, javascript pop-ups, and so on. It can also be used to hide HTTP headers, including the UserAgent string. Thus, it is possible for this program to show up under any number of different ID's, just like the Nitro harvester mentioned above.
- WebReaper - Offline browser tool. Gathers pages for later viewing, not to collect addresses. See http://www.webreaper.net/. Mark Otway, author of WebReaper, has written to me to confirm that "WebReaper is not an email harvester (and never will be)." He also points out that his website contains his statement on spam, if you want to go see it for yourself.
- Wget/1.5.3 - Wget is a GNU Unix utility that fetches web pages. It leaves the user agent set to "Wget/1.5.3" by default, but the agent can be set to anything. When used in recursive or mirror mode, it fetches and obeys robots.txt. See: http://www.gnu.org/software/wget/wget.html (But there are cheats to circumvent robots.txt -- Lee)
- Xenu's Link Sleuth 1.0p - A Link checking program for Windows. See http://home.snafu.de/tilman/xenulink.html.
- zzZ - I don't have program info for this one, but my logs show mostly HEAD requests (i.e. not grabbing the contents of the page) so it's not likely to be used to get addresses.
These are all browser ID's that I have not found the corresponding programs for. I have no idea whether they are used to harvest addresses or not.
- Mozilla/4.04 [en]C-bls40 (Win95; U) - Olivier Salaun points out that his log files show harvesting activity coming from a jax.bellsouth.net address using this browser ID. The ID itself appears to belong to the customized version of Netscape that is provided by BellSouth to its users, but Olivier points out that his logs show unusual activity for a person using a browser by hand (specifically, it requested 5 separate mailing-list pages from his site within 1 second... not many people click that fast!). We suspect someone might be spoofing this ID with their harvester.
- Microsoft Internet Explorer/5.0 - Robert Hood found this spider on his webserver. The Agent string is not the one that is actually used by Microsoft's program, but clearly is designed to make people think it is harmless. From the logs he sent me, it appears that this is a fairly well behaved spider: it checked /robots.txt and it always had an 8-10 second delay between hits on his server. All of the hits in this sighting came from the IP address 216.35.18.246, which belonged to exodus.net.
- Microsoft Internet Explorer/4.40.426 (Windows 95) - Bill Dimm found someone using this browser ID who was not well behaved. He writes: I've never seen it check robots.txt and it often requests many pages per second. It sent a few thousand page requests to our server in a few hours and it does send requests for things in cgi-bin (prohibited by our robots.txt).
- Scooter/2.0 G.R.A.B. V1.1.0 - This is AltaVista's search spider. It does not harvest addresses (as far as I know) but I have seen some very unpleasant behavior from it. On Tuesday, Oct 23, 2000, it asked for one of my site's robots.txt file. When it got a "404 Not Found" response, it asked again... for a total of 1461 tries in 25 seconds. I am still beating my head against the tech support at AV trying to find somebody with a clue behind the wall of level-1 support droids, but so far they have been unhelpful.
- CAST (this might have something to do with Bobby, above)
- Summit Site Validator
- TE
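Several of the sightings above were spotted by request rate rather than by ID (five pages within one second, or 1461 tries in 25 seconds). One rough way to look for that pattern in your own logs is a sliding-window count per client. The following is only a sketch under assumptions: the `find_bursts` helper, its thresholds, and the sample hits are hypothetical, and it expects you to have already extracted (client, timestamp-in-seconds) pairs from your log.

```python
from collections import defaultdict, deque


def find_bursts(hits, max_hits=5, window=1.0):
    """Return the set of clients that made at least `max_hits` requests
    within any span of `window` seconds.

    hits: iterable of (client, timestamp_in_seconds) pairs.
    """
    recent = defaultdict(deque)  # client -> timestamps still inside the window
    flagged = set()
    for client, ts in sorted(hits, key=lambda hit: hit[1]):
        stamps = recent[client]
        stamps.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - stamps[0] > window:
            stamps.popleft()
        if len(stamps) >= max_hits:
            flagged.add(client)
    return flagged


# Made-up data: one client clicking impossibly fast, one pausing 9 seconds.
sample_hits = [("fast.example", 0.2 * i) for i in range(5)]
sample_hits += [("polite.example", 9.0 * i) for i in range(5)]

print(find_bursts(sample_hits))
```

The defaults mirror the "5 pages within 1 second" sighting; a well-behaved spider with an 8-10 second delay between hits would not be flagged.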
This link will take you to a list of over 170 known browser ID's.
Thanks to Tom Shaw, Sketchy Albedo, JOWazzoo, Joseph (Anti-Abuse Team), Chris Grossman, Joseph Bridgewater, Stefan-Michael Guenther, Jerry Brenner, Tim Pierce, Kerry Beier, George Theall, David Brierley, MJ Farmer, and Bill Dimm for tips on some of these harvesters.
If your server logs the browser agent, you'll probably find that you get a few hits a week from each of these (well, at least our server seems to).
Now that we've identified the enemy, here's one way to take action against him. We run an Apache 1.3.x server that has the mod_rewrite module installed, which allows the server to transparently redirect requests based upon all kinds of nifty things, including the incoming browser agent. I added the following bit of information to our config file:
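If you want to check your own logs for these IDs, a tally like the following may help. This is a minimal sketch, not part of the original setup: it assumes the server writes Apache's combined log format (where the User-Agent is the last quoted field), and the `count_agents` helper and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: combined log format, which ends with "referer" "user-agent".
LOG_TAIL = re.compile(r'"[^"]*" "(?P<agent>[^"]*)"\s*$')


def count_agents(lines):
    """Return a Counter mapping each User-Agent string to its number of hits."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match:  # silently skip lines that are not in the assumed format
            counts[match.group("agent")] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    '192.0.2.1 - - [03/Apr/2001:10:00:00 -0500] "GET / HTTP/1.0" 200 512 "-" "EmailSiphon"',
    '192.0.2.1 - - [03/Apr/2001:10:00:01 -0500] "GET /staff.html HTTP/1.0" 200 900 "-" "EmailSiphon"',
    '192.0.2.7 - - [03/Apr/2001:10:05:00 -0500] "GET / HTTP/1.0" 200 512 "-" "-"',
]

for agent, hits in count_agents(sample).most_common():
    print(hits, agent)
```

Apache writes "-" when a request carries no User-Agent header, so the count for "-" is the number of hits from unidentified browsers.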
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.*$ /badspammer.html [L]
What this basically says is, if the incoming request comes from a browser that matches one of the known types, rewrite his request to the /badspammer.html page, no matter what he asked for. If you just make sure the page you send them to has no email addresses on it, they will never get any addresses off of your site. Plus, if their program follows links, it won't add any links to its queue so it will lower their traffic to your webserver as well.
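To see which IDs from the table the conditions above actually catch, the same patterns can be tried outside Apache. The sketch below is only an approximation: it assumes Python's `re` module treats these simple patterns the same way mod_rewrite's regex engine does, and `is_known_harvester` is a hypothetical helper, not something the server uses.

```python
import re

# The patterns are copied from the RewriteCond lines above.
BLOCKED_PATTERNS = [
    r"^EmailSiphon",
    r"^EmailWolf",
    r"^ExtractorPro",
    r"^Mozilla.*NEWT",
    r"^Crescent",
    r"^CherryPicker",
    r"^[Ww]eb[Bb]andit",
    r"^WebEMailExtrac.*",
    r"^NICErsPRO",
    r"^Telesoft",
    r"^Zeus.*Webster",
    r"^Microsoft.URL",
    r"^Mozilla/3.Mozilla/2.01",
    r"^EmailCollector",
]
BLOCKED = [re.compile(pattern) for pattern in BLOCKED_PATTERNS]


def is_known_harvester(agent):
    """True if the User-Agent matches any pattern, like the [OR] chain."""
    return any(pattern.search(agent) for pattern in BLOCKED)


for agent in [
    "webbandit/4.00.0",
    "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)",
    "Zeus 2500 Webster Pro V2.9 Win32",
    "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)",
]:
    print(is_known_harvester(agent), agent)
```

Note that the IDs shared with ordinary browsers (footnote *5) are deliberately not in the list, so harvesters using those strings pass through.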
Now, if your goal is to annoy the spammers, not just keep them off your site, you might look into a program like WPoison. This is a program that generates bogus email addresses and links for the spammers' harvest engines to gather. However, I no longer recommend the use of WPoison or other random email address generators. It is true that these addresses will pollute a spammer's list with undeliverable addresses. But in practice these undeliverables are usually bounced back to the bogus From address in the message. That just subjects the real owner of that From address to more junk mail in response to a message that they never sent.
Thanks to Bruce Marcotte for pointing this out to me.
*1 - Crescent Internet ToolPak is a library set for MS Visual Basic. It allows VB programmers to write code that interacts directly with network protocols like SMTP, FTP, HTTP, and others. It is not inherently a bad thing, but a fair number of spammer wannabes have written their harvesters using VB and this library. If you block this browser ID, you might potentially cause problems with legitimate programs that use Crescent. Use your own discretion.
*2 - Again, use your discretion when blocking the ID "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)"; it might disable other, legitimate programs.
From J. Peter Mugaas:
That header is generated by an .OCX that shipped with several development environments such as Borland Delphi. I just generated a false positive with a program I had written with that control in Delphi.
*3 - The ID "Microsoft URL Control - x.xx.xxxx" is part of the Microsoft MSINET.OCX that I believe is used by the WinInet API under Visual C++. Just because this one harvester is written with this library does not mean that all hits from this ID are coming from harvesters... other software developers MAY be using this API for more honest applications, like search engine robots.
If you are a member of the Microsoft Site Builder Network, you probably do NOT want to block this ID. Site Builder's search robot was built with this OCX and thus identifies itself with this browser ID. If you block the agent, Site Builder will not be able to properly index your website. Thanks to Christopher Ostmo for pointing this out to me.
*4 - Finally, (and sadly), the spam software authors got smart. The ID string for Nitro is set to "Mozilla/3.Mozilla/2.01 (Win95; I)" by default. BUT, Nitro has an option inside of it to set the browser ID to any string you want to use (and they provide a long list of known Browser ID's to choose from.) Obviously we cannot filter all of them, but at least you can filter the default value and try to catch the lusers who don't change it. Sonic and Advanced Email Extractor also allow the user to provide any Browser ID that they choose.
*5 - These browser IDs are the same as ones used by versions of common browsers. You can block them if you like, but there's a good chance you'll block out legitimate browser users too.
*6 - As I noted above, a couple of the spam harvesting programs attempt to avoid detection by not setting their browser-id tag at all. One can easily write a rewrite rule to redirect unidentified browsers to whatever spam page you like, but that would cause undue confusion to other people whose browsers do not set their ids. I did an impromptu survey of our webserver here, and discovered that of the approximately 200,000 hits we received in one week, 90 of them were from unidentified browsers, or about five-hundredths of one percent (0.045 %). To my thinking, that's a small enough percentage that I can get away with banning all access to our server from unidentified browsers. Here's the rewrite that does this for us:
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^.*$ /usr/local/etc/httpd/cgi-bin/noID.pl [L,T=application/x-httpd-cgi]
The noID.pl script on our site simply sets the status to "403 Forbidden" as part of the headers and then prints an access denied type message informing the user that we do not accept unidentified browsers. The spam harvesters won't get any email addresses from that page, and the rest of the world will get an explanation for why they couldn't get into our site.
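The original noID.pl is a Perl script and is not reproduced here. As an illustration of the same idea, here is a minimal CGI sketch in Python instead; the wording of the message and the `build_response` name are my own assumptions. A CGI program reports its status to the server through a `Status:` header, followed by a blank line and then the body.

```python
#!/usr/bin/env python3
import sys


def build_response():
    """Return a complete CGI response: headers, blank line, then the body."""
    body = (
        "Access denied.\n"
        "This server does not accept requests from browsers that do not "
        "identify themselves with a User-Agent header.\n"
    )
    headers = (
        "Status: 403 Forbidden\r\n"
        "Content-Type: text/plain\r\n"
        "\r\n"
    )
    return headers + body


if __name__ == "__main__":
    # When run by the web server as a CGI program, write the response.
    sys.stdout.write(build_response())
```

Keeping the body free of email addresses and links matters here for the same reason as with the /badspammer.html page: a harvester that lands on it has nothing to collect.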
Last update: 03-Apr-2001 (Charles Brabec)
Last update: (Lee Killough)