Bruce R. Mitchell, BSEE, BSIE, BSCSci
Machine Intelligence
Late one evening -- or early one morning, depending on your viewpoint -- I was reading Apache logs, and feeling much aggravated over the need to do it. "The situation is becoming exponentially worse," I thought. "There used to be one site stripper a month, and now I have to deal with at least one a day. What's worse, the authors finally wised up and are now using browser ident strings, so the only way to tell it's a site stripper is looking at the logs after it's already too late. Like it or not, I'll have to start setting spider traps, and make sure they fall into them as quickly as possible to minimize the damage."
That raised an interesting question. Where should spider traps be placed so that the offending spider falls into one quickly? After thinking about it a while, I came up with these thoughts on the subject.
When setting spider traps, the primary object is of course to catch the spider. A secondary object nearly as important is to catch it as soon as possible, so that if it is going to do something hostile (strip the site, scan for email addresses, steal graphics, etc.) it is blocked as quickly as possible.
To beat the spiders at their own game, a little reverse-engineering is in order. It is necessary to know how a spider is going to approach the problem of extracting an entire web site without missing anything in the process. The selection of an efficient search algorithm is closely related to the structure being searched.
Thinking about web site structure, it is usual to consider that structure as a tree. The root of the tree is almost always http://this.web.site/index.html -- the one place that a spider can always expect to exist on any web site The site's actual internal structure and implementation (B-tree, ad hoc, HTML, SHTML, CGI or other) does not matter; what matters is that from an external viewpoint, it can be efficiently and completely treated as a tree.
This forces the spider author into a tree search algorithm. There are relatively few good ones, all discovered early in the history of Computer Science, and all treated at length in elementary data structures texts. Depending on how knowledgeable the spider author is, he may use any of the classic tree search algorithms, or reinvent the wheel and code his own from scratch. However, it is safe to assume that a tree search algorithm will be used, and it is probably also safe to assume that it will be one of the classic algorithms.
The classic tree search algorithms divide into two categories:
Consider a small web site with a structure as follows:
/index.html
|
/a/index.html /m/index.html /z/index.html
| |
/a/a.html /a/z.html /z/a.html /z/z.html
The root index.html file makes references to three other files, /a/index.html, /m/index.html and /z/index.html. The /a/index.html file in turn refers to /a/a.html and /a/z.html, and the /z/index.html file refers to /z/a.html and /z/z.html.
A depth-first search of this tree starts at the root /index.html. It finds all references to HTML entities in this file. Then it starts at either the first (/a/index.html) or last (/z/z.html) entity. Nothing says it must start at either, but first and last are typical. It finds all references to HTML entities in that file. The process iterates until it can go no deeper into the tree; then it goes back up one level, steps over one entity, and continues. This is ideal for recursive programming.
Let us say a spider is using depth-first search on this site, using the first entity at each level. It examines the pages in the order /index.html, /a/index.html, /a/a.html, /a/z.html, /m/index.html, /z/a.html, /z/z.html.
On the other hand, it could equally well use breadth-first search. This type of search also starts at the root, /index.html. It finds all references to HTML entities in this file. Then it starts at (again, typically) either the first or last entity, and scans the file for more HTML entities. Here the algorithms diverge.
Instead of using the references in this file to go deeper, it continues at the same level until all entities have been scanned. Then it goes down to the next level and scans all entities at the next level, etc. This is not quite so elegant to program, requiring more storage, but is perhaps simpler to comprehend for those spider authors of an uncomplicated mind-set, e.g. the "script kiddies."
Let us say a spideris using breadth-first search using the last entity at each level. It examines pages in the order /index.html, /z/index.html, /m/index.html, /a/index.html, /z/z/index.html, /z/a/index.html, /a/z/index.html, /a/a/index.html.
If we consider both these approaches from the point of setting spider traps that an abusive spider will hit quickly, we conclude trivially that there are five optimal places to set spider traps:
Setting spider traps in these five locations should produce good results trapping spiders that use classical search methods.
It should be noted, however, that this method is completely successful only for those spiders entering at the root of the web site. If a spider enters anywhere else, whether it sees a spider trap quickly depends on the search method the spider is using. If it uses breadth-first search, it may never see a spider trap in the subtree available from its starting point. On the other hand, if it uses depth-first search and the subtree includes the files from (4) or (5) above, it will quickly see and trip a spider trap.
Set spider traps in the locations described by (1) through (5) above to quickly catch spiders entering the site at the root. Spider traps should also be scattered randomly throughout the site to catch spiders entering at locations other than the root.
The author gratefully acknowledges the inspiration provided by Mr. Lee Killough and Mr. Neil Gunton for their documentation of spider trapping techniques using the Apache web server on their respective Web sites.
Copyright © 2002, Bruce R. Mitchell
All rights reserved