Toomre Capital Markets LLC

Real-Time Capital Markets -- Analytics, Visualization, Event Processing, and Intelligence

Dealing with Aggressive Spiders and Bots on Drupal Websites

The IP address in question maps to a spider computer hosted on the Yandex Enterprise Network. Roughly every ninety seconds, a spider process on that computer (still) attempts to retrieve yet another piece of content from the Toomre Capital Markets ("TCM") website. Many of the pages this spider requests either do not exist or fall under the Disallow rules in the robots.txt file. This spider is certainly aggressive, and it ignores the rules that many other bots seem to respect.

A few months ago, after watching this particular ill-behaved spider consume more than five percent of the total hosting bandwidth used that month, we had had enough. We therefore modified a custom Drupal module on the TCM website that tracks visitors (including various bots) and the specific information each one is seeking. As a result, when this Yandex spider comes looking for a page like "search/node/facebook", it somehow ends up redirected to a page on a third-party website.
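The module itself is not shown here. As a rough sketch of the idea only, written in Python rather than as a Drupal module, the redirect decision might look something like the following. The names BLOCKED_AGENT_TOKENS, DISALLOWED_PREFIXES, REDIRECT_TARGET, and decide_response are hypothetical, and the redirect URL is a placeholder.

```python
from typing import Optional, Tuple

# Hypothetical values for illustration; none of these come from the TCM module.
BLOCKED_AGENT_TOKENS = ("yandex",)            # substrings matched against the User-Agent
DISALLOWED_PREFIXES = ("/search/",)           # paths that robots.txt asks bots to skip
REDIRECT_TARGET = "https://example.invalid/"  # placeholder third-party URL


def decide_response(user_agent: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (status, location) for a redirect, or None to serve the page normally."""
    agent = user_agent.lower()
    is_flagged_bot = any(token in agent for token in BLOCKED_AGENT_TOKENS)
    wants_disallowed = any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)
    # Redirect only when a flagged bot asks for a page it was told not to crawl.
    if is_flagged_bot and wants_disallowed:
        return (302, REDIRECT_TARGET)
    return None
```

Matching on the User-Agent string alone is easy to evade, so a real deployment would probably combine it with the visitor's IP address.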

One would think that the person(s) controlling the spider would get the message after some fifty thousand plus attempts to get information from the TCM website. Somewhere, a human user might wonder why attempts to retrieve information on structured finance products, risk management, and/or MATLAB topics always result in a page full of "gay anal porn" or other similar material. Until then, the TCM website might well become a frequent referrer to certain pornographic websites.

Of course, there are other ways of dealing with such obnoxious bot visitors. One frequently used approach is to block a specific IP address in the file that controls the Apache server used to host many Drupal-based websites. However, the question then becomes whether that is really a solution when one is bombarded by more than five hundred zombie/bot visitors that appear to be probing for various web server vulnerabilities.
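To give a sense of what the Apache approach involves, here is a minimal sketch that generates the deny rules for a list of addresses. It assumes Apache 2.4 and its "Require not ip" syntax (Apache 2.2 used "Deny from" instead); which file and which Apache version apply to a given site is not specified here. The function name apache_deny_block is hypothetical, and the addresses in the test are reserved documentation addresses, not real offenders.

```python
import ipaddress
from typing import Iterable, List


def apache_deny_block(addresses: Iterable[str]) -> str:
    """Build an Apache 2.4 <RequireAll> block that denies the given IPs or CIDR ranges."""
    lines: List[str] = ["<RequireAll>", "    Require all granted"]
    for address in sorted(set(addresses)):
        # Raises ValueError on a malformed entry, so bad input never reaches the config.
        ipaddress.ip_network(address, strict=False)
        lines.append(f"    Require not ip {address}")
    lines.append("</RequireAll>")
    return "\n".join(lines)
```

With five hundred or more addresses, this block grows by one line per visitor and has to be regenerated every time a new one appears, which is the maintenance burden described above.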

Fortunately, once an IP address is identified as a malicious user or as part of a bot network, that visitor can now be dealt with appropriately, which often means terminating the dynamic generation of the requested web page. Bandwidth use has been significantly reduced, and the website has become more responsive to genuine users. If you would like more information on how TCM can help solve a similar problem with your Drupal website, please contact us.
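For readers curious about what terminating page generation early can look like, the following is a minimal sketch using Python's WSGI interface rather than Drupal. It is not the TCM module; the class name BlockListMiddleware and its block list are hypothetical. The point is that the check runs before the application does any expensive work.

```python
import ipaddress
from typing import Iterable


class BlockListMiddleware:
    """WSGI wrapper that answers blocked visitors before the page is generated."""

    def __init__(self, app, blocked: Iterable[str]):
        self.app = app
        # Accept single addresses or CIDR ranges, e.g. "192.0.2.10" or "192.0.2.0/24".
        self.networks = [ipaddress.ip_network(entry, strict=False) for entry in blocked]

    def is_blocked(self, remote_addr: str) -> bool:
        try:
            address = ipaddress.ip_address(remote_addr)
        except ValueError:
            return False  # unparseable address: let the application decide
        return any(address in network for network in self.networks)

    def __call__(self, environ, start_response):
        if self.is_blocked(environ.get("REMOTE_ADDR", "")):
            # Short, empty reply: no database queries, no template rendering.
            headers = [("Content-Type", "text/plain"), ("Content-Length", "0")]
            start_response("403 Forbidden", headers)
            return [b""]
        return self.app(environ, start_response)
```

Because a blocked request never reaches the wrapped application, each rejected hit costs little more than a lookup against the list.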