A slap on the collective wrist for the Baidu and Yandex search engines for ignoring robots.txt rules!
We noticed on a couple of client sites that bandwidth was being sucked into a no-return-on-investment void. After digging through the statistics and analytics, we found that these two search engines' bots were hammering the server quite hard while completely ignoring the robots.txt file.
As a fix, I got in contact with Paul Arlott from Tolra Systems, who always loves a good challenge, and we added the following to the top of the .htaccess file. So far it seems to have blocked the Baidu and Yandex bots from spidering the website.
SetEnvIfNoCase User-Agent "Baidu" spammer=yes
SetEnvIfNoCase User-Agent "Yandex" spammer=yes
SetEnvIfNoCase User-Agent "Sosospider" spammer=yes
<Limit GET PUT POST>
deny from env=spammer
</Limit>
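For readers curious what those directives actually do: SetEnvIfNoCase performs a case-insensitive pattern match against the named request header, and the env variable then drives the deny rule. Here is a minimal Python sketch of that matching behaviour (the function name and token list are our own illustration, not Apache code):

```python
# Hypothetical re-creation of the case-insensitive User-Agent test
# that SetEnvIfNoCase applies before setting the "spammer" variable.
BLOCKED_TOKENS = ["Baidu", "Yandex", "Sosospider"]

def is_blocked(user_agent: str) -> bool:
    """Return True if any blocked token appears in the UA string, ignoring case."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_TOKENS)

print(is_blocked("Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"))
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))
```

Note that SetEnvIfNoCase takes a regular expression, so plain tokens like "Baidu" behave as simple substring matches, which is all the block above relies on.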
So far so good, though of course any determined bot will always try to find a way through. Let us know how you have blocked unwanted spiders from your website.
In contrast to the above, please do not block these search spiders if you are doing business in Russia and/or China. Yandex is by far the most popular search engine in Russia, with approximately 61% of the market share according to LiveInternet.ru stats. If you want your website indexed by Yandex, then don't block it.
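If you do want these spiders indexing your site but need to tame their appetite, a gentler option than an outright block is to slow them down in robots.txt. Yandex and Baidu have historically honoured a Crawl-delay directive (the 10-second value below is just an example), though, as this post shows, compliance is never guaranteed:

```
User-agent: Yandex
Crawl-delay: 10

User-agent: Baiduspider
Crawl-delay: 10
```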
We have since updated the file to include Sosospider, which we found repeatedly hitting a server.