
Every time I search the web for information on how to block spam bots, scrapers, and harvesters, I always see an Apache .htaccess file or some code to dump into httpd.conf to achieve this. I’m a bit against using this method for blocking evil bots. I do respect Apache for being a flexible & modular web server (that’s why I still use it), but I do not have much to boast about Apache’s speed and efficiency.
To achieve the same purpose on my server with greater efficiency, I made use of my Varnish reverse proxy configurations (located under /usr/local/etc/varnish/default.vcl).
In this post, I will only be discussing about vcl_recv subroutine, which gets called when a client request is received.
First of all, I have to create a white list. Here for example, I had to make use of the client.ip variable to allow livejournal’s URI::Fetch agent to get my site to lookup my OpenID servers. I also allowed scraper scripts from my own in-house server.
## Called when a client request is received sub vcl_recv { # block list # 204.9.177.18 = livejournal.com if ( client.ip != "204.9.177.18" && client.ip != "192.168.2.102") {
I then added all the user agent regular expressions that I do not like.
if ( req.http.user-agent ~ "^$" || req.http.user-agent ~ "^Java" || req.http.user-agent ~ "^Jakarta" || req.http.user-agent ~ "IDBot" || req.http.user-agent ~ "id-search" || req.http.user-agent ~ "User-Agent" || req.http.user-agent ~ "compatible ;" || req.http.user-agent ~ "ConveraCrawler" || req.http.user-agent ~ "^Mozilla$" || req.http.user-agent ~ "libwww" || req.http.user-agent ~ "lwp-trivial" || req.http.user-agent ~ "curl" || req.http.user-agent ~ "PHP/" || req.http.user-agent ~ "urllib" || req.http.user-agent ~ "GT:WWW" || req.http.user-agent ~ "Snoopy" || req.http.user-agent ~ "MFC_Tear_Sample" || req.http.user-agent ~ "HTTP::Lite" || req.http.user-agent ~ "PHPCrawl" || req.http.user-agent ~ "URI::Fetch" || req.http.user-agent ~ "Zend_Http_Client" || req.http.user-agent ~ "http client" || req.http.user-agent ~ "PECL::HTTP" || req.http.user-agent ~ "panscient.com" || req.http.user-agent ~ "IBM EVV" || req.http.user-agent ~ "Bork-edition" || req.http.user-agent ~ "Fetch API Request" || req.http.user-agent ~ "PleaseCrawl" || req.http.user-agent ~ "[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}" || req.http.user-agent ~ "layeredtech.com" || req.http.user-agent ~ "WEP Search" || req.http.user-agent ~ "Wells Search II" || req.http.user-agent ~ "Missigua Locator" || req.http.user-agent ~ "ISC Systems iRc Search 2.1" || req.http.user-agent ~ "Microsoft URL Control" || req.http.user-agent ~ "Indy Library" || req.http.user-agent == "8484 Boston Project v 1.0" || req.http.user-agent == "Atomic_Email_Hunter/4.0" || req.http.user-agent == "atSpider/1.0" || req.http.user-agent == "autoemailspider" || req.http.user-agent == "China Local Browse 2.6" || req.http.user-agent == "ContactBot/0.2" || req.http.user-agent == "ContentSmartz" || req.http.user-agent == "DataCha0s/2.0" || req.http.user-agent == "DataCha0s/2.0" || req.http.user-agent == "DBrowse 1.4b" || req.http.user-agent == "DBrowse 1.4d" || req.http.user-agent == "Demo Bot DOT 16b" || req.http.user-agent == "Demo Bot Z 16b" || req.http.user-agent == "DSurf15a 01" || req.http.user-agent == "DSurf15a 71" || req.http.user-agent == "DSurf15a 81" || req.http.user-agent == "DSurf15a VA" || req.http.user-agent == "EBrowse 1.4b" || req.http.user-agent == "Educate Search VxB" || req.http.user-agent == "EmailSiphon" || req.http.user-agent == "EmailWolf 1.00" || req.http.user-agent == "ESurf15a 15" || req.http.user-agent == "ExtractorPro" || req.http.user-agent == "Franklin Locator 1.8" || req.http.user-agent == "FSurf15a 01" || req.http.user-agent == "Full Web Bot 0416B" || req.http.user-agent == "Full Web Bot 0516B" || req.http.user-agent == "Full Web Bot 2816B" || req.http.user-agent == "Guestbook Auto Submitter" || req.http.user-agent == "Industry Program 1.0.x" || req.http.user-agent == "ISC Systems iRc Search 2.1" || req.http.user-agent == "IUPUI Research Bot v 1.9a" || req.http.user-agent == "LARBIN-EXPERIMENTAL (efp@gmx.net)" || req.http.user-agent == "LetsCrawl.com/1.0 +http://letscrawl.com/" || req.http.user-agent == "Lincoln State Web Browser" || req.http.user-agent == "LMQueueBot/0.2" || req.http.user-agent == "LWP::Simple/5.803" || req.http.user-agent == "Mac Finder 1.0.xx" || req.http.user-agent == "MFC Foundation Class Library 4.0" || req.http.user-agent == "Microsoft URL Control - 6.00.8xxx" || req.http.user-agent == "Missauga Locate 1.0.0" || req.http.user-agent == "Missigua Locator 1.9" || req.http.user-agent == "Missouri College Browse" || req.http.user-agent == "Mizzu Labs 2.2" || req.http.user-agent == "Mo College 1.9" || req.http.user-agent == "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)" || req.http.user-agent == "Mozilla/3.0 (compatible; Indy Library)" || req.http.user-agent == "Mozilla/4.0 (compatible; Advanced Email Extractor v2.xx)" || req.http.user-agent == "Mozilla/4.0 (compatible; Iplexx Spider/1.0 http://www.iplexx.at)" || req.http.user-agent == "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent" || req.http.user-agent == "Mozilla/4.0 efp@gmx.net" || req.http.user-agent == "Mozilla/5.0 (Version: xxxx Type:xx)" || req.http.user-agent == "MVAClient" || req.http.user-agent == "NameOfAgent (CMS Spider)" || req.http.user-agent == "NASA Search 1.0" || req.http.user-agent == "Nsauditor/1.x" || req.http.user-agent == "PBrowse 1.4b" || req.http.user-agent == "PEval 1.4b" || req.http.user-agent == "Poirot" || req.http.user-agent == "Port Huron Labs" || req.http.user-agent == "Production Bot 0116B" || req.http.user-agent == "Production Bot 2016B" || req.http.user-agent == "Production Bot DOT 3016B" || req.http.user-agent == "Program Shareware 1.0.2" || req.http.user-agent == "PSurf15a 11" || req.http.user-agent == "PSurf15a 51" || req.http.user-agent == "PSurf15a VA" || req.http.user-agent == "psycheclone" || req.http.user-agent == "RSurf15a 41" || req.http.user-agent == "RSurf15a 51" || req.http.user-agent == "RSurf15a 81" || req.http.user-agent == "searchbot admin@google.com" || req.http.user-agent == "ShablastBot 1.0" || req.http.user-agent == "snap.com beta crawler v0" || req.http.user-agent == "Snapbot/1.0" || req.http.user-agent == "sogou develop spider" || req.http.user-agent == "Sogou Orion spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" || req.http.user-agent == "sogou spider" || req.http.user-agent == "Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" || req.http.user-agent == "sohu agent" || req.http.user-agent == "SSurf15a 11" || req.http.user-agent == "TSurf15a 11" || req.http.user-agent == "Under the Rainbow 2.2" || req.http.user-agent == "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" || req.http.user-agent == "VadixBot" || req.http.user-agent == "WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803" || req.http.user-agent == "Wells Search II" || req.http.user-agent == "WEP Search 00" ) { error 403 "You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake."; } }
As a bonus, I added the following to prevent hotlinking from people I don’t know. (code adapted from Turbocharged Blog)
# if images, prevent hotlinking for some, permit for some if ( req.url ~ "wp-content/uploads/(images|200.)" || req.url ~ "wp-content/themes/[a-z-]/images") { if ( req.http.referer ~ "^http(s)?://" ) { if ( !(req.http.referer ~ "^http(s)?://([a-z-]+\.)?(omninoggin|postnerd|twidded|ciciscafe)\.com" )) { error 403 "Hotlinking is forbidden"; } } } lookup; }
Simple isn’t it? Now your Apache process will not have to deal with these spammers at all and save memory / CPU cycles for more important tasks like executing PHP.
Do you have a better bots list? Do you have a better method of doing this? Please share below!
This website uses IntenseDebate comments, but they are not currently loaded because either your browser doesn't support JavaScript, or they didn't load fast enough.