Block Unwanted Spam Bots Using Varnish VCL

User ImageThaya Kareeson
Popularity: 30%
Updated: Jun 18, 2008


Every time I search the web for information on how to block spam bots, scrapers, and harvesters, I always see an Apache .htaccess file or some code to dump into httpd.conf to achieve this. I’m a bit against using this method for blocking evil bots. I do respect Apache for being a flexible & modular web server (that’s why I still use it), but I do not have much to boast about Apache’s speed and efficiency.

To achieve the same purpose on my server with greater efficiency, I made use of my Varnish reverse proxy configurations (located under /usr/local/etc/varnish/default.vcl).

In this post, I will only be discussing about vcl_recv subroutine, which gets called when a client request is received.

First of all, I have to create a white list. Here for example, I had to make use of the client.ip variable to allow livejournal’s URI::Fetch agent to get my site to lookup my OpenID servers. I also allowed scraper scripts from my own in-house server.

## Called when a client request is received
sub vcl_recv {
  # block list
  # 204.9.177.18 = livejournal.com
  if ( client.ip != "204.9.177.18" && client.ip != "192.168.2.102") {

I then added all the user agent regular expressions that I do not like.

    if (
      req.http.user-agent ~ "^$"
      || req.http.user-agent ~ "^Java"
      || req.http.user-agent ~ "^Jakarta"
      || req.http.user-agent ~ "IDBot"
      || req.http.user-agent ~ "id-search"
      || req.http.user-agent ~ "User-Agent"
      || req.http.user-agent ~ "compatible ;"
      || req.http.user-agent ~ "ConveraCrawler"
      || req.http.user-agent ~ "^Mozilla$"
      || req.http.user-agent ~ "libwww"
      || req.http.user-agent ~ "lwp-trivial"
      || req.http.user-agent ~ "curl"
      || req.http.user-agent ~ "PHP/"
      || req.http.user-agent ~ "urllib"
      || req.http.user-agent ~ "GT:WWW"
      || req.http.user-agent ~ "Snoopy"
      || req.http.user-agent ~ "MFC_Tear_Sample"
      || req.http.user-agent ~ "HTTP::Lite"
      || req.http.user-agent ~ "PHPCrawl"
      || req.http.user-agent ~ "URI::Fetch"
      || req.http.user-agent ~ "Zend_Http_Client"
      || req.http.user-agent ~ "http client"
      || req.http.user-agent ~ "PECL::HTTP"
      || req.http.user-agent ~ "panscient.com"
      || req.http.user-agent ~ "IBM EVV"
      || req.http.user-agent ~ "Bork-edition"
      || req.http.user-agent ~ "Fetch API Request"
      || req.http.user-agent ~ "PleaseCrawl"
      || req.http.user-agent ~ "[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}"
      || req.http.user-agent ~ "layeredtech.com"
      || req.http.user-agent ~ "WEP Search"
      || req.http.user-agent ~ "Wells Search II"
      || req.http.user-agent ~ "Missigua Locator"
      || req.http.user-agent ~ "ISC Systems iRc Search 2.1"
      || req.http.user-agent ~ "Microsoft URL Control"
      || req.http.user-agent ~ "Indy Library"
      || req.http.user-agent == "8484 Boston Project v 1.0"
      || req.http.user-agent == "Atomic_Email_Hunter/4.0"
      || req.http.user-agent == "atSpider/1.0"
      || req.http.user-agent == "autoemailspider"
      || req.http.user-agent == "China Local Browse 2.6"
      || req.http.user-agent == "ContactBot/0.2"
      || req.http.user-agent == "ContentSmartz"
      || req.http.user-agent == "DataCha0s/2.0"
      || req.http.user-agent == "DataCha0s/2.0"
      || req.http.user-agent == "DBrowse 1.4b"
      || req.http.user-agent == "DBrowse 1.4d"
      || req.http.user-agent == "Demo Bot DOT 16b"
      || req.http.user-agent == "Demo Bot Z 16b"
      || req.http.user-agent == "DSurf15a 01"
      || req.http.user-agent == "DSurf15a 71"
      || req.http.user-agent == "DSurf15a 81"
      || req.http.user-agent == "DSurf15a VA"
      || req.http.user-agent == "EBrowse 1.4b"
      || req.http.user-agent == "Educate Search VxB"
      || req.http.user-agent == "EmailSiphon"
      || req.http.user-agent == "EmailWolf 1.00"
      || req.http.user-agent == "ESurf15a 15"
      || req.http.user-agent == "ExtractorPro"
      || req.http.user-agent == "Franklin Locator 1.8"
      || req.http.user-agent == "FSurf15a 01"
      || req.http.user-agent == "Full Web Bot 0416B"
      || req.http.user-agent == "Full Web Bot 0516B"
      || req.http.user-agent == "Full Web Bot 2816B"
      || req.http.user-agent == "Guestbook Auto Submitter"
      || req.http.user-agent == "Industry Program 1.0.x"
      || req.http.user-agent == "ISC Systems iRc Search 2.1"
      || req.http.user-agent == "IUPUI Research Bot v 1.9a"
      || req.http.user-agent == "LARBIN-EXPERIMENTAL (efp@gmx.net)"
      || req.http.user-agent == "LetsCrawl.com/1.0 +http://letscrawl.com/"
      || req.http.user-agent == "Lincoln State Web Browser"
      || req.http.user-agent == "LMQueueBot/0.2"
      || req.http.user-agent == "LWP::Simple/5.803"
      || req.http.user-agent == "Mac Finder 1.0.xx"
      || req.http.user-agent == "MFC Foundation Class Library 4.0"
      || req.http.user-agent == "Microsoft URL Control - 6.00.8xxx"
      || req.http.user-agent == "Missauga Locate 1.0.0"
      || req.http.user-agent == "Missigua Locator 1.9"
      || req.http.user-agent == "Missouri College Browse"
      || req.http.user-agent == "Mizzu Labs 2.2"
      || req.http.user-agent == "Mo College 1.9"
      || req.http.user-agent == "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)"
      || req.http.user-agent == "Mozilla/3.0 (compatible; Indy Library)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; Advanced Email Extractor v2.xx)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; Iplexx Spider/1.0 http://www.iplexx.at)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent"
      || req.http.user-agent == "Mozilla/4.0 efp@gmx.net"
      || req.http.user-agent == "Mozilla/5.0 (Version: xxxx Type:xx)"
      || req.http.user-agent == "MVAClient"
      || req.http.user-agent == "NameOfAgent (CMS Spider)"
      || req.http.user-agent == "NASA Search 1.0"
      || req.http.user-agent == "Nsauditor/1.x"
      || req.http.user-agent == "PBrowse 1.4b"
      || req.http.user-agent == "PEval 1.4b"
      || req.http.user-agent == "Poirot"
      || req.http.user-agent == "Port Huron Labs"
      || req.http.user-agent == "Production Bot 0116B"
      || req.http.user-agent == "Production Bot 2016B"
      || req.http.user-agent == "Production Bot DOT 3016B"
      || req.http.user-agent == "Program Shareware 1.0.2"
      || req.http.user-agent == "PSurf15a 11"
      || req.http.user-agent == "PSurf15a 51"
      || req.http.user-agent == "PSurf15a VA"
      || req.http.user-agent == "psycheclone"
      || req.http.user-agent == "RSurf15a 41"
      || req.http.user-agent == "RSurf15a 51"
      || req.http.user-agent == "RSurf15a 81"
      || req.http.user-agent == "searchbot admin@google.com"
      || req.http.user-agent == "ShablastBot 1.0"
      || req.http.user-agent == "snap.com beta crawler v0"
      || req.http.user-agent == "Snapbot/1.0"
      || req.http.user-agent == "sogou develop spider"
      || req.http.user-agent == "Sogou Orion spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
      || req.http.user-agent == "sogou spider"
      || req.http.user-agent == "Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
      || req.http.user-agent == "sohu agent"
      || req.http.user-agent == "SSurf15a 11"
      || req.http.user-agent == "TSurf15a 11"
      || req.http.user-agent == "Under the Rainbow 2.2"
      || req.http.user-agent == "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
      || req.http.user-agent == "VadixBot"
      || req.http.user-agent == "WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803"
      || req.http.user-agent == "Wells Search II"
      || req.http.user-agent == "WEP Search 00"
    ) {
      error 403 "You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.";
    }
  }

As a bonus, I added the following to prevent hotlinking from people I don’t know. (code adapted from Turbocharged Blog)

  # if images, prevent hotlinking for some, permit for some
  if ( req.url ~ "wp-content/uploads/(images|200.)" || req.url ~ "wp-content/themes/[a-z-]/images") {
    if ( req.http.referer ~ "^http(s)?://" ) {
      if ( !(req.http.referer ~ "^http(s)?://([a-z-]+\.)?(omninoggin|postnerd|twidded|ciciscafe)\.com" )) {
        error 403 "Hotlinking is forbidden";
      }
    }
  }
  lookup;
}

Simple isn’t it? Now your Apache process will not have to deal with these spammers at all and save memory / CPU cycles for more important tasks like executing PHP.

Do you have a better bots list? Do you have a better method of doing this? Please share below!

Rate this:
3.5

2 Responses to “Block Unwanted Spam Bots Using Varnish VCL”

  1. no imageJTPratt's Blogging Mistakes (Check me out!) (4 comments)

    You always see .htaccess fixes because 99% of blog owners are on a shared web host…

    =)

    [Reply]

  2. no imageThaya Kareeson (Check me out!) (89 comments)

    That does make sense :). Doh! This post only targeted 1% of my audience.

    [Reply]


Leave a Reply