
Block Unwanted Spam Bots Using Varnish VCL

Thaya Kareeson


Every time I search the web for information on how to block spam bots, scrapers, and harvesters, I see the same advice: an Apache .htaccess file or some code to dump into httpd.conf. I’m a bit against using this method for blocking evil bots. I respect Apache for being a flexible, modular web server (that’s why I still use it), but speed and efficiency are not its strong suits.

To achieve the same purpose on my server with greater efficiency, I used my Varnish reverse proxy configuration (located at /usr/local/etc/varnish/default.vcl).

In this post, I will only be discussing the vcl_recv subroutine, which is called when a client request is received.
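For context, vcl_recv is just one of the subroutines that live in default.vcl. Here is a minimal sketch of the surrounding file so you can see where the code below fits; the backend address and port are placeholders for your own Apache instance, not my actual setup:

backend default {
  .host = "127.0.0.1"; # placeholder: where your Apache instance listens
  .port = "8080";      # placeholder: Apache's listen port
}

sub vcl_recv {
  # the bot-blocking logic from this post goes here
}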

First of all, I have to create a whitelist, which I express as a VCL ACL matched against the client.ip variable. Here, for example, I allow LiveJournal’s URI::Fetch agent to fetch my site when it looks up my OpenID server, and I also allow scraper scripts from my own in-house server.

## Whitelist of clients allowed to bypass the user-agent checks
acl whitelist {
  "204.9.177.18";  # livejournal.com (URI::Fetch for OpenID lookups)
  "192.168.2.102"; # my in-house scraper
}

## Called when a client request is received
sub vcl_recv {
  # apply the user-agent block list only to non-whitelisted clients
  if (client.ip !~ whitelist) {
I then added regular expressions matching all the user agents I do not want to serve.

    if (
      req.http.user-agent ~ "^$"
      || req.http.user-agent ~ "^Java"
      || req.http.user-agent ~ "^Jakarta"
      || req.http.user-agent ~ "IDBot"
      || req.http.user-agent ~ "id-search"
      || req.http.user-agent ~ "User-Agent"
      || req.http.user-agent ~ "compatible ;"
      || req.http.user-agent ~ "ConveraCrawler"
      || req.http.user-agent ~ "^Mozilla$"
      || req.http.user-agent ~ "libwww"
      || req.http.user-agent ~ "lwp-trivial"
      || req.http.user-agent ~ "curl"
      || req.http.user-agent ~ "PHP/"
      || req.http.user-agent ~ "urllib"
      || req.http.user-agent ~ "GT:WWW"
      || req.http.user-agent ~ "Snoopy"
      || req.http.user-agent ~ "MFC_Tear_Sample"
      || req.http.user-agent ~ "HTTP::Lite"
      || req.http.user-agent ~ "PHPCrawl"
      || req.http.user-agent ~ "URI::Fetch"
      || req.http.user-agent ~ "Zend_Http_Client"
      || req.http.user-agent ~ "http client"
      || req.http.user-agent ~ "PECL::HTTP"
      || req.http.user-agent ~ "panscient.com"
      || req.http.user-agent ~ "IBM EVV"
      || req.http.user-agent ~ "Bork-edition"
      || req.http.user-agent ~ "Fetch API Request"
      || req.http.user-agent ~ "PleaseCrawl"
      || req.http.user-agent ~ "[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}"
      || req.http.user-agent ~ "layeredtech.com"
      || req.http.user-agent ~ "WEP Search"
      || req.http.user-agent ~ "Wells Search II"
      || req.http.user-agent ~ "Missigua Locator"
      || req.http.user-agent ~ "ISC Systems iRc Search 2.1"
      || req.http.user-agent ~ "Microsoft URL Control"
      || req.http.user-agent ~ "Indy Library"
      || req.http.user-agent == "8484 Boston Project v 1.0"
      || req.http.user-agent == "Atomic_Email_Hunter/4.0"
      || req.http.user-agent == "atSpider/1.0"
      || req.http.user-agent == "autoemailspider"
      || req.http.user-agent == "China Local Browse 2.6"
      || req.http.user-agent == "ContactBot/0.2"
      || req.http.user-agent == "ContentSmartz"
      || req.http.user-agent == "DataCha0s/2.0"
      || req.http.user-agent == "DataCha0s/2.0"
      || req.http.user-agent == "DBrowse 1.4b"
      || req.http.user-agent == "DBrowse 1.4d"
      || req.http.user-agent == "Demo Bot DOT 16b"
      || req.http.user-agent == "Demo Bot Z 16b"
      || req.http.user-agent == "DSurf15a 01"
      || req.http.user-agent == "DSurf15a 71"
      || req.http.user-agent == "DSurf15a 81"
      || req.http.user-agent == "DSurf15a VA"
      || req.http.user-agent == "EBrowse 1.4b"
      || req.http.user-agent == "Educate Search VxB"
      || req.http.user-agent == "EmailSiphon"
      || req.http.user-agent == "EmailWolf 1.00"
      || req.http.user-agent == "ESurf15a 15"
      || req.http.user-agent == "ExtractorPro"
      || req.http.user-agent == "Franklin Locator 1.8"
      || req.http.user-agent == "FSurf15a 01"
      || req.http.user-agent == "Full Web Bot 0416B"
      || req.http.user-agent == "Full Web Bot 0516B"
      || req.http.user-agent == "Full Web Bot 2816B"
      || req.http.user-agent == "Guestbook Auto Submitter"
      || req.http.user-agent == "Industry Program 1.0.x"
      || req.http.user-agent == "ISC Systems iRc Search 2.1"
      || req.http.user-agent == "IUPUI Research Bot v 1.9a"
      || req.http.user-agent == "LARBIN-EXPERIMENTAL (efp@gmx.net)"
      || req.http.user-agent == "LetsCrawl.com/1.0 +http://letscrawl.com/"
      || req.http.user-agent == "Lincoln State Web Browser"
      || req.http.user-agent == "LMQueueBot/0.2"
      || req.http.user-agent == "LWP::Simple/5.803"
      || req.http.user-agent == "Mac Finder 1.0.xx"
      || req.http.user-agent == "MFC Foundation Class Library 4.0"
      || req.http.user-agent == "Microsoft URL Control - 6.00.8xxx"
      || req.http.user-agent == "Missauga Locate 1.0.0"
      || req.http.user-agent == "Missigua Locator 1.9"
      || req.http.user-agent == "Missouri College Browse"
      || req.http.user-agent == "Mizzu Labs 2.2"
      || req.http.user-agent == "Mo College 1.9"
      || req.http.user-agent == "Mozilla/2.0 (compatible; NEWT ActiveX; Win32)"
      || req.http.user-agent == "Mozilla/3.0 (compatible; Indy Library)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; Advanced Email Extractor v2.xx)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; Iplexx Spider/1.0 http://www.iplexx.at)"
      || req.http.user-agent == "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent"
      || req.http.user-agent == "Mozilla/4.0 efp@gmx.net"
      || req.http.user-agent == "Mozilla/5.0 (Version: xxxx Type:xx)"
      || req.http.user-agent == "MVAClient"
      || req.http.user-agent == "NameOfAgent (CMS Spider)"
      || req.http.user-agent == "NASA Search 1.0"
      || req.http.user-agent == "Nsauditor/1.x"
      || req.http.user-agent == "PBrowse 1.4b"
      || req.http.user-agent == "PEval 1.4b"
      || req.http.user-agent == "Poirot"
      || req.http.user-agent == "Port Huron Labs"
      || req.http.user-agent == "Production Bot 0116B"
      || req.http.user-agent == "Production Bot 2016B"
      || req.http.user-agent == "Production Bot DOT 3016B"
      || req.http.user-agent == "Program Shareware 1.0.2"
      || req.http.user-agent == "PSurf15a 11"
      || req.http.user-agent == "PSurf15a 51"
      || req.http.user-agent == "PSurf15a VA"
      || req.http.user-agent == "psycheclone"
      || req.http.user-agent == "RSurf15a 41"
      || req.http.user-agent == "RSurf15a 51"
      || req.http.user-agent == "RSurf15a 81"
      || req.http.user-agent == "searchbot admin@google.com"
      || req.http.user-agent == "ShablastBot 1.0"
      || req.http.user-agent == "snap.com beta crawler v0"
      || req.http.user-agent == "Snapbot/1.0"
      || req.http.user-agent == "sogou develop spider"
      || req.http.user-agent == "Sogou Orion spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
      || req.http.user-agent == "sogou spider"
      || req.http.user-agent == "Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
      || req.http.user-agent == "sohu agent"
      || req.http.user-agent == "SSurf15a 11"
      || req.http.user-agent == "TSurf15a 11"
      || req.http.user-agent == "Under the Rainbow 2.2"
      || req.http.user-agent == "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
      || req.http.user-agent == "VadixBot"
      || req.http.user-agent == "WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803"
      || req.http.user-agent == "Wells Search II"
      || req.http.user-agent == "WEP Search 00"
    ) {
      error 403 "You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.";
    }
  }

As a bonus, I added the following to prevent hotlinking of my images by sites I don’t know (code adapted from Turbocharged Blog).

  # for images, prevent hotlinking for unknown referers, permit my own sites
  if ( req.url ~ "wp-content/uploads/(images|200.)" || req.url ~ "wp-content/themes/[a-z-]+/images") {
    # only check requests that actually carry a referer header
    if ( req.http.referer ~ "^http(s)?://" ) {
      if ( !(req.http.referer ~ "^http(s)?://([a-z-]+\.)?(omninoggin|postnerd|twidded|ciciscafe)\.com" )) {
        error 403 "Hotlinking is forbidden";
      }
    }
  }
  # continue with a normal cache lookup
  return (lookup);
}

Simple, isn’t it? Now your Apache processes never have to deal with these spammers at all, saving memory and CPU cycles for more important tasks like executing PHP.
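If you want to verify that the rules took effect after reloading Varnish, a quick check is to request any page while sending one of the blocked strings in the User-Agent header (curl’s -A option does exactly this) and confirm that you get the 403 message back instead of the page, served by Varnish without ever touching Apache.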

Do you have a better bots list? Do you have a better method of doing this? Please share below!


12 Responses to “Block Unwanted Spam Bots Using Varnish VCL”

  1. JTPratt’s Blogging Mistakes

    You always see .htaccess fixes because 99% of blog owners are on a shared web host…

    =)

  2. Thaya Kareeson

    That does make sense :). Doh! This post only targeted 1% of my audience.

  3. Phil Cryer

    This is great. I've always used mod_rewrite rules to redirect hotlinkers and a long block in httpd.conf to shut out bad bots, but moving both upstream to Varnish is a great idea! Thanks for the effort.

  4. Thaya Kareeson

    @Phil Cryer
    Glad that it helped you out!

  5. Albert

    Varnish is really great – thanks for the idea. I found your post while searching for a method to combine similar user agents like Firefox and Iceweasel to increase cache hits, because I use Vary: User-Agent.

    And Apache is awesome too, but the days when I would put it on the front line are long gone!

  6. Mack

    This might work for a while until "harvesters" find a new way.

  7. erik

    Counterproductive. This pushes harvester/spambots towards using an IE/Firefox user agent string and getting stealthy. Which we do not want.

  8. Thaya Kareeson

    Indeed. See http://omninoggin.com/wordpress-plugins/project-h… for a better solution.

  9. Felipe

    I am having a problem where the iPhone user agent should redirect the browser to a different site. What happens is that if the user goes to the site root, like http://site.com/, it does not redirect, but if the user goes to http://site.com/default.asp it works correctly. Any ideas what can be done?

  10. Thaya Kareeson

    I'm not sure why that would be the case. I haven't touched Varnish VCL in so long that I wouldn't be of much help.

  11. stef

    Scrapers clone the Googlebot user agent string, and have for years.

  12. tkp

    Hey, thanks for the tips and tricks! I've added user agents such as "Havij" to help prevent SQL injection, XSS, and the like.

