The problem
A website I ran was being aggressively scraped by someone using a particular User-Agent string. It wasn't a normal search engine or similar well-behaved bot (e.g. GoogleBot), so adding something to robots.txt wasn't going to work: robots.txt is only honoured by crawlers that choose to respect it.
This site uses dokku to serve the web app, and dokku uses nginx to route requests to the app.
The solution
You can use dokku's server configuration to add a rule to nginx telling it to return HTTP status 444 when given a particular User-Agent string.
Dokku loads any nginx conf files stored in the /home/dokku/$APPNAME/nginx.conf.d/ directory (replace $APPNAME with the name of your app). These files are loaded into the server block of the app's main nginx configuration.
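Under the hood, the nginx config that dokku generates for the app contains an include directive inside its server block along these lines (paraphrased; the exact form varies between dokku versions, so check the generated /home/dokku/$APPNAME/nginx.conf on your own server):

server {
  # ... directives generated by dokku ...
  include /home/dokku/$APPNAME/nginx.conf.d/*.conf;
}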
On your server, first make sure the directory exists by running:
mkdir -p /home/dokku/$APPNAME/nginx.conf.d/
(again, replace $APPNAME with the name of your app).
Then create the file. I called mine bots.conf, and used nano to edit it:
nano /home/dokku/$APPNAME/nginx.conf.d/bots.conf
Add the following rule to the file, and then save it.
if ($http_user_agent = "bot-name") {
    return 444;
}
444 is an unofficial status code that nginx uses. Instead of returning anything, nginx will simply close the connection without a response.
My example will close the connection if the User-Agent string exactly equals "bot-name". You could also use regular expressions here for more flexible matching (see the nginx documentation on if for more details); a sketch follows.
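For instance, a rule like this (a sketch; "badbot" is a hypothetical name) uses nginx's ~* operator to match the User-Agent case-insensitively against a regular expression:

if ($http_user_agent ~* "badbot") {
    return 444;
}

This would close the connection for any User-Agent containing "badbot", "BadBot" and so on.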
To block more than one bot, simply add multiple rules to the file, one per User-Agent, as in the sketch below.
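Here is what bots.conf might look like with two rules (both bot names are hypothetical):

if ($http_user_agent = "bot-name") {
    return 444;
}

if ($http_user_agent ~* "another-bot") {
    return 444;
}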
Next, make sure that dokku owns the file you've created:
chown dokku:dokku /home/dokku/$APPNAME/nginx.conf.d/bots.conf
Finally, reload nginx so that the rule is applied:
service nginx reload
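To check that the rule is working, you can send a request with the blocked User-Agent; curl's -A flag sets the User-Agent header (replace example.com with your app's domain):

curl -I -A "bot-name" http://example.com/

Since nginx closes the connection without responding, curl should fail with something like "Empty reply from server", while a request with a normal User-Agent should still succeed.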