Matthias Wünsch - Blog

Abstract

AI web scrapers place multiple new burdens on server admins that can, at least to some extent, be eased using the novel firewall utility Anubis. In its documentation, Anubis notes that it is not a silver bullet, but it can effectively mitigate some of the scraper-imposed burdens without requiring much configuration or the involvement of a third-party Web Application Firewall (WAF).

The Problem with AI Scrapers

The rise of large language models and AI systems has led to an explosion in web scraping activity. Unlike traditional web scrapers that respect robots.txt, many AI training systems aggressively scrape content without regard for server load or copyright concerns. This creates several challenges for website operators:

  • Increased bandwidth costs: AI scrapers consume significant bandwidth, leading to higher hosting costs
  • Server load: Multiple concurrent scraping requests can degrade site performance for legitimate users
  • Copyright concerns: Content trained into AI models without permission raises legal and ethical questions
  • SEO impact: Scraped content republished elsewhere can show up as duplicate content in search indexes and hurt rankings

Introduction to Anubis

Anubis is a firewall utility designed to detect and block AI scraper requests. It works by analyzing request patterns and identifying characteristics common to AI training systems. The tool integrates directly with your firewall and requires minimal configuration.

Key features of Anubis:

  • Lightweight and low-overhead filtering
  • Pattern-based detection of known AI scrapers
  • Configurable blocking rules
  • Compatible with standard Linux firewall tools
  • No third-party WAF or cloud service required

Installation on NixOS

NixOS makes installing Anubis straightforward. Add it to your system packages:

environment.systemPackages = with pkgs; [
  anubis
];

Then rebuild your system:

sudo nixos-rebuild switch
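
To confirm the rebuild actually put the tool on your path (assuming the package installs a binary named anubis), a quick check is:

# Should print the path of the anubis binary
command -v anubis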

Configuration

Anubis can be configured through /etc/anubis/config.yaml or through environment variables. A basic configuration might look like:

firewall:
  enabled: true
  block_openai: true
  block_google: true
  block_anthropic: true
  custom_patterns: []

For NixOS, you can define this in your configuration:

services.anubis = {
  enable = true;
  settings = {
    firewall.enabled = true;
    firewall.block_openai = true;
    firewall.block_google = true;
  };
};
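
After the next rebuild you can inspect what the module actually rendered (assuming it writes the settings to the path mentioned above):

# Show the generated Anubis configuration
cat /etc/anubis/config.yaml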

How Anubis Detects Scrapers

Anubis uses multiple detection methods:

  1. User-Agent Analysis: Identifies known AI scraper user agents (a manual illustration follows this list)
  2. Request Patterns: Detects unusual request frequencies and sequences
  3. Behavioral Analysis: Identifies scraping patterns (rapid page fetches, resource downloading)
  4. IP Reputation: Uses known scraper IP ranges
  5. Custom Rules: Allows you to define your own blocking criteria
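
To get a feel for what the first method sees, you can run a similar check by hand against your own access logs. This is only a rough illustration, assuming nginx with the default combined log format and log path:

# Count requests per user agent, most frequent first
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

Known scraper agents such as GPTBot or CCBot showing up near the top of that list is a good indication that a tool like Anubis is worth deploying.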

Limitations

While Anubis is effective, it has limitations:

  • Not foolproof: Determined scrapers can spoof user agents and requests
  • False positives: Legitimate automated tools may be blocked
  • Maintenance required: New scraper patterns require updates
  • Performance impact: Additional filtering adds slight latency
  • Incomplete coverage: Some lesser-known scrapers may not be detected

Complementary Approaches

Anubis works best as part of a defense-in-depth strategy. Consider combining it with:

robots.txt Configuration

Create a comprehensive robots.txt that explicitly blocks known AI scrapers:

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
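
The list of published crawler tokens changes regularly, so check each vendor's documentation; other commonly seen tokens you may want to cover include, for example:

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

Keep in mind that robots.txt is purely advisory; scrapers that ignore it are exactly what Anubis is there for.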

Content Protection Headers

Add an X-Robots-Tag header to signal that content should not be indexed or reused. Keep in mind that noindex and nofollow also apply to regular search engines, so only use them on content you are willing to drop from search results:

add_header X-Robots-Tag "noimageindex, noodp, noydir, noindex, nofollow";
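
You can verify that the header is actually being sent with a quick request against your own site:

# -I sends a HEAD request and prints the response headers
curl -sI https://your-site.com | grep -i x-robots-tag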

Rate Limiting

Implement rate limiting on your web server to restrict request frequency:

# In the http {} context:
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;

# Inside the server {} block:
location / {
    limit_req zone=general burst=20;
}

location /api/ {
    limit_req zone=api burst=10;
}
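
After validating and reloading nginx, a simple way to see the limit in action is to fire a burst of requests and watch the status codes; with the settings above, requests beyond the burst allowance should return 503, nginx's default limit_req_status. If the round trip to your server is slow, a sequential loop may stay under 10 r/s, so run a few copies in parallel in that case:

# Check the configuration and reload
sudo nginx -t && sudo systemctl reload nginx

# 30 rapid requests; the tail end should start showing 503s
for i in $(seq 1 30); do curl -s -o /dev/null -w "%{http_code}\n" https://your-site.com/; done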

Monitoring and Logging

Enable detailed logging to track blocking events:

journalctl -u anubis -f

Testing Your Configuration

After setting up Anubis, test that it’s working:

# Check service status
systemctl status anubis
# View recent logs
journalctl -u anubis -n 50
# Test a known scraper user agent
curl -H "User-Agent: GPTBot" https://your-site.com

If configured correctly, the curl command with a known scraper user agent should be blocked.
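
To see only the HTTP status code that a spoofed scraper user agent receives (what exactly a blocked request looks like, e.g. a 403 versus a challenge page, depends on your configuration):

# Print just the response status code
curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: GPTBot" https://your-site.com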

Performance Considerations

Monitor the performance impact of Anubis:

# Check CPU and memory usage
top -p $(systemctl show -p MainPID --value anubis)
# Count currently established connections
ss -an | grep ESTAB | wc -l

Most users report minimal performance impact (< 1% CPU overhead) when Anubis is properly configured.
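
To put a number on the added latency for your own setup, compare response timings with Anubis enabled and disabled; curl's timing variables are enough for a rough measurement:

# Rough end-to-end latency of a single request
curl -s -o /dev/null -w "time_total: %{time_total}s\n" https://your-site.com/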

Conclusion

Anubis provides an effective, lightweight solution for protecting your site from AI scrapers without requiring complex cloud-based WAF services. When combined with robots.txt configuration, rate limiting, and proper monitoring, it forms a solid defense against automated content extraction.

For more information and updates, check the Anubis documentation and community resources.

Further Reading