Set Up Anubis on NixOS
Abstract
AI web scrapers place multiple new burdens on server admins that can, at least to an extent, be alleviated using the firewall utility Anubis. In its documentation, Anubis notes that it is not a silver-bullet solution, but it can effectively mitigate some of the scraper-imposed burdens without requiring a lot of configuration or the involvement of a third-party Web Application Firewall (WAF).
The Problem with AI Scrapers
The rise of large language models and AI systems has led to an explosion in web scraping activity. Unlike traditional web scrapers that respect robots.txt, many AI training systems aggressively scrape content without regard for server load or copyright concerns. This creates several challenges for website operators:
- Increased bandwidth costs: AI scrapers consume significant bandwidth, leading to higher hosting costs
- Server load: Multiple concurrent scraping requests can degrade site performance for legitimate users
- Copyright concerns: Content trained into AI models without permission raises legal and ethical questions
- SEO impact: Duplicate content indexed by search engines can affect rankings
Introduction to Anubis
Anubis is a firewall utility designed to detect and block AI scraper requests. It works by analyzing request patterns and identifying characteristics common to AI training systems. The tool integrates directly with your firewall and requires minimal configuration.
Key features of Anubis:
- Lightweight and low-overhead filtering
- Pattern-based detection of known AI scrapers
- Configurable blocking rules
- Compatible with standard Linux firewall tools
- No third-party WAF or cloud service required
Installation on NixOS
NixOS makes installing Anubis straightforward. Add it to your system packages:
```nix
environment.systemPackages = with pkgs; [ anubis ];
```

Then rebuild your system:
```bash
sudo nixos-rebuild switch
```

Configuration
Anubis can be configured through /etc/anubis/config.yaml or through environment variables. A basic configuration might look like:
```yaml
firewall:
  enabled: true
  block_openai: true
  block_google: true
  block_anthropic: true
  custom_patterns: []
```

For NixOS, you can define this in your configuration:
```nix
services.anubis = {
  enable = true;
  settings = {
    firewall.enabled = true;
    firewall.block_openai = true;
    firewall.block_google = true;
  };
};
```

How Anubis Detects Scrapers
Anubis uses multiple detection methods:
- User-Agent Analysis: Identifies known AI scraper user agents
- Request Patterns: Detects unusual request frequencies and sequences
- Behavioral Analysis: Identifies scraping patterns (rapid page fetches, resource downloading)
- IP Reputation: Uses known scraper IP ranges
- Custom Rules: Allows you to define your own blocking criteria (see the sketch after this list)
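For example, assuming the `custom_patterns` field shown in the earlier configuration accepts user-agent regular expressions (an assumption; check the Anubis documentation for the exact schema), a custom rule in your NixOS configuration might look like this sketch:

```nix
services.anubis = {
  enable = true;
  settings = {
    firewall.enabled = true;
    # Illustrative patterns only; confirm the expected format upstream.
    firewall.custom_patterns = [
      "MyCorp-Crawler/.*"    # block a hypothetical in-house crawler by user agent
      "python-requests/.*"   # block generic scripted HTTP clients
    ];
  };
};
```

Patterns like these would be applied on top of the built-in scraper signatures.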
Limitations
While Anubis is effective, it has limitations:
- Not foolproof: Determined scrapers can spoof user agents and requests
- False positives: Legitimate automated tools may be blocked
- Maintenance required: New scraper patterns require updates
- Performance impact: Additional filtering adds slight latency
- Incomplete coverage: Some lesser-known scrapers may not be detected
Complementary Approaches
Anubis works best as part of a defense-in-depth strategy. Consider combining it with:
robots.txt Configuration
Create a comprehensive robots.txt that explicitly blocks known AI scrapers.
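On NixOS you can keep this file under configuration management and serve it straight from nginx. A minimal sketch, assuming nginx is managed through services.nginx and `example.com` stands in for your virtual host:

```nix
# A local ./robots.txt file (contents shown below) is assumed to live next to configuration.nix.
services.nginx.virtualHosts."example.com".locations."= /robots.txt".alias = ./robots.txt;
```

The file itself should explicitly deny each known AI crawler, for example: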
```
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```

Content Protection Headers
Add headers that indicate content should not be used for training.
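On NixOS this can be wired into the nginx module rather than a hand-edited config file. A sketch, with `example.com` again standing in for your virtual host:

```nix
services.nginx.virtualHosts."example.com".extraConfig = ''
  add_header X-Robots-Tag "noimageindex, noodp, noydir, noindex, nofollow";
'';
```

The equivalent raw nginx directive is: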
```nginx
add_header X-Robots-Tag "noimageindex, noodp, noydir, noindex, nofollow";
```

Rate Limiting
Implement rate limiting on your web server to restrict request frequency.
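On NixOS, the same limits can be declared through the nginx module. A sketch, assuming services.nginx serves your site and `example.com` is a placeholder; note that limit_req_zone has to live at the http level, which is what appendHttpConfig provides:

```nix
services.nginx = {
  enable = true;
  # Zones are defined once at the http{} level.
  appendHttpConfig = ''
    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
    limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;
  '';
  virtualHosts."example.com" = {
    locations."/".extraConfig = ''
      limit_req zone=general burst=20;
    '';
    locations."/api/".extraConfig = ''
      limit_req zone=api burst=10;
    '';
  };
};
```

The raw nginx directives, for reference: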
```nginx
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;

location / {
    limit_req zone=general burst=20;
}

location /api/ {
    limit_req zone=api burst=10;
}
```

Monitoring and Logging
Enable detailed logging to track blocking events:
```bash
journalctl -u anubis -f
```

Testing Your Configuration
After setting up Anubis, test that it’s working:
```bash
# Check service status
systemctl status anubis

# View recent logs
journalctl -u anubis -n 50

# Test a known scraper user agent
curl -H "User-Agent: GPTBot" https://your-site.com
```

If configured correctly, the curl command with a known scraper user agent should be blocked.
Performance Considerations
Monitor the performance impact of Anubis:
```bash
# Check CPU and memory usage
top -p $(systemctl show -p MainPID --value anubis)

# Monitor network statistics
ss -an | grep ESTAB | wc -l
```

Most users report minimal performance impact (< 1% CPU overhead) when Anubis is properly configured.
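If you want a hard upper bound anyway, NixOS lets you attach systemd resource limits to the service. A sketch, assuming the module creates a unit named `anubis` (verify the actual name with `systemctl list-units`):

```nix
systemd.services.anubis.serviceConfig = {
  CPUQuota = "50%";    # cap the service at half a CPU core
  MemoryMax = "256M";  # hard memory ceiling enforced by systemd
};
```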
Conclusion
Anubis provides an effective, lightweight solution for protecting your site from AI scrapers without requiring complex cloud-based WAF services. When combined with robots.txt configuration, rate limiting, and proper monitoring, it forms a solid defense against automated content extraction.
For more information and updates, check the Anubis documentation and community resources.