From 1be1a7d451f7aaa6622e2c71a006f65e17d59959 Mon Sep 17 00:00:00 2001
From: Michael Fabian 'Xaymar' Dirks
Date: Sat, 7 Sep 2024 13:45:37 +0200
Subject: [PATCH] Update robots and fix typo

---
 ...9-07-the-end-of-the-web-as-we-know-it.html |  2 +-
 robots.txt                                    | 39 +++++++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/_posts/2024/2024-09-07-the-end-of-the-web-as-we-know-it.html b/_posts/2024/2024-09-07-the-end-of-the-web-as-we-know-it.html
index d1161a1..872d4f7 100644
--- a/_posts/2024/2024-09-07-the-end-of-the-web-as-we-know-it.html
+++ b/_posts/2024/2024-09-07-the-end-of-the-web-as-we-know-it.html
@@ -6,7 +6,7 @@ tags: [ "AI", "Machine Learning", "Crawler", "robots.txt", ]

Today's gonna be a bit of a ranter. The whole AI-Crawler situation has gone from bad to awful, potentially ending the web as we know it. AI companies like OpenAI (ChatGPT) and Anthropic (Claude) have gotten so used to stealing that they're no longer afraid to cause extreme costs to others for their own gain. Many of them now employ crawlers that include methods to bypass limits and filters. It needs to stop.

-I've personally been hit by OpenAI weeks ago, which managed to generate 49 TB of traffic costing me about 13€ for that day, and later on Anthropic tried to do the same based on the access logs. And it seems I'm not the only one, with Uberspace being hit even worse. rileyb3d appears to have been harassed in a similar way and was forced to take down their own website entirely. And you know it's gotten really bad the CloudFlare out of all things makes their AI-Crawler protection tool available for Free users. They only make things available for Free users that are widespread, so even CloudFlare has had enough now.

+I've personally been hit by OpenAI weeks ago, which managed to generate 49 TB of traffic costing me about 13€ for that day, and later on Anthropic tried to do the same based on the access logs. And it seems I'm not the only one, with Uberspace being hit even worse. rileyb3d appears to have been harassed in a similar way and was forced to take down their own website entirely. And you know it's gotten really bad when CloudFlare out of all things makes their AI-Crawler protection tool available for Free users. They only make things available for Free users that are widespread, so even CloudFlare has had enough now.

This situation has completely spiraled out of control for everyone and I don't see much future in the free web anymore if it continues. AI Companies no longer care about copyright, licensing or similar, and it's only going to get worse until governments wake up. Any work you published is being used to train AI models, no matter if your license allows for it or requires payment. None of them care, and lawsuits are piling up.

diff --git a/robots.txt b/robots.txt
index e80b111..3567238 100644
--- a/robots.txt
+++ b/robots.txt
@@ -5,3 +5,42 @@ Disallow: /feed.xml
 Disallow: /restricted/
 Disallow: /404.html
 Disallow: /redirects.json
+
+User-agent: AI2Bot
+User-agent: Ai2Bot-Dolma
+User-agent: Amazonbot
+User-agent: Applebot
+User-agent: Applebot-Extended
+User-agent: Bytespider
+User-agent: CCBot
+User-agent: ChatGPT-User
+User-agent: Claude-Web
+User-agent: ClaudeBot
+User-agent: Diffbot
+User-agent: FacebookBot
+User-agent: FriendlyCrawler
+User-agent: GPTBot
+User-agent: Google-Extended
+User-agent: GoogleOther
+User-agent: GoogleOther-Image
+User-agent: GoogleOther-Video
+User-agent: iaskspider/2.0
+User-agent: ICC-Crawler
+User-agent: ImagesiftBot
+User-agent: Meta-ExternalAgent
+User-agent: Meta-ExternalFetcher
+User-agent: OAI-SearchBot
+User-agent: PerplexityBot
+User-agent: PetalBot
+User-agent: Scrapy
+User-agent: Timpibot
+User-agent: VelenPublicWebCrawler
+User-agent: Webzio-Extended
+User-agent: YouBot
+User-agent: anthropic-ai
+User-agent: cohere-ai
+User-agent: facebookexternalhit
+User-agent: img2dataset
+User-agent: omgili
+User-agent: omgilibot
+Disallow: /
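
Reviewer note: the new robots.txt group lists many `User-agent` lines followed by a single `Disallow: /`, so every listed agent shares that one rule. A minimal sketch for sanity-checking this with Python's standard `urllib.robotparser` follows. The `ROBOTS_TXT` excerpt (three agents only) and the `is_blocked` helper are assumptions made for illustration and are not part of this patch. It only shows what a well-behaved crawler would conclude; as the post itself notes, some crawlers bypass such limits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of the group added by this patch (abbreviated to
# three agents for illustration; the real file lists many more).
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
"""


def is_blocked(agent: str, url: str = "/") -> bool:
    """Return True if robots.txt forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "SomeBrowser"):
        print(agent, "blocked" if is_blocked(agent) else "allowed")
```

Agents that are not named in the group, and for which no `User-agent: *` rule exists in this excerpt, are treated as allowed by the parser.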