Wow, so to prevent AI scrapers from harvesting my data, I need to send all of my traffic through a third-party company that gets to decide who can view my content. Great idea!
Yes, they could roll their own, but you have no issue with this being necessary? I think the attitude of "just deal with it" is far more negative than someone expressing that they're upset with the state of the internet, its controllers, and its abusers.
This is like saying "let's just get rid of all the guns" to solve gun violence and gun crime in the USA. The cat is out of the bag and no one can put it back. We live in a different world now, and we have to figure it out.
> Must everything in AI threads be so negative and condescending?
Because if I own a website or a service and it is being degraded or slowed by some third-party tool that wants to slurp its content for its own profit without sharing anything back, I tend to be irritated. And AI apologists/evangelists don't help when they try to justify the behavior.
I use iocaine[0] to generate a tarpit. Yesterday it served ~278k "pages" consisting of ~500MB of gibberish (and that's despite banning most AI scrapers in robots.txt).
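For anyone curious what "banning in robots.txt" looks like, it's just a list of user agents, something like this (a partial list; the whole reason the tarpit exists is that plenty of scrapers ignore it anyway):

    # robots.txt (excerpt) -- some commonly published AI crawler agents
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /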
It still fails with all of my extensions disabled (Wipr, Privacy Redirect). I just get a download dialog, and I don't know what the HTTP status code is.
I found a flagged HN submission about it, and it has just about the same result for me and for others. My first tap failed in a weird way (it showed some text, then quickly redirected to its git repo), and all subsequent taps trigger a download.
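For what it's worth, curl can answer the status-code question from the command line (example.org here is a stand-in for the actual site):

    # print just the HTTP status code, discarding the body; add -L to follow redirects
    curl -s -o /dev/null -w '%{http_code}\n' https://example.org/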
Unfortunately, you kind of have to count this as the cost of the Internet. You've wasted 500MB of bandwidth.
I've had colocation for eight-plus years. My monthly bandwidth is now around 20-30GB a month given over to scrapers, where I was only using 1-2GB a month in years prior.
I pay for premium bandwidth (it's a thing) and only get 2TB of usable transfer. Do I go offline, or let it continue?
I have no idea what this does, because the site is rejecting my ordinary Firefox browser with "Error code: 418 I'm a teapot", even from a private window.
If I hit it with Chrome, I can see the site.
Seems pretty far from ready for prime time, as a lot of my viewers use Firefox.
Anubis is the only tool that claims to have heuristics to identify a bot, but my understanding is that it does this by presenting obnoxious challenges to all users, which is not really feasible. Old-school approaches like IP blocking or even ASN blocking are obsolete: these crawlers deliberately spam from thousands of IPs, and if you block them at a common ASN, they come back a few days later from thousands of unique ASNs. So this is not really a "roll your own" situation, especially if you are running off-the-shelf software that has no straightforward way to build in these various approaches, such as endless page mazes (which I would still have to serve anyway).
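To make the "obsolete" point concrete: old-school blocking is essentially a static deny list, roughly like this nginx sketch (the CIDR ranges are RFC 5737 documentation placeholders, not real scraper networks). A crawler that rotates through thousands of IPs and ASNs walks right past it, and the list needs constant hand-maintenance:

    # /etc/nginx/conf.d/block-scrapers.conf
    # Old-school IP blocking: a hand-maintained deny list.
    geo $is_scraper {
        default         0;
        203.0.113.0/24  1;  # hypothetical "known scraper" range
        198.51.100.0/24 1;  # another hypothetical range
    }

    server {
        listen 80;
        server_name example.org;  # placeholder

        if ($is_scraper) {
            return 403;  # blocked today, back tomorrow from fresh IPs
        }

        root /var/www/html;
    }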
Unfortunately, Cloudflare often destroys the experience for users with shared connections, VPNs, exotic browsers… I had to remove it from my site after too many complaints.
Cloudflare works fine with Private Relay - they and Fastly provide infrastructure for that service (one half of the blinded pair), so it's definitely something they test.
Savvy move by Cloudflare: once they have enough sites behind their service, they can charge the AI companies for back-channel access to their cached copies.