Hacker News

Cloudflare has a service for this now that detects AI scrapers and sends them to a tarpit of infinite AI-generated nonsense pages.


Wow, so to prevent AI scrapers from harvesting my data I need to send all of my traffic through a third party company that gets to decide who gets to view my content. Great idea!


You don’t need to do anything. You can use any number of solutions or roll your own.

Someone shared an alternative. Must everything in AI threads be so negative and condescending?


Yes, they could roll their own, but you have no issues with this being necessary? I think the attitude of "just deal with it" is far more negative than someone expressing they are upset with the state of the internet, its controllers, and its abusers.


There's trillions invested in AI. Don't expect any introspective insight or criticism about it.


This is like saying "let's just get rid of all the guns" to solve gun violence and gun crime in the USA. The cat is out of the bag and no one can put it back. We live in a different world now and we have to figure it out.


> Must everything in AI threads be so negative and condescending?

Because if I own a website or a service and it is being degraded or slowed by some third-party tool that wants to slurp its content for its own profit without sharing anything back, I tend to be irritated. And AI apologists/evangelists don't help when they try to justify the behavior.


You can implement this yourself, who is stopping you?


Citation needed


I use iocaine[0] to generate a tarpit. Yesterday it served ~278k "pages" consisting of ~500MB of gibberish (and that's despite banning most AI scrapers in robots.txt.)

[0] https://iocaine.madhouse-project.org
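For a sense of how a tarpit like this works (this is a toy sketch, not iocaine's actual implementation — the word list and link scheme are invented here): every URL deterministically maps to a page of gibberish containing links to yet more gibberish, so a crawler that ignores robots.txt can wander forever without ever seeing real content.

```python
# Toy tarpit page generator (not iocaine itself). Seeding the RNG from the
# request path makes each URL stable across requests, so the maze looks
# like a real (if nonsensical) site to a crawler.
import random

WORDS = ["lorem", "quantum", "synergy", "aether", "gradient", "tarpit",
         "ontology", "flux", "paradigm", "lattice"]

def gibberish_page(path: str, n_words: int = 200, n_links: int = 10) -> str:
    rng = random.Random(path)  # deterministic per-URL content
    body = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = "".join(
        f'<a href="/{rng.getrandbits(32):08x}">more</a> '
        for _ in range(n_links)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Wiring this into an HTTP handler is left out; the point is that generating the maze is nearly free for the server while each fetched page hands the scraper ten more URLs to crawl.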


Can't seem to access this.

It flashes some text briefly then gives me a 418 TEAPOT response. I wonder if it's because I'm on Linux?

EDIT: Begrudgingly checked Chrome, and it loads. I guess it doesn't like Firefox?


Doesn't work on my Firefox either.

Friendly fire, I suppose.


Works on my Firefox, on both Mac and Linux.


Nor Safari on iOS.


Works fine on my iOS Safari - maybe there's some extension that's tickling it just the wrong way?


It still fails with all of my extensions disabled (wipr, privacy redirect). I just get a download dialog. I don't know what the HTTP status code is, however.

I found a flagged HN submission about it and it has just about the same result for me and for others. My first tap failed in a weird way (showed some text then redirected quickly to its git repo) and all subsequent taps trigger a download.

https://news.ycombinator.com/item?id=44538010


Unfortunately, you kind of have to count this as the cost of the Internet. You've wasted 500MB of bandwidth.

I've had colocation for eight-plus years. My monthly bandwidth is now around 20-30GB a month given to scrapers, where I was only using 1-2GB a month in years prior.

I pay for premium bandwidth (it's a thing) and only get 2TB of usable data. Do I go offline or let it continue?


> You've wasted 500MB of bandwidth.

Yep, it sucks, but on the positive side, I'm feeding 500MB of garbage into them every day, and that feels like enough of a small win for me.

> My monthly bandwidth is now around 20-30GB a month given to scrapers [...] 1-2GB a month

That definitely sucks.

> Do I go offline or let it continue?

Might be time to start blocking entire IP ranges and ASNs and see if that helps.
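Range-based blocking can be sketched roughly like this (the prefixes below are RFC 5737 documentation ranges standing in for real scraper networks, and in practice you'd do this in the firewall or reverse proxy rather than application code):

```python
# Sketch of blocking by CIDR prefix, assuming you've already mapped
# scraper ASNs to their announced prefixes. The ranges here are
# placeholder documentation networks, not real scraper ASNs.
from ipaddress import ip_address, ip_network

BLOCKED_RANGES = [
    ip_network("198.51.100.0/24"),  # hypothetical scraper prefix
    ip_network("203.0.113.0/24"),   # hypothetical scraper prefix
]

def is_blocked(client_ip: str) -> bool:
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

As the thread notes further down, this is increasingly a losing game: crawlers that rotate across thousands of ASNs make any static block list rot within days.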


I have no idea what this does because the site is rejecting my ordinary Firefox browser with "Error code: 418 I'm a teapot", even from a private window.

If I hit it with Chrome, now I can see a site.

Seems not ready for prime time, as a lot of my viewers use Firefox.


One of the most popular ones is Anubis. It uses a proof of work and can even do poisoning: https://anubis.techaro.lol/

They even mention iocaine. I know, inconceivable!: https://iocaine.madhouse-project.org/

There's also tons of HN posts on the topic with varying solutions:

https://news.ycombinator.com/item?id=45935729

https://news.ycombinator.com/item?id=45711094

https://news.ycombinator.com/item?id=44142761

https://news.ycombinator.com/item?id=44378127
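The proof-of-work idea behind tools like Anubis can be sketched as follows (this is the general hashcash-style scheme, not Anubis's actual protocol — function names and the challenge format are invented here): the server hands the browser a challenge, the browser must find a nonce whose hash has N leading zero bits, and the server verifies with a single hash. That's cheap for one human page view but expensive at scraper volume.

```python
# Hashcash-style proof of work sketch (not Anubis's real protocol).
# Solving costs ~2**difficulty_bits hash attempts; verifying costs one.
import hashlib

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - difficulty_bits) == 0

def solve(challenge: str, difficulty_bits: int) -> int:
    # Brute-force search the client performs (in Anubis's case, in JS).
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce
```

The asymmetry is the whole point: a browser solves one challenge per visit, while a crawler hammering thousands of pages pays the solving cost over and over.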


Anubis is the only tool that claims to have heuristics to identify bots, but my understanding is that it does this by presenting obnoxious challenges to all users, which is not really feasible. Old-school approaches like IP blocking or even ASN blocking are obsolete: these crawlers deliberately spam from thousands of IPs, and if you block them by a common ASN, they come back a few days later from thousands of unique ASNs. So this is not really a "roll your own" situation, especially if you are running off-the-shelf software that doesn't have a straightforward way to build in these various endless-page-maze approaches (which I would still have to serve anyway).


https://forge.hackers.town/hackers.town/nepenthes

> Citation needed

this reply kinda sucks :)


Unfortunately, Cloudflare often destroys the experience for users with shared connections, VPNs, exotic browsers… I had to remove it from my site after too many complaints.


I am sure Cloudflare would have no problem selling you a VPN service.

After all, it's not very far from hosting booters and selling DoS protection.



Also iCloud Private Relay.

Cloudflare is making it impossible to browse privately.


Cloudflare works fine with Private Relay - they and Fastly provide infrastructure for that service (one half of the blinded pair), so it's definitely something they test.


Not sure "TLS added and removed here :)" as a Service is the right tool in the drawer for this.



Cloudflare also blocks my human-driven browser all the time:

"enable javascript and cookies to continue"

It also reports an unsupported browser.


Savvy move by Cloudflare: once they have enough sites behind their service, they can charge the AI companies to access their cached copies on a back channel.


Modern scrapers are using headless chromium which will not see the invisible links, so I'm not sure how long this will be effective.
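A honeypot variant that doesn't depend on CSS invisibility (and so still works against headless Chromium) is to list a trap path as disallowed in robots.txt, never link it for humans, and flag any client that fetches it anyway. This is a generic sketch, not any particular tool's implementation — the path and function names are invented:

```python
# Robots.txt trap sketch: the trap path appears only in robots.txt as
# Disallowed, so only a crawler that reads and deliberately violates
# robots.txt will ever request it. Once flagged, the IP stays blocked.
TRAP_PATH = "/honeypot-do-not-crawl"
flagged_ips: set = set()

def handle_request(client_ip: str, path: str) -> int:
    """Return an HTTP status code; flag clients hitting the trap."""
    if path == TRAP_PATH:
        flagged_ips.add(client_ip)
    if client_ip in flagged_ips:
        return 403
    return 200
```

The obvious caveat is the one raised upthread: crawlers that rotate IPs per request make any per-IP flag short-lived.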


Which is still a far worse experience than if Cloudflare's services weren't needed.


Except for the scrapers that pay cloudflare to exempt them.


The solution, as always, is noise.


Do you have a link to that?




