The SQLite team faced a similar problem last year, and Richard Hipp (the creator of SQLite) made almost the same comment:
"The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things...."
I am with you in that this rhetoric is getting exhausting.
In this particular case, though, I don't think "evil" is a moral claim so much as shorthand for cost-externalizing behavior. Hammering expensive dynamic endpoints with millions of unique requests isn't neutral automation; it's degrading a shared public resource. Call it evil, antisocial, or extractive, the outcome is the same.
Sounds like you have zero empathy for the real costs AI is driving and the feelings this creates for website owners. How about you pony up and pay for your scraping?
Why don’t you take a moment to explain to the class why you think web crawling means you can’t cache anything?
It seems to me that the very first thing I’d try to solve if I were writing a tool for an LLM to search the web, would be caching.
An LLM should have to go through a proxy to fetch any URL. That proxy should be caching results. The cache should be stored on the LLM’s company’s servers. It should not be independently hitting the same endpoint repeatedly any time it wants to fetch the same URL for its users.
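To make the idea concrete, here is a minimal sketch of that kind of fetch-through cache in Python. Everything in it is my own illustration, not anyone's actual crawler: the `CachingFetcher` class, the `http_fetch` helper, and the one-hour TTL are all assumed names and values. A real proxy would also want to honor the origin's HTTP caching headers and persist to disk rather than a dict.

```python
import time
import urllib.request
from typing import Callable, Dict, Tuple


class CachingFetcher:
    """Hypothetical fetch-through cache: at most one upstream request
    per URL per TTL window, no matter how many callers ask for it."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 3600.0,  # assumed value, tune per site
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # url -> (time the body was stored, body)
        self._store: Dict[str, Tuple[float, bytes]] = {}
        self.upstream_requests = 0

    def get(self, url: str) -> bytes:
        now = self._clock()
        entry = self._store.get(url)
        # Serve from cache while the entry is still fresh.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise hit the origin once and remember the result.
        body = self._fetch(url)
        self.upstream_requests += 1
        self._store[url] = (now, body)
        return body


def http_fetch(url: str) -> bytes:
    """Plain stdlib fetch, used as the default upstream."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()
```

Usage would be something like `fetcher = CachingFetcher(http_fetch)` followed by `fetcher.get(url)` for every request. Repeated calls for the same URL inside the TTL never reach the origin server, which is the whole point.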
Is it expensive to cache everything the LLM fetches? You betcha. Can they afford to spend some of the billions they have for capex to buy some fucking hard drives? Absolutely. If archive.org can do it via funding from donations, a trillion dollar AI company should have no problem.
There are people behind the web crawler. If they’re so well funded, they can exert a little effort to not so badly inconvenience people as they steal their training data.
I've downvoted you for being incredibly aggressive in your responses. I'm not sure why you're ad homineming the parent commenter, but it's not helping the discussion.
I don’t even really get what they are saying. I am also saying that they are hostile, and with all of their money they can afford to not be hostile. So I feel like we agree?
"The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things...."
https://sqlite.org/forum/forumpost/7d3eb059f81ff694