lads

daniskarma@lemmy.dbzer0.com · edit-2 2 days ago

That whole thing rests on two wrong assumptions.

It assumes that websites are under constant DDoS and that they cannot exist without DDoS protection.

This is false.

It assumes that Anubis is effective against DDoS attacks, which it is not. It is a mitigation, but any DDoS attack worth its name would have no trouble bringing down a site behind Anubis, as the server still has to handle the requests even if they are smaller requests.

Anubis’s only use case is to make AI scrappers consume more energy while scrapping, while also making many legitimate users use more energy. It’s just being promoted in the anti-AI wave, but I don’t really see much usefulness in it.

rtxn@lemmy.world · edit-2 2 days ago

It assumes that websites are under constant DDoS

It is literally happening. https://www.youtube.com/watch?v=cQk2mPcAAWo https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

It assumes that Anubis is effective against DDoS attacks

It’s being used by some little-known entities like the LKML, FreeBSD, SourceHut, UNESCO, and the fucking UN, so I’m assuming it probably works well enough. https://policytoolbox.iiep.unesco.org/ https://xeiaso.net/notes/2025/anubis-works/

anti-AI wave

Oh, you’re one of those people. Enough said. (edit) By the way, Anubis’ author seems to be a big fan of machine learning and AI.

(edit 2 just because I’m extra cross that you don’t seem to understand this part)

Do you know what a web crawler does when a process finishes grabbing the response from the web server? Do you think it takes a little break to conserve energy and let all the other remaining processes do their thing? No, it spawns another bloody process to scrape the next hyperlink.

daniskarma@lemmy.dbzer0.com · edit-2 2 days ago

Some websites being under DDoS attack =/= all sites are under constant DDoS attack, nor does it mean sites cannot exist without DDoS protection.

First, there’s a logical fallacy in there. Being used by someone does not mean it’s useful. Many companies use AI for some task; does that make AI useful? No.

The logic is still there: all Anubis can do against DDoS is raise the barrier a little before the site goes down. That’s called mitigation, not protection. If you are targeted by a DDoS, that mitigation is not going to do much, and your site is going down regardless.

CanadaPlus@lemmy.sdf.org · edit-2 2 days ago

If a request is taking a full minute of user CPU time, it’s one hell of a mitigation, and anybody who’s not a major corporation or government isn’t going to shrug it off.

daniskarma@lemmy.dbzer0.com · edit-2 2 days ago

That’s precisely my point. It fits a very small risk profile: people who are going to be DDoSed, but not by a big actor.

It’s not the most common risk profile. Usually DDoS attacks are very heavy or don’t happen at all. These “half gas” DDoS attacks are not really common.

I think that’s why, when I read about Anubis, it is never in a context of DDoS protection. It’s always in a context of “let’s fuck AI”, like this precise line of comments.

CanadaPlus@lemmy.sdf.org · edit-2 1 day ago

There’s heavy, and then there’s heavy. I don’t have any experience dealing with threats like this myself, so I can’t comment on what’s most common, but we’re talking about potentially millions of times more resources for the attacker than the defender here.

There is a lot of AI hype and AI anti-hype right now, that’s true.
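To make the attacker/defender asymmetry discussed above a bit more concrete, here is a minimal hashcash-style sketch. It is not Anubis’s actual code: the challenge string, the difficulty value, and the function names are all assumptions made up for illustration. The client has to try many nonces until the SHA-256 digest has enough leading zero bits, while the server checks the answer with a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, no matter how hard the puzzle was to solve."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> tuple[int, int]:
    """Client side: brute-force nonces; returns (nonce, hashes tried)."""
    for attempts, nonce in enumerate(itertools.count(), start=1):
        if verify(challenge, nonce, difficulty_bits):
            return nonce, attempts
    raise RuntimeError("unreachable: itertools.count() never ends")


if __name__ == "__main__":
    # Hypothetical challenge and difficulty, chosen only for the demo.
    nonce, attempts = solve("example-challenge", 12)
    print(f"client tried {attempts} hashes, server verifies with 1")
```

On average the client needs about 2^difficulty_bits hashes, so each extra bit doubles the client’s work while the server’s check stays at one hash. Whether that ratio reaches “millions” depends entirely on the difficulty a site picks, and, as noted further down the thread, it only costs clients that actually run the challenge.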

setVeryLoud(true);@lemmy.ca · 1 day ago

I do. I have a client with a limited budget whose websites I’m considering putting behind Anubis because they’re getting hammered by AI scrapers.

It comes in waves, too, so the website may randomly go down or slow down significantly, which is really annoying because it’s unpredictable.

daniskarma@lemmy.dbzer0.com · 1 day ago

I don’t think it’s millions. Take into account that a DDoS attacker is not going to execute JavaScript code, at least not a competent one, so they are not going to run the PoW.

In fact, the unsolicited and unannounced PoW does not provide more protection against DDoS than a captcha.

The mitigation comes from the smaller and easier responses the server has to send, so the number of requests needed to saturate the service must increase. How much? It depends on how demanding the “real” website would be in comparison. I doubt the answer is millions. And they would achieve the exact same result with a captcha without running literal malware on the clients.

CanadaPlus@lemmy.sdf.org · 1 day ago

It depends on how demanding the “real” website would be in comparison. I doubt the answer is millions.

The one service I regularly see using something like this is Invidious. I can totally get how even a bit of bot traffic would make the host’s life really hard.

It’s true a captcha would achieve something similar, if we assume a captcha-solving AI has a certain minimum cost. That means typical users will have to do a lot more work, though, which is why creepy things like Cloudflare have become popular, and I’m not sure what the advantages are.

daniskarma@lemmy.dbzer0.com · 1 day ago

Cloudflare has a clear advantage in the sense that it can put the door away from the host and can distribute the attacks across thousands of servers. It is also able to analyze attacks from its position of seeing half the internet, so it can develop and implement very efficient block lists.

I’m the first one who is not a fan of Cloudflare, though. So I use CrowdSec, which builds community blocklists based on user statistics.

PoW as bot detection is not new. It has been around for ages, but it has never been popular because there have always been better ways to achieve the same or even better results. Captcha may be more intrusive for the user, but it can actually deflect bots completely (even the best AI could be unable to solve a well-made captcha), while PoW only introduces an energy penalty, expecting it to act as a deterrent.

My bet is that Invidious is under constant Google attack for obvious reasons. It’s a hard situation to be in overall. It’s true that they are a very particular use case, with both a lot of users and bots interested in their content, very resource-heavy content, and also being the target of one of the biggest corporations in the world. I suppose Anubis could act as mitigation there, at the cost of being less user friendly. And if YouTube goes and does the same, it would really make for a shitty experience.

ℍ𝕂-𝟞𝟝@sopuli.xyz · 2 days ago

Websites were under a constant noise of malicious requests even before AI, but now AI scraping of Lemmy instances usually triples traffic. While some sites can cope with this, this means a three-fold increase in hosting costs in order to essentially fuel investment portfolios.

AI scrapers will already use as much energy as is available, so making them use more per site means fewer sites being scraped, not more total energy used.

And this is not DDoS, the objective of scrapers is to get the data, not bring the site down, so while the server must reply to all requests, the clients can’t get the data out without doing more work than the server.

daniskarma@lemmy.dbzer0.com · 2 days ago

AI does not triple traffic. It’s a completely irrational statement to make.

There’s a very limited number of companies training big LLMs, and these companies train a model a few times per year. I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

grysbok@lemmy.sdf.org · 2 days ago

You’re right. AI didn’t just triple the traffic to my tiny archive’s site. It way more than tripled it. After implementing Anubis, we went from 3000 ‘unique’ visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That’s before I did any fine-tuning to Anubis, just the default settings.

I was getting constant outage reports. Now I’m not.

For us, it’s not about protecting our IP. We want folks to get to find our information. That’s why we write finding aids, scan it, accession it. But allowing bots to siphon it all up inefficiently was denying everyone access to it.

And if you think bots aren’t inefficient, explain why Facebook requests my robots.txt 10 times a second.

daniskarma@lemmy.dbzer0.com · edit-2 2 days ago

How do you know those reduced requests were from AI companies and not for any other purpose?

grysbok@lemmy.sdf.org · edit-2 1 day ago

Timing and request patterns. The increase in traffic coincided with the increase in AI in the marketplace. Before, we’d get hit by bots in waves and we’d just suck it up for a day. Now it’s constant. The request patterns are deep deep solr requests, with far more filters than any human would ever use. These are expensive requests and the results aren’t any more informative than just scooping up the nicely formatted EAD/XML finding aids we provide.

And, TBH, I don’t care if it’s AI. I care that it’s rude. If the bots respected robots.txt then I’d be fine with them. They don’t and they break stuff for actual researchers.
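For reference, this is roughly what “respecting robots.txt” looks like from the crawler’s side, sketched with Python’s standard `urllib.robotparser`. The rules and paths below are hypothetical, not this archive’s real robots.txt; a polite crawler would run a check like this before every fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let this user agent fetch the path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/finding-aids/collection.xml"),
        ("SomeBrowser", "/finding-aids/collection.xml"),
        ("SomeBrowser", "/search?q=letters"),
    ]:
        print(agent, path, allowed(agent, path))
```

Nothing enforces this on the server; it only works if the client chooses to run the check, which is the whole complaint.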

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

I mean, the number of pirates correlates with global temperature. That doesn’t mean causation.

The rest of the indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day scraping is cheap, but very few companies in the world have access to the processing power to train on that amount of data. That’s why it seems so illogical to me.

How many LLMs that are the result of a full training run are we seeing per year? Ten? Twenty? Even if they update and retrain often, that’s not compatible with the amount of requests people are attributing to AI scraping, an amount that would put services at risk of DoS. Especially when I would think that any AI company would try not to scrape the same data twice.

I have also experienced an increase in bot requests on my host. But I just think it is a result of the internet getting bigger, with more people using the internet with more diverse intentions, some ill, some not. I’ve also experienced a big increase in probing and attack attempts in general, and I don’t think it’s OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.

grysbok@lemmy.sdf.org · 1 day ago

I just looked at my log for this morning. 23% of my total requests were from the useragent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. That’s the crawlers that are still trying after I’ve had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site, instead of being blocked.

That doesn’t include the bots that lie about being bots. Looking back at an older screenshot of a monitor (I don’t have the logs themselves anymore), I seriously doubt I had 43,000 unique visitors using Windows per day in March.
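A rough sketch of how a per-user-agent breakdown like the one above could be computed. It assumes a combined-format access log where the user agent is the last quoted field on each line; the sample lines are invented for the example and are not the real logs.

```python
import re
from collections import Counter

# Assumption: the user agent is the last double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines in combined log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jul/2025:10:00:00 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "GoogleOther"',
    '203.0.113.9 - - [01/Jul/2025:10:00:01 +0000] "GET /search?q=b HTTP/1.1" 200 512 "-" "GPTBot"',
    '198.51.100.7 - - [01/Jul/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "ExampleBrowser"',
    '198.51.100.8 - - [01/Jul/2025:10:00:03 +0000] "GET /about HTTP/1.1" 200 2048 "-" "ExampleBrowser"',
]


def user_agent_shares(lines):
    """Return {user_agent: fraction of all requests} for the given log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        counts[match.group(1) if match else "(unparsed)"] += 1
    total = sum(counts.values())
    return {ua: n / total for ua, n in counts.items()} if total else {}


if __name__ == "__main__":
    shares = user_agent_shares(SAMPLE_LOG)
    for ua, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{share:6.1%}  {ua}")
```

This only counts what clients claim to be, so it says nothing about the bots that lie about being bots.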

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.

Also, Google’s bots obey robots.txt, so they are easy to manage.

There may be tons of reasons Google is crawling your website, from ad research to any other kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if the user got the info through Google’s AI response then they would not visit the website. I suppose that would suck for the website owner, but it won’t drastically increase the number of requests.

But for training I don’t see it; there’s no need at all to keep constantly scraping the same site for model training.

then_three_more@lemmy.world · 1 day ago

Does it matter what the purpose was? It was still causing them issues hosting their site.

daniskarma@lemmy.dbzer0.com · 1 day ago

Not really. I only ask because people always say it’s for LLM training, which seems a little illogical to me, knowing the small number of companies that have access to the computing power to actually do training with that data. And big companies are not going to scrape the same resource hundreds of times for a piece of information they already have.

But I think people should be more critical in trying to understand who is making the requests and for what purpose. Then people could make a better-informed decision on whether they need that system (which is very intrusive for the clients) or not.

then_three_more@lemmy.world · 1 day ago

knowing the small number of companies that have access to the computing power to actually do training with that data

the 70,717 AI startups worldwide

https://edgedelta.com/company/blog/ai-startup-statistics

Not every company will be training a model as big as the big names, but combined that’s a hell of a lot.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

Most of those companies are what’s called “GPT wrappers”. They don’t train anything. They just wrap an existing model or service into their software. AI is a trendy word that gets quick funding, so many companies will say they are AI related even if they are just making an API call to ChatGPT.

For the few that will attempt to train something, there is already a wide variety of datasets for AI training. Or they may try to get data on a very specific topic. But in order to be scraping the bottom of the pan so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters, something that only like 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process and train on that.

Also, if some company is willing to waste a ton of energy training on some data, then doing some PoW to obtain that data would be an inconvenience, but I don’t think it will stop them. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it prevents browsing with JS deactivated.

ℍ𝕂-𝟞𝟝@sopuli.xyz · 2 days ago

AI does not triple traffic. It’s a completely irrational statement to make.

Multiple testimonials from people who host sites say it does. Multiple Lemmy instances also supported this claim.

I would bet that the number of requests per year for a resource by an AI scrapper is in the dozens at most.

You obviously don’t know much about hosting a public server. Try dozens per second.

There is a booming startup industry all over the world training AI, and scraping data to sell to companies training AI. It’s not just Microsoft, Facebook and Twitter doing it, but also Chinese companies trying to compete. Also companies not developing public models, but models for internal use. They all use public cloud IPs, so the traffic is coming from all over incessantly.

Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

It means that if Microsoft buys a server for scraping, they are going to be running it 24/7, with the CPU/network maxed out, at maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites. If it can scrape 1000, it will scrape 1000, and if it can do 10, it will do 10.

It will not stop scraping ever, as it is the equivalent of shutting down a production line. Everyone always uses their scrapers as much as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more “smart” and less “hard” at scraping and training AI.
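The fixed-budget argument can be put into numbers with a toy calculation. All figures here are invented for illustration: a scraper that is always maxed out has a fixed amount of CPU time per minute, so adding a per-site PoW cost lowers how many sites it gets through rather than raising how much energy it burns.

```python
def sites_per_minute(cpu_seconds_per_minute: float,
                     fetch_cost_s: float,
                     pow_cost_s: float = 0.0) -> float:
    """Sites a maxed-out scraper can process with a fixed CPU budget."""
    return cpu_seconds_per_minute / (fetch_cost_s + pow_cost_s)


if __name__ == "__main__":
    budget = 60.0  # hypothetical: one fully used core, 60 CPU-seconds per minute
    without_pow = sites_per_minute(budget, fetch_cost_s=0.5)
    with_pow = sites_per_minute(budget, fetch_cost_s=0.5, pow_cost_s=1.5)
    # The CPU-seconds burned are the same in both cases; only throughput changes.
    print(f"without PoW: {without_pow:.0f} sites/min")
    print(f"with PoW:    {with_pow:.0f} sites/min")
```

The model ignores network limits and assumes the operator does not simply buy more servers, which is the point the other side of the thread disputes.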

Oh, and it’s S-C-R-A-P-I-N-G, not scrapping. It comes from the word “scrape”, meaning to remove the surface from an object using a sharp instrument, not “scrap”, which means to take something apart for its components.

daniskarma@lemmy.dbzer0.com · 2 days ago

I’m not a native English speaker, so I apologize if there’s bad English in my response, and I’d be thankful for any corrections.

That being said, I do host public services, and did so both before and after AI was a thing. And I have asked many of the people who claim “we are under AI bot attacks” how they are able to tell whether a request is from an AI scraper or just any other scraper, and there was no satisfying answer.

ℍ𝕂-𝟞𝟝@sopuli.xyz · 1 day ago

Yeah but it doesn’t matter what the objective of the scraper is, the only thing that matters is that it’s an automated client that is going to send mass requests to you. If it wasn’t, Anubis would not be a problem for it.

The effect is the same, increased hosting costs and less access for legitimate clients. And sites want to defend against it.

That said, it is not mandatory, you can avoid using Anubis as a host. Nobody is forcing you to use it. And as someone who regularly gets locked out of services because I use a VPN, Anubis is one of the least intrusive protection methods out there.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

It’s very intrusive in the sense that it runs a PoW challenge, unsolicited, on the client. That’s literally like having a cryptominer running on your computer for each challenge.

Everyone can do what they want with their server, of course. But I’m very fond of scraping. For instance, I have FreshRSS running on my server, and the way it works is that when the target website doesn’t provide an RSS feed, it scrapes the site to get the articles. I also have another service that scrapes to detect page changes.

I think part of the beauty of the internet is being able to automate processes, and software like Anubis puts a globally significant energy tax on these automations.

Once again, everyone is able to do whatever they want with their server. But the thing I like the least is that the software is being promoted with some great PR as part of some great anti-AI crusade, I don’t know if by the devs themselves or by some other party. And I don’t like this mostly because I think it is disinformation and just manipulative towards people who are maybe easy to manipulate if you say the right words. I also think that it’s a discourse that pushes towards radicalization on certain topics, and I’m a firm believer that right now we need to reduce radicalization overall, not increase it.

xthexder@l.sw0.com · edit-2 1 day ago

A proof of work challenge is infinitely better than the alternative of “fuck you, you’re accessing this through a VPN and the IP is banned for being owned by Amazon (or literally any data center)”