lads

grysbok@lemmy.sdf.org · 2 days ago

You’re right. AI didn’t just triple the traffic to my tiny archive’s site. It way more than tripled it. After implementing Anubis, we went from 3000 ‘unique’ visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That’s before I did any fine-tuning to Anubis, just the default settings.

I was getting constant outage reports. Now I’m not.

For us, it’s not about protecting our IP. We want folks to be able to find information. That’s why we write finding aids, scan material, and accession it. But allowing bots to siphon it all up inefficiently was denying everyone access to it.

And if you think bots aren’t inefficient, explain why Facebook requests my robots.txt 10 times a second.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

How do you know those blocked requests came from AI companies and weren’t for some other purpose?

grysbok@lemmy.sdf.org · edit-2 1 day ago

Timing and request patterns. The increase in traffic coincided with the increase in AI in the marketplace. Before, we’d get hit by bots in waves and we’d just suck it up for a day. Now it’s constant. The request patterns are deep, deep Solr requests, with far more filters than any human would ever use. These are expensive requests, and the results aren’t any more informative than just scooping up the nicely formatted EAD/XML finding aids we provide.
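For anyone who wants to look for the same "too many filters" pattern in their own logs, here is a rough sketch. It assumes the filters show up as repeated `fq` query parameters (Solr's filter-query parameter); a front end sitting in front of Solr may use different names, and the threshold of 4 is an arbitrary guess, not a number from this site.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: filters appear as repeated "fq" parameters. Adjust the names
# to whatever your search front end actually puts in its URLs.
FILTER_PARAMS = ("fq",)


def count_filters(url, filter_params=FILTER_PARAMS):
    """Return how many filter parameters a request URL carries."""
    query = urlsplit(url).query
    pairs = parse_qsl(query, keep_blank_values=True)
    return sum(1 for key, _ in pairs if key in filter_params)


def is_suspiciously_deep(url, max_filters=4):
    """Flag requests with more filters than a human is likely to stack up.

    max_filters is a hypothetical threshold; tune it against real traffic.
    """
    return count_filters(url) > max_filters
```

Run over the request paths in an access log, this gives a quick count of how much traffic is made up of deeply filtered searches.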

And, TBH, I don’t care if it’s AI. I care that it’s rude. If the bots respected robots.txt then I’d be fine with them. They don’t and they break stuff for actual researchers.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

I mean, the number of pirates correlates with global temperature. That doesn’t mean causation.

The rest of the indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day scraping is cheap, but very few companies in the world have access to the processing power to train on that amount of data. That’s why it seems so illogical to me.

How many LLMs that are the result of a full training run do we see per year? Ten? Twenty? Even if they update and retrain often, that’s not compatible with the volume of requests people attribute to AI scraping, a volume that would put services at risk of DoS. Especially since I would think that any AI company would try not to scrape the same data twice.

I have also experienced an increase in bot requests on my host. But I just think it’s a result of the internet getting bigger, with more people using it with more diverse intentions, some ill, some not. I’ve also experienced a big increase in probing and attack attempts in general, and I don’t think it’s OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.

grysbok@lemmy.sdf.org · 1 day ago

I just looked at my log for this morning. 23% of my total requests were from the useragent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. That’s the crawlers that are still trying after I’ve had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site, instead of being blocked.
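A per-user-agent share like the 23% figure can be pulled out of an access log with something along these lines. This is only a sketch: it assumes the common "combined" log format, where the user agent is the last quoted field on each line, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, with the user agent as the final
# double-quoted field on each line.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def user_agent_shares(lines):
    """Return {user_agent: fraction_of_requests} for an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {agent: n / total for agent, n in counts.items()}
```

Passing an open log file (`user_agent_shares(open(path))`, with `path` pointing at your own log) and sorting the result by value shows which self-declared crawlers dominate. It says nothing about bots that send a browser user agent.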

That doesn’t include the bots that lie about being bots. Looking back at an older screenshot of a monitor (I don’t have the logs themselves anymore), I seriously doubt I had 43,000 unique visitors using Windows per day in March.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

Why would they request the same data so many times a day if the objective was AI model training? It makes zero sense.

Also, Google’s bots obey robots.txt, so they are easy to manage.

There may be tons of reasons Google is crawling your website, from ad research to any other kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if users got the info through Google’s AI response they would not visit the website. I suppose that would suck for the website owner, but it won’t drastically increase the number of requests.

But for training I don’t see it; there’s no need at all to keep constantly scraping the same site for model training.

grysbok@lemmy.sdf.org · edit-2 1 day ago

Like I said, [edit: at one point] Facebook requested my robots.txt multiple times a second. You’ve not convinced me that bot writers care about efficiency.

[edit: they’ve since stopped, possibly because now I give a 404 to anything claiming to be from facebook]
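The thread doesn't say how that 404 rule is implemented (it may well live in the web server or proxy config). As an illustration only, the same idea can be sketched as a small WSGI middleware in Python; the function name and the `"facebook"` substring match are assumptions.

```python
def block_by_user_agent(app, blocked_substrings=("facebook",)):
    """Wrap a WSGI app so matching user agents get a 404 instead of content.

    blocked_substrings is a hypothetical list, matched case-insensitively
    against the User-Agent header.
    """
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(fragment in agent for fragment in blocked_substrings):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found\n"]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

This only catches clients that announce themselves honestly; anything sending a browser user agent passes straight through.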

The Quuuuuill@slrpnk.net · 1 day ago

You’ve not convinced me that bot writers care about efficiency.

and why should bot writers care about efficiency when what they really care about is time. they’ll burn all your resources without regard simply because they’re not who’s paying

grysbok@lemmy.sdf.org · 1 day ago

Yep, they’ll just burn taxpayer resources (me and my poor servers) because it’s not like they pay taxes anyway (assuming they are either a corporation or not based in the same locality as I am).

There’s only one of me and if I’m working on keeping the servers bare minimum functional today I’m not working on making something more awesome for tomorrow. “Linux sysadmin” is only supposed to be up to 30% of my job.

then_three_more@lemmy.world · 1 day ago

Does it matter what the purpose was? It was still causing them issues hosting their site.

daniskarma@lemmy.dbzer0.com · 1 day ago

Not really. I only ask because people always say it’s for LLM training, which seems a little illogical to me, knowing the small number of companies that have access to the computer power to actually do a training with that data. And big companies are not going to scrape the same resource hundreds of times for a piece of information they already have.

But I think people should be more critical in trying to understand who is making the requests and for what purpose. Then people could make a better-informed decision about whether they need that system (which is very intrusive for the clients) or not.

then_three_more@lemmy.world · 1 day ago

knowing the small number of companies that have access to the computer power to actually do a training with that data

the 70,717 AI startups worldwide

https://edgedelta.com/company/blog/ai-startup-statistics

Not every company will be training a model as big as the big names, but combined that’s a hell of a lot.

daniskarma@lemmy.dbzer0.com · edit-2 1 day ago

Most of those companies are what’s called “GPT wrappers”. They don’t train anything. They just wrap an existing model or service into their software. AI is a trendy word that gets quick funding, so many companies will say they are AI-related even if they are just making an API call to ChatGPT.

For the few that will attempt to train something, there is already a wide variety of datasets for AI training. Or they may try to get data on a very specific topic. But in order to be scraping the bottom of the pan so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters, something that only about 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process and train on that.

Also, if some company is willing to waste a ton of energy training on some data, then doing some PoW to obtain that data, while it would be an inconvenience, is not going to stop them, I think. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it prevents browsing with JavaScript disabled.