Most of those companies are what's called "GPT wrappers". They don't train anything. They just wrap an existing model or service into their software. AI is a trendy word that attracts quick funding, so many companies will say they are AI-related even if they are just making an API call to ChatGPT.
For the few that will attempt to train something, there is already a wide variety of datasets for AI training. Or they may try to get data on a very specific topic. But to be scraping the bottom of the barrel so hard that you need to scrape some little website, you need to be talking about a model with a massive number of parameters. That is something only about 5 companies in the world would actually need to improve their models. The rest of the people trying to train a model are not going to try to scrape the whole internet, because they have no way to process that much data and train on it.
Also, if some company is willing to burn a ton of energy training on some data, then doing some PoW to obtain that data would be an inconvenience, but I don't think it will stop them. They are literally building nuclear plants for training; a little crypto challenge is nothing in comparison. But it can be quite intrusive for legitimate users. For starters, it blocks browsing with JS disabled.
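For anyone unfamiliar, the kind of PoW challenge I mean looks roughly like this. It's a minimal hashcash-style sketch in Python, not what any particular anti-scraping tool actually ships; the names `solve`, `verify`, and `difficulty_bits` are made up for illustration.

```python
import hashlib
from itertools import count


def _meets_target(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    # Hash the challenge together with the nonce and check that the result,
    # read as a 256-bit integer, has at least `difficulty_bits` leading zero bits.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    # Client side: brute-force nonces until one meets the target.
    for nonce in count():
        if _meets_target(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    # Server side: a single hash is enough to check the client's answer.
    return _meets_target(challenge, nonce, difficulty_bits)


if __name__ == "__main__":
    nonce = solve("example-challenge", 16)
    print(nonce, verify("example-challenge", nonce, 16))
```

The client has to try about 2^difficulty_bits nonces on average, while the server checks the answer with one hash. In a browser that search typically runs in JavaScript, which is why the page can't load with JS turned off.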