I wonder if this will be one of those situations where, because everyone else does the wrong thing, the people who try to do the right thing actually end up worse off. Mostly because if models are not trained on European data, the internet and cultural hegemony of the US will go into overdrive, with all models churning out only US-centric views.
IMHO when the gap is too wide it sticks differently: it creates a caricature or a utopia, much as Hollywood created an imaginary US utopia everywhere in the world or anime culture created a caricature of Japanese people.
I don't know if it is a good or a bad thing overall. I remember idealising the USA as a teenager, not only for its technological might but also imagining that everybody lives in large detached homes, only to find out later that Americans who live like that can't do anything without a car and don't have groceries, restaurants, cafes, museums or cinemas within walking distance.
My bet is that it will simply stick differently and will create an image of America through which Europeans will reflect on themselves and have conversations about how things can be done differently.
I don't believe that hearing different viewpoints automatically makes you subscribe to them, and that's why I'm very anti-censorship and think it's a mistake to block or remove content from the internet. Real people don't get influenced the way priests influence villagers in Age of Empires 2.
Actually, I think a bit of cultural decoupling from the US will have positive effects, as the US itself is in a cultural crisis.
> the internet and cultural hegemony of the US will go into overdrive with all models churning out only US-centric views
I saw a fascinating outcome of this recently. A Polish speaker made an IRS joke to a room full of Europeans at a conference in Europe. Everyone got it.
Do you really live in Poland? I'm a foreigner (well, not any more) and I've always called it urzad skarbowy, even when speaking with other foreigners. If I called it the IRS, no one would understand.
I'm in the weird situation where, in my country, the IRS is also called the IRS. The letters stand for words in my language for "income tax", but it ends up being the same acronym, and it's used interchangeably with the actual name of the government entity.
That can be an issue, although - I think - the European approach will be to block private and semi-private data (like fb posts), but allow public data (articles etc).
I can imagine models getting their knowledge and intelligence from training on global (English and Chinese) data, and then learning country-specific languages and viewpoints from much smaller datasets.
You can see that already: GPT was trained on far less Polish-language material than English, but it's almost as smart and fluent in both.
As for cultural quirks, here it's more about who does the RLHF and alignment than about the source material.
That isn't the European approach. From the article:
"Meta on Monday said it hoped to use Europeans' data to train its models. It promised to only use public posts and comments — not private chats and DMs — and to not use any content from anyone under the age of 18. Crucially, the biz said it would give Euro folks a chance to opt out; a safeguard not extended to the rest of the world."
Private messages have never been in play here. The regulators are arguing that stuff posted for the whole world to see is not OK to use for training purposes if it was posted to Facebook, but it is OK if it was posted to a blog.
Nah, China already sealed off their internet, as did Russia and, to a degree, India. If we end up with a dead US-centric internet, well, as soon as people start ignoring it, it wouldn't be a real loss to humankind. Until we reach that point, though, we are in deep shit. And LLM spam and crap content being US-centric is the least of our problems.
Bandwidth in and out of the country is heavily throttled in both directions and packet filtering applies on both. It's very hard to crawl the Chinese internet from the west and it has always been that way. If you want to do it you have to do the crawl from inside China, and then you're open to having your datacenters raided and software stack stolen (this isn't a theoretical concern).
Because US tech companies have no restrictions on selling US data to EU consumers.
And how can EU consumers know which of the data they get is US data and which is EU data? I'm European, but almost everything I write online, even about my home country, including this comment you're reading, is in English on US-developed platforms, not in my mother tongue. Now, is that US data or EU data?
Don't feel so puffed up, my American friend. We really don't want to use it, but we have no other choice.
AI is devaluing you as an employee. We are only at the beginning of the AI apocalypse. If things keep going this way, all that is left to meatbags like you or me will be hard manual labour under strict AI supervision. This is not the future we asked for.