LLMs are DRM for information

21 November 2024

Oh man I just realised I'll be having conversations like this in 5 years
"Hey I was trying to find your opening hours but couldn't find them anywhere"
"Really? We submit them to both Gemini and GPT"

(me, earlier)

In this post I want to make a few unhappy observations. I will also attempt to predict an enshittification cycle in advance. Many would say that AI is already enshittified but in this case I'm actually talking about Doctorow's original formulation, not just trying to say "shitty" ornately. At the end I've included an upbeat call to action but we all know it won't do anything. Let's begin.

I believe we are currently experiencing the "good old days" of LLM technology. Tech companies are hoovering up everything on the open internet and elsewhere to feed their models so that data can be regurgitated in different forms. Certainly this is unpleasant but for now it's kind of background noise. The LLMs are a low-resolution screenshot of the broader internet and you can ignore them more or less without peril. Everything you need is available other ways.

The trouble is, this training on all the world's data is no longer enough of a commercial advantage for these companies. Lots of people are making them and the open models are catching up. If you've already ingested everything, where do you go from there?

Obviously you want new and exclusive data. That way your LLM can answer questions that nobody else's can and you become more competitive. However, collecting new forms of contemporary data is complex and expensive. Instead, what if we made individuals and businesses feed in relevant data en masse, all on their own? What if they had a motivation to do so?

Consider Sammy, the owner of a small fictional cafe on an alley in Fitzroy. Her supplies of eggs have dried up for two weeks due to a fictional H5N1 outbreak and she needs to take items off her menu.

"Hey," she says into her Pixel. "Update the public menu information for the cafe. The PDF is in my Google Drive but get rid of all the recipes containing eggs. We hope to have them back on the 22nd."

What better way to get people to feed your model unique data destined for public consumption? Self-interest and a low-friction interface.

Here's where we can get even more devious. A naive Google would interpret Sammy's instructions, update a database of structured data, and display the new information on Google Maps. Evil Google can instead do two things: actively avoid turning this into structured data—simply store it for training—and ensure this information can be retrieved only by directing queries at their own AI tool, like at the top of a Google search.
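To make the contrast concrete, here's a sketch of the two paths Sammy's request could take. Everything here is hypothetical (the identifiers, the menu items, the idea that this is how any real system works); it only illustrates structured-and-scrapable versus unstructured-and-locked-up.

```python
# Hypothetical sketch: what happens to Sammy's spoken request.
# None of these names reflect any real Google API.

request = "Remove all egg recipes from the menu; eggs back on the 22nd"

# "Naive" path: parse the request into structured data. Anyone
# (including competitors and scrapers) could read this back out.
structured_update = {
    "business": "sammys-cafe-fitzroy",       # invented identifier
    "menu_items_removed": ["shakshuka"],     # inferred from the PDF
    "restore_date": "2024-11-22",
}

# "Evil" path: keep the request as raw text and feed it straight
# into training data, so the information ends up existing only
# inside model weights, retrievable only via the company's own AI.
training_corpus = []
training_corpus.append(request)
```

The second path is strictly worse for everyone except the model's owner: the same fact exists, but now there is no record anyone else can query or copy.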

Now their LLM is not only more valuable, it is virtually impossible to scrape. Output is customised for each user and their specific query and circumstances. Keeping information in an unstructured format and accessing it via weights and APIs is a strong defence against anybody else hoovering up your data. It's DRM for information.

So here is how the enshittification happens, maybe.

  1. Be good to users. Best I can tell, this part is still underway, though people are starting to grumble about ChatGPT's quality. Tokens are cheap and there are plenty of optimistic users who feel like the tools are on their side.
  2. Abuse users to make things better for business customers. Businesses pay to get LLMs to spruik their products while making it sound natural(-ish). For a while it will be flexible and inexpensive to get in front of the kind of people who like talking to AIs all the time.
  3. Abuse the business customers. Now the mainstream are talking to AIs to search for information so you have to pay to be there. Worse still, they don't tend to swap between them—a particular person speaks to the same AI all the time so if you want good reach you're compelled to advertise on all the major ones. It's gonna cost you. Advertisers start backing out and profits start drooping.
  4. Then they die. And something of value is lost—all the genuine information that could have been put on a WordPress site in the first place.

What can we do about it? I guess stop giving corporations datasets for free unless you're putting them somewhere else too. At least in my part of the world, OpenStreetMap has terrible information about basic details like business opening hours, so consider contributing those. Also please build great apps on top of truly open data sets like OSM—free apps, proprietary apps, I don't care—the more of everyday computing is based on open and structured data, the less will get locked up inside LLMs.
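For what it's worth, OSM already has a machine-readable convention for exactly the kind of detail Sammy was updating: the `opening_hours` tag. A sketch (the hours themselves are invented):

```
opening_hours=Mo-Fr 07:00-15:00; Sa-Su 08:00-14:00
```

One line of structured, openly licensed data that any app or scraper can read—the opposite of an answer that exists only inside somebody's model weights.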


Tom's Opinions Blog by Thomas Karpiniec