On the 19th of September 2024, developer Robyn Speer announced that she would no longer be updating wordfreq, a Python library that lets users look up the frequency of words in online text across an impressive range of different languages. Wordfreq is a handy open-source utility that’s particularly useful for building natural language processing applications, and generally a signal case of what’s cool about the DIY open web – it leverages the vast scale of the data we collectively generate through our online activity to create a tool that anyone can use.
So, why’s Speer walking away from wordfreq? In her own words: “I don’t think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq’s data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the frequencies.”
To put it a little differently: since the beginnings of the generative AI bonanza around 2022, an increasing proportion of the content appearing on the open web is being created by ChatGPT-like text generators. This is now happening at such a scale that the word frequencies represented in the wordfreq data increasingly reflect the language use of these generative AI programmes, rather than that of human writers. In this sense, the contents of wordfreq can no longer be seen as “natural language”; if generative AI is really, as Ted Chiang put it, a “blurry JPEG” of the Internet, wordfreq would be in danger of turning into a diagram of that copy.
This is a worrying precedent for anyone who does work with language on the web – us included. Whilst we don’t use wordfreq in any of our programs, a lot of what we do is based on the same underlying premise: that the language that exists on the internet represents meaningful communication.
To give an example, we run several tools that analyse text from company websites in order to understand how these organisations express their engagement with policy frameworks (such as ESG or DEI) or specific pieces of legislation (like Wales’s Wellbeing of Future Generations Act). Without getting into the precise mechanics of these systems, the presumption underpinning them is that the decisions that companies make about the language used on their website are meaningful down to a fairly granular scale – including the frequency with which certain words occur! We don’t just read the language on a binary level (do they have an ESG statement yes/no, or are they DEI positive or negative?); we also read it to understand what it might tell us about how a company’s positions have evolved over time, or how they compare to their peers.
Of course, company websites have always contained a large amount of boilerplate text and spammy SEO-speak. But up until now, the use of heavily standardised language still counted as a meaningful datapoint. The fact that thousands of companies used essentially the same text to address a particular issue (privacy, for instance) told you something about how the business community was thinking about and reacting to that problem. On the other hand, if one company uses boilerplate language to talk about its climate commitments while another is concrete and specific, then that could reflect a meaningful distinction between those two companies, at least at the level of messaging.
There have always been powerful technological and economic forces that tend to homogenise language use on the internet. LLMs are not new in this regard, but the wordfreq affair is a sign that their widespread adoption could add up to a qualitative shift in how language works online. As we all surely know by now, LLMs produce text by predicting the most likely next token in the current string, based on the distribution of language in their training data. If certain language patterns are represented more commonly in the data, they are more likely to appear as outputs, which is why text produced by LLMs often feels conservative and bland. Hence Speer’s overwhelming sense of futility. Using LLM-generated text to inform the next iteration of wordfreq would effectively mean training the system on its own outputs: Saturn devouring his children.
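To make that feedback loop concrete, here is a minimal sketch of our own. It is a toy word sampler, not an LLM and not wordfreq's actual pipeline, and the function names, vocabulary and sample sizes are all assumptions chosen for illustration. Each "generation" draws words in proportion to their current frequency, then rebuilds the frequency table from that sample alone.

```python
import random
from collections import Counter


def zipf_counts(vocab_size, top_count=1000):
    """Hypothetical starting corpus: word i occurs roughly top_count / i times."""
    return {f"w{i}": max(1, top_count // i) for i in range(1, vocab_size + 1)}


def resample_generations(counts, generations, sample_size, seed=0):
    """Re-estimate word counts from the model's own sampled output, repeatedly.

    Returns one {word: count} dict per generation, starting with the input.
    """
    rng = random.Random(seed)
    history = [dict(counts)]
    for _ in range(generations):
        words = list(counts)
        weights = [counts[w] for w in words]
        # "Generate" text by drawing words in proportion to current frequency.
        sample = rng.choices(words, weights=weights, k=sample_size)
        # "Retrain": the next table only knows what the last one produced.
        counts = dict(Counter(sample))
        history.append(counts)
    return history


if __name__ == "__main__":
    history = resample_generations(zipf_counts(200), generations=10, sample_size=500)
    for generation, counts in enumerate(history):
        # Print how many distinct words survive in each generation.
        print(generation, len(counts))
```

In this toy setting the vocabulary can only stay the same size or shrink: once a rare word fails to appear in a sample, its count is zero and it can never be drawn again. Real systems are far more complicated, but the direction of travel is the one that worries Speer.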
From our perspective, the problem with studying AI-generated text to ascertain its meaning is that the outputs of an LLM are likely to reflect the internal structure of the program far more than they do the intentions or characteristics of the prompter. Speer cites the well-known example of the word “delve”, which suddenly became a commonly used term across thousands of scientific papers in 2023. Presumably, researchers were using the same LLM (OpenAI’s ChatGPT) to help write their papers, meaning that they all ended up displaying the same linguistic foible. But why is ChatGPT so obsessed with “delve”? The short answer is, we don’t know – although journalists have suggested it could be related to the fact that many of the content moderators employed by OpenAI reside in Nigeria, where “delve” is apparently more frequently used in business-speak. The point is, whilst the prevalence of “delve” could reveal genuinely fascinating insights into how ChatGPT is constructed and managed, it doesn’t really tell us anything about the scientists who wrote those research papers, beyond the fact that they all used the same LLM to describe their findings.
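For readers wondering how a spike like this gets noticed in the first place, one simple approach is to compare a word's rate per million tokens between two bodies of text. The sketch below is ours, not the method used by Speer or by the researchers who tracked “delve”; the function name and the two sample strings are invented for illustration, and it counts exact matches only (so “delves” is not counted as “delve”).

```python
import re
from collections import Counter


def per_million(text, term):
    """Occurrences of `term` per million tokens in `text` (exact match only)."""
    # Crude tokeniser: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] * 1_000_000 / len(tokens)


if __name__ == "__main__":
    # Invented snippets standing in for "before" and "after" corpora.
    before = "We examine the data and report our findings in detail."
    after = "We delve into the data and delve deeper into our findings."
    print(per_million(before, "delve"), per_million(after, "delve"))
```

A measurement like this tells you that usage changed, not why, which is exactly the gap described above.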
Does it matter that scientific researchers – many of whom will not speak English as a first language – are using LLMs to cut down on annoying busywork? Perhaps not, but as the example of wordfreq shows, seemingly trivial first-order phenomena can compound to have serious second-order consequences. A web built by LLMs, it seems, will be simultaneously stranger and duller than what went before, characterised by weird linguistic bugs and tics that spread, unmotivated and unexplained, across otherwise uncorrelated swathes of the internet. As we try to figure out what’s going on, it won’t make sense to look to the owners of the websites hosting the text, or to the structure of the language itself, but to the models that produced it. And since these models will be predominantly owned by massive global corporations, who do not want us peeking at the man behind the curtain, answers are not likely to be forthcoming.
This connects us to another of Speer’s reasons for giving up on wordfreq – the paywalling of linguistic datasets, like Twitter/X and Reddit, that used to be accessible for free. With the rise of LLMs, proprietors of these social sites are heavily incentivised to limit access to the giant linguistic corpuses they control, either so they can use them to train their own gen AIs (as with X’s utterly putrid Grok), or so they can sell them for astronomical sums to OpenAI. So, at the same time that open-source tools like wordfreq are being rendered unviable, the datasets that were used to build them are made accessible only to billionaires. The problem is not simply that the web is being filled with AI-generated spam, but that the means to track and parse this process are being concentrated in the hands of a few highly capitalised companies.
Generative AI represents two simultaneous phenomena – on the one hand, a wave of linguistic pollution as the open web is flooded with AI-generated text, and on the other, a new round of digital enclosures, as resources that used to be commonly accessible are stashed behind billion-dollar drawbridges. The result is not just a homogenisation of speech, but of thought. As Speer remarks, the diverse research field she once knew as “natural language processing” has now been “devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.” Other ways of producing knowledge about the web – and therefore the world – are at risk of being choked off by the dominance of LLMs and generative AI.
Whilst the internet of the past ten years may have been dominated by a handful of big platform providers, there was still enough space in the cracks of that system for smaller actors to do interesting work. These are the spaces that, up until recently, were occupied by people like Robyn Speer and, in our own way, Etic Lab. The end of wordfreq makes us wonder where we’ll fit in the new web currently being built by OpenAI and their “intelligent” machines.