Opinion

The giant machine for plagiarism is now complaining that it has been plagiarised. Oh, the irony

DeepSeek may have been trained on OpenAI’s own software, it is claimed, but it sounds like fair use to me

Sam Altman, chief executive of OpenAI, at the firm’s headquarters in San Francisco. Photograph: Jim Wilson/The New York Times

Sam Altman, the co-founder and chief executive of OpenAI, has for many years now been among the most prominent over-inflators of expectations and fears around the technology we have come to refer to as artificial intelligence. (I put it like this because to even use this term, as unavoidable as it has become, is to be complicit in this over-inflation.)

As the public face of the company that makes ChatGPT, and which has thereby become synonymous with this understanding of AI, Altman’s job is to project to the world in general, and to investors in particular, a vision of a singular technology that will effect profound and unprecedented changes.

His claims have always tended toward the grandiose: a worst-case scenario in which an improperly coded all-powerful AI will wipe out humanity, or, at the other end of the scale, a new age of enlightenment, in which machine intelligence brings about prosperity for all, and transformative innovations such as technological solutions to the climate crisis.

We would do well to treat these claims with scepticism, if not outright contempt; they are intended to frame Altman as both sorcerer and sage – urging caution about the Promethean power he himself has snatched from the gods – and to advertise to investors the near-limitless market value of that power.


In an interview last year, Altman made the assertion – a comparatively measured one by his standards – that, in the long term, given the power of the technology he and people like him are building, “the whole structure of society itself will be up for some degree of debate and reconfiguration”.

The interview resurfaced earlier this week, and circulated widely in certain tech-focused corners of social media, in the midst of the stock market freakout over the launch of a generative AI model called R1.

DeepSeek, the Chinese start-up that created R1, reportedly built the LLM from scratch for about $5.6m (€5.4m). It does in effect the same thing as OpenAI’s latest ChatGPT model, and does it roughly as well, despite OpenAI receiving investment of more than $17.9bn.

The DeepSeek app rapidly became a top downloaded application. Photograph: Mladen Antonov/Getty

It was, to say the least, a bad day at the office – not just for Altman and OpenAI, but for Nvidia, the American multinational that manufactures the processors used for machine-learning software. The US government has gone out of its way to prevent Chinese developers from getting their hands on Nvidia’s chips, in an effort to ensure American dominance in AI.

That a Chinese company with fewer than 200 employees was able to build a viable competitor to ChatGPT, and to do it for about the kind of money you’d need to open a cocktail bar around the corner from OpenAI’s Mission District HQ, seemed to upend the paradigm of the entire AI racket.

About $589bn was wiped off Nvidia’s market value overnight. The entire US utilities sector, which had been riding high on the expectation of an energy-hungry AI boom, took a very deep bath in some very cold water – water that may no longer be needed to cool those gigantic server farms.

A trader at the New York Stock Exchange. The arrival of the Chinese AI startup DeepSeek sparked a sell-off in tech stocks this week. Photograph: Michael M. Santiago/Getty

Perhaps the most compelling aspect of the whole fiasco, though, was OpenAI’s subsequent claim that DeepSeek may have used OpenAI’s proprietary model to train R1.

A spokesperson for the company told the Financial Times that there was evidence that DeepSeek may have employed a practice known as “distillation” – using a larger model’s outputs to train a smaller one – to speed-run the construction of its generative AI. Such a practice, the company says, is in direct contravention of its terms of service, and violates its intellectual property.

David Sacks, Trump’s newly appointed AI and crypto tsar, described distillation as “when one model kind of sucks the knowledge out of the parent model”; there was, he claimed, “substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI models, and I don’t think OpenAI is very happy about this”.
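DeepSeek has published no distillation recipe, and OpenAI has offered no technical detail, so any illustration here is necessarily generic. But the mechanism Sacks gestures at is standard machine-learning practice: a small “student” model is trained to mimic the softened output probabilities of a large “teacher” model, rather than learning from raw data alone. A minimal sketch of that loss, with made-up logits purely for illustration:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature softens the
    distribution, exposing more of the teacher's 'dark knowledge'."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    outputs. Minimising this trains the student to imitate the teacher."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student))

# A student that already agrees with the teacher incurs a lower loss
# than one that disagrees, so gradient descent pulls it into agreement.
teacher = [5.0, 1.0, 0.5]
agreeing_student = [4.0, 1.0, 0.5]
disagreeing_student = [0.5, 1.0, 5.0]
assert distillation_loss(teacher, agreeing_student) < \
       distillation_loss(teacher, disagreeing_student)
```

The alleged twist in DeepSeek’s case is that the teacher’s outputs would have been harvested not from its own model but from another company’s API responses.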

I don’t know much about intellectual property law, but this sounds like fair use to me.

ChatGPT 'learns' from material published online, answers requests from humans and is even able to produce academic essays. Photograph: John Walton/PA

I do, however, know a little bit about dramatic irony. And it seems to me that what we have here is one company, whose product might be understood as a vast mechanism of automated plagiarism, claiming that that vast mechanism of automated plagiarism has been plagiarised by another company to make a smaller mechanism of automated plagiarism.

OpenAI’s technology was built, and is still being built, on scraping the entirety of the internet – academic papers, poems, blog posts, Jacobean revenge tragedies, Holocaust survivor memoirs, political speeches, social media posts, fantasy novels, opinion pieces; something approaching the entire archive of the written world – and repurposing it as training data.


In no case has it compensated, or even attempted to compensate, any of the authors of the texts it has appropriated.

OpenAI’s defence of this practice, reflecting its chief executive’s combination of nerd-grandiosity and carnival hucksterism, basically rests on two claims.

First, that it needs to scrape and repurpose all available written texts, because that’s the only way it can train its language model and build the most important technology ever conceived of.

Second, that this in any case does not constitute plagiarism, because it’s not regurgitating these texts wholesale, but rather doing its own thing with them. It’s fair use, in other words. In the light of a report published last year in which the plagiarism detection software Copyleaks found that 60 per cent of ChatGPT’s outputs contained straightforward plagiarism, this argument might seem suspect.


But if it works as a defence of OpenAI’s wholesale appropriation of intellectual property, then it should work just as well for a company appropriating the results of that appropriation.

As someone who makes his living from writing, and who otherwise values the written word, I don’t much care whether the machine-learning algorithm built on the mass appropriation of original texts that ultimately prevails turns out to be American or Chinese. I don’t care about this in the same way I don’t care whether the accumulated trash in the landfills originated from Amazon or Temu.

But if we’re going to think of it in purely capitalist terms, as a question of market Darwinism, I suppose it would make sense to back a company that can produce a given product for a fraction of the cost its competitor can.