As Artificial Intelligence (AI) transforms the digital landscape, reshaping it with generative AI models and increasingly complex algorithms, it has also brought a little-known market to the forefront of the public consciousness – data.
AI models, especially large language models such as OpenAI’s ChatGPT, require massive amounts of data to train.
That data doesn’t come cheap – the data analytics market was valued at US$41.05 billion in 2022 and is expected to grow at an eye-watering compound annual growth rate of 27.3% to hit US$279.21 billion by 2030.
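For readers who want to check the growth arithmetic, here is a minimal sketch of the compound growth calculation. It uses only the figures quoted above; the variable names are illustrative, and the eight-year span (2022 to 2030) is an assumption about how the forecast was compounded.

```python
# Figures quoted in the article, in US$ billions.
start_value = 41.05        # 2022 valuation
quoted_end_value = 279.21  # forecast 2030 valuation
quoted_cagr = 0.273        # quoted compound annual growth rate

# Assumption: growth compounds once a year from 2022 to 2030.
years = 2030 - 2022

# Compound growth: end = start * (1 + rate) ** years
projected_end_value = start_value * (1 + quoted_cagr) ** years

# Rate implied by the two quoted valuations over the same span.
implied_cagr = (quoted_end_value / start_value) ** (1 / years) - 1

print(f"Projected 2030 value at {quoted_cagr:.1%}: US${projected_end_value:.2f}bn")
print(f"CAGR implied by the quoted valuations: {implied_cagr:.2%}")
```

Under these assumptions the projection lands at roughly US$283 billion and the implied rate at just over 27%, both close to the quoted figures. The small gap may come from rounding or from a different base year in the original forecast.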
Now, a former employee and researcher for OpenAI, Suchir Balaji, has raised the alarm over the company’s data collection practices, claiming OpenAI is “destroying” the internet and directly violating copyright law.
Gen AI researcher speaks out
Balaji was employed at OpenAI from 2020 until August this year – his LinkedIn page states he was working on post-training for ChatGPT, reasoning algorithms, pre-training for ChatGPT and reinforcement learning for the web version of ChatGPT.
He was part of the team organising and leveraging the huge reams of data the company used to build its GenAI bot.
After ChatGPT’s release to market in 2022, he began to consider the implications of what OpenAI was doing.
In August this year, he chose to leave the company because of ethical concerns about the way the AI pioneer was collecting and using data.
“If you believe what I believe, you have to just leave the company,” he said during a recent series of interviews with The New York Times.
Is GenAI destroying the internet?
Recently, Balaji published a post on his own website explaining the damage ChatGPT and similar GenAI models are already doing to the internet.
Programming in particular is suffering, with many open-source platforms losing participants at staggering rates as individuals turn to AI, rather than to their peers, to answer questions.
Balaji is a published AI researcher – he has three papers on various elements of AI models, with more than 8,000 citations.
In the post, titled “When does generative AI qualify for fair use?”, he argues that GenAI is not truly transformative, as fair use law requires, since it simply alters the form and structure of content.
Balaji also argues GenAI content threatens to replace the very market it feeds from – should GenAI replace elements of content creation altogether, it will very quickly lose the ability to train on new data.
A lack of good data causes a range of problems in LLMs, leading to what researchers call ‘hallucinations’: essentially, the model begins to make things up and churn out nonsense outputs.
Balaji argues none of the fair use factors clearly favour ChatGPT, or any other GenAI model for that matter, especially in light of the potential economic harm they could cause.
“This is not a sustainable model for the internet ecosystem as a whole,” he told The New York Times.
Issue before the courts
OpenAI has categorically disagreed with Balaji’s assessment.
“We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents,” the company said in a statement.
“We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”
The truth of that will be revealed in court – The New York Times has sued OpenAI and Microsoft (NASDAQ:MSFT) for copyright infringement, and it's not the only one to go into bat against the GenAI company.
“Defendants seek to free-ride on The Times’s massive investment in its journalism,” the complaint says, accusing OpenAI and Microsoft of “using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”
As of April this year, eight newspapers, the Center for Investigative Reporting and a slew of YouTube creators, actors and authors are all actively suing OpenAI for copyright infringement.
Intellectual property lawyer Bradley J. Hulbert told The New York Times that intellectual property laws are woefully out of date, and the issue has yet to be decided definitively in court.
“Given that AI is evolving so quickly,” he said, “it is time for Congress to step in.”