The surge in AI, led by systems such as ChatGPT, could soon slow as the supply of publicly available text data for training nears exhaustion, according to a study by Epoch AI. The study projects the shortfall will arrive between 2026 and 2032, posing a critical challenge to sustaining the field's rapid advance.
AI's growth has relied on vast amounts of human-generated text, a finite resource that is now being drawn down. Companies such as OpenAI and Google are paying for access to high-quality sources, including content from Reddit and news outlets, to keep training their models. As fresh data grows scarce, however, they may be pushed towards sensitive private data or less reliable synthetic data.
The Epoch AI study emphasises that scaling up AI models, which demands immense computing power and ever-larger data sets, may become infeasible as data sources dwindle; a rough sketch of the arithmetic appears below. Newer techniques have made better use of existing data and eased the pressure somewhat, but the fundamental need for high-quality human-generated text remains. Some experts suggest that building specialised models, rather than ever-larger general ones, could sidestep the bottleneck.
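To make the scale of the problem concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the Epoch AI study itself: it assumes the widely cited "Chinchilla" rule of thumb of roughly 20 training tokens per model parameter, and an illustrative figure of about 300 trillion tokens for the total stock of public human-written text. Both numbers are assumptions chosen for illustration.

```python
# Back-of-the-envelope: how quickly does compute-optimal training
# consume the public text stock? All figures are illustrative
# assumptions, not numbers from the Epoch AI study.

CHINCHILLA_TOKENS_PER_PARAM = 20     # assumed rule-of-thumb ratio
PUBLIC_TEXT_STOCK_TOKENS = 300e12    # assumed ~300T-token public stock


def tokens_needed(params: float) -> float:
    """Training tokens for a compute-optimal model with `params` parameters."""
    return params * CHINCHILLA_TOKENS_PER_PARAM


# Hypothetical model sizes, from today's large models upwards.
for params in (70e9, 400e9, 2e12, 15e12):
    need = tokens_needed(params)
    share = need / PUBLIC_TEXT_STOCK_TOKENS
    print(f"{params / 1e9:>8.0f}B params -> {need / 1e12:6.1f}T tokens "
          f"({share:.1%} of the assumed public stock)")
```

Under these assumptions, a hypothetical 15-trillion-parameter model alone would need the entire assumed stock, which is the intuition behind the study's projected crunch.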
In response, AI developers are exploring alternatives, including generating synthetic data to train on. Persistent concerns about the quality and efficiency of such data, however, underline how difficult it may be to sustain AI's advance on a finite supply of human-written text.