Reddit has announced updates to its robots.txt file, the site's implementation of the Robots Exclusion Protocol, which tells automated web bots which parts of a website they may access. The protocol has traditionally been used to let search engines index site content, but it now faces a new challenge: AI companies scraping content for model training, often without proper attribution.
Alongside the revised robots.txt file, Reddit will rate-limit and block unidentified bots and crawlers. According to multiple reports, these measures apply to any entity that does not comply with Reddit's Public Content Policy or that lacks a formal agreement with the platform. The changes are intended to deter AI companies from using Reddit content to train large language models without permission. Even so, AI crawlers can simply disregard Reddit's directives, as recent incidents have shown.
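The article does not say how Reddit implements its rate limits. As an illustration only, the sketch below uses a token bucket, one common rate-limiting technique, with a stricter budget for unidentified clients. The class name, the `make_bucket` helper, and all numeric limits are assumptions made for this example and do not describe Reddit's systems.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket: each request spends one token; tokens refill over time."""
    capacity: float           # maximum number of tokens the bucket can hold
    refill_per_second: float  # tokens added per second
    tokens: float             # tokens currently available
    last_refill: float        # timestamp (seconds) of the last refill

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def make_bucket(identified: bool, now: float) -> TokenBucket:
    """Hypothetical policy: identified clients get a larger request budget."""
    if identified:
        return TokenBucket(capacity=60.0, refill_per_second=1.0,
                           tokens=60.0, last_refill=now)
    # Unidentified bots: tiny burst and slow refill (made-up numbers).
    return TokenBucket(capacity=2.0, refill_per_second=0.1,
                       tokens=2.0, last_refill=now)
```

In this sketch, an unidentified client can make two requests in a burst and then has to wait roughly ten seconds for each further request, while an identified client gets a far larger allowance.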
Wired recently reported that the AI startup Perplexity continued scraping Reddit content despite being blocked in the robots.txt file. Perplexity's CEO argued that robots.txt is not legally binding, which raises questions about how effective a voluntary protocol can be at regulating AI scraping.
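To see why compliance is voluntary, it helps to look at where the robots.txt check actually runs: inside the crawler itself. The following minimal sketch uses Python's standard-library `urllib.robotparser` to show how a well-behaved bot would consult the rules before fetching a page. The sample rules, the bot names, and the URL are hypothetical and are not taken from Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT Reddit's real file:
# one named partner bot is allowed everywhere, all other bots are disallowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExamplePartnerBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/r/python/"
    for agent in ("ExamplePartnerBot", "UnknownCrawler"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, page))
```

Nothing in this flow stops a crawler from skipping the `is_allowed` call and requesting the page anyway, which is why a site that wants real enforcement has to pair robots.txt with server-side measures such as rate limits and blocks.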
Reddit's updates will exempt authorised partners such as Google, with which Reddit has a substantial agreement allowing AI model training on its data. The move signals that Reddit intends to control access to its content for AI training, and to require compliance with its policies in order to safeguard user interests.
These developments are consistent with Reddit's recent policy updates and underscore its efforts to manage and regulate how commercial entities and partners access and use its data.