Consent in Crisis: The Rapid Decline of the AI Data Commons

Published in Neurips 2024

Recommended citation: @article{Longpre2024ConsentIC, title={Consent in Crisis: The Rapid Decline of the AI Data Commons}, author={Shayne Longpre and Robert Mahari and Ariel Lee and Campbell Lund and Hamidah Oderinwale and William Brannon and Nayan Saxena and Naana Obeng-Marnu and Tobin South and Cole Hunter and Kevin Klyman and Christopher Klamm and Hailey Schoelkopf and Nikhil Singh and Manuel Cherep and Ahmad Anis and An Dinh and Caroline Chitongo and Da Yin and Damien Sileo and Deividas Mataciunas and Diganta Misra and Emad Alghamdi and Enrico Shippole and Jianguo Zhang and Joanna Materzynska and Kun Qian and Kush Tiwary and Lester James Validad Miranda and Manan Dey and Minnie Liang and Mohammed Hamdy and Niklas Muennighoff and Seonghyeon Ye and Seungone Kim and Shrestha Mohanty and Vipul Gupta and Vivek Sharma and Vu Minh Chien and Xuhui Zhou and Yizhi Li and Caiming Xiong and Luis Villa and Stella Biderman and Hanlin Li and Daphne Ippolito and Sara Hooker and Jad Kabbara and Sandy Pentland}, journal={ArXiv}, year={2024}, volume={abs/2407.14933}, url={https://api.semanticscholar.org/CorpusID:271328989} } https://arxiv.org/abs/2407.14933

Download paper here

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites’ expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

Recommended citation: @article{Longpre2024ConsentIC, title={Consent in Crisis: The Rapid Decline of the AI Data Commons}, author={Shayne Longpre and Robert Mahari and Ariel Lee and Campbell Lund and Hamidah Oderinwale and William Brannon and Nayan Saxena and Naana Obeng-Marnu and Tobin South and Cole Hunter and Kevin Klyman and Christopher Klamm and Hailey Schoelkopf and Nikhil Singh and Manuel Cherep and Ahmad Anis and An Dinh and Caroline Chitongo and Da Yin and Damien Sileo and Deividas Mataciunas and Diganta Misra and Emad Alghamdi and Enrico Shippole and Jianguo Zhang and Joanna Materzynska and Kun Qian and Kush Tiwary and Lester James Validad Miranda and Manan Dey and Minnie Liang and Mohammed Hamdy and Niklas Muennighoff and Seonghyeon Ye and Seungone Kim and Shrestha Mohanty and Vipul Gupta and Vivek Sharma and Vu Minh Chien and Xuhui Zhou and Yizhi Li and Caiming Xiong and Luis Villa and Stella Biderman and Hanlin Li and Daphne Ippolito and Sara Hooker and Jad Kabbara and Sandy Pentland}, journal={ArXiv}, year={2024}, volume={abs/2407.14933}, url={https://api.semanticscholar.org/CorpusID:271328989} }