Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them?

Published on Mar 27, 2024

Abstract

New AI capabilities are owed in large part to massive, widely sourced, and underdocumented training data collections. Dubious collection practices have spurred crises in data transparency, authenticity, consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy AI systems. In response, AI regulation is emphasizing the need for training data transparency to understand AI model limitations. Based on a large-scale analysis of the AI training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible AI development practices. We explain why existing tools for data authenticity, consent, and documentation alone are unable to solve the core problems facing the AI community, and outline how policymakers, developers, and data creators can facilitate responsible AI development, through universal data provenance standards.

Keywords: Data Provenance, Artificial Intelligence, Authenticity, Consent, Transparency, Copyright1

AI training data remains poorly organized and opaque, and this impedes our understanding of data authenticity, consent, and the harms and biases in AI models. Regulators and researchers have called for new solutions to address obstacles around responsible and accountable AI. Many tools for data authenticity, consent, and provenance are being developed in isolation. We argue that what is needed is a unified framework dedicated to the structured documentation of data properties, which requires action from multiple stakeholders. Data creators should implement tagging and licensing for content while adopting and advocating for standardized annotation practices. AI developers should commit to providing standardized data provenance documentation and actively contribute to and utilize dataset libraries in model training. Lawmakers should set and enforce minimum standards for data provenance for major developers and providers while providing the necessary support and funding for the development of data standards. Researchers should foster norms around data provenance in academic research and collaborate with other stakeholders in developing data provenance standards.

1. The Need for Data Provenance

In the last decade, data from across the web, news, social media, and encyclopedias has become a vital resource in generative AI consumer technologies like GPT-4, Midjourney, and Whisper. These technologies, some with 100M+ weekly users,2 have already begun to catalyze innovation3 and scientific inquiry while starting to affect wide swaths of the economy and many everyday consumers.4 These models are trained on diverse compilations of text, image, and audio data scraped from the web, synthetically generated by other models, or hand curated. The resulting arms race to scrape, secure, and mass produce massive collections of loosely structured data has come with consequences. Current practices are to widely source and bundle data without tracking or vetting their original sources,5 creator intentions,6 copyright and licensing status,7 or even basic composition and properties.8 These dubious practices are creating an ethical, legal, and transparency crisis for both the users and developers of AI.

Poorly understood data has led to or accentuated many real-world problems, including leaks of personally identifiable information (PII),9 generating nonconsensual intimate imagery (NCII) or child sexual abuse material (CSAM),10 creating misinformation or deepfakes,11 proliferating biases or discrimination,12 as well as triggering intellectual property disputes, culminating in lawsuits against major generative AI companies like Stability AI, Midjourney, and OpenAI (see Figure 1). Model developers have no reliable method to retract data from a model after the expensive training process is complete.13 As a result, early choices made in large-scale machine learning projects around training data have long-term consequences, creating a pressing need for resources that allow AI developers to find and fully understand the benefits and risks of using various training data.

Figure 1

Examples of real-world problems caused or accentuated by poorly understood AI training data.

To respond to this urgent need, we highlight the importance of data provenance for creators, developers, and users (Section 2), call for change in current data practices (Section 3), present legal and regulatory incentives (Section 4), describe the elements a unified data provenance framework should have (Section 5), and examine the limitations and trade-offs of the current space of solutions for tracing data authenticity, consent, and provenance (Section 6). We finally outline recommendations to facilitate a standardized framework for tracing critical data features and to jointly address challenges facing informed and responsible AI (Section 7).

2. Who Needs Data Provenance and Why?

2.1 Creators: Protecting Rights and Avoiding Harms

Artists and data creators emphasize data provenance in their pursuit of “the three C’s” of creative rights: compensation, consent, and credit.14 Creators’ work is frequently used to train commercial AI models for creative writing, image, and video generation. As plaintiffs in lawsuits against major AI companies and as part of the “Writer’s Strike,”15 creators have raised concerns about the consent, legality, and ethics of using their data as well as the resulting effects on the creative economy.16 A data transparency and provenance framework can help address these problems: it would afford creators valuable insight into how their work is used in AI, giving them an opportunity to consent to the use of their data, verify proper credit, and seek fair compensation where applicable.

The unsettled legal status of AI training data17 has led to some initial compensation proposals18 over which creators have little control. They are also rarely credited as having contributed training data, despite the occasional artist signature or watermark that slips into generated content.19 This lack of data transparency and legal clarity has led to several lawsuits against leading AI companies20 and to calls from creators and publishers for strong transparency requirements around AI data.21

The prospective benefits to creators, however, are not only about reducing harms. Greater data transparency can help users—including creators—know which model is best suited to their needs. For example, the disclosure of whether and to what extent a model has been trained on a particular language, literary genre, or style of visual art can help artists in these media identify the right AI tools. Such disclosure also assists creators who see benefits in the use of AI technology in artistic practice22 by highlighting likely gaps in model coverage and abilities and by flagging potential misuse. As subject matter experts, creators could also be effective in recommending or providing new data sources.

2.2. AI Developers: Data Transparency Enables Innovation

AI developers have an acute interest in data and its provenance for understanding model performance, shaping model behavior, and anticipating limitations and risks. While model performance generally improves with more data, the quality and diversity of data are also critical factors for reliable performance.23 Model behavior tends to emulate the structure and composition of data at both the pretraining and finetuning stages.24 For these reasons, developers curate specialized pretraining corpora for scientific writing,25 code,26 biomedical content,27 and legal works.28 As such, information about data sources and their properties informs AI model training.

Data transparency can help avoid harmful pitfalls like unexpectedly using biased or sexual content,29 private data,30 copyright infringing data,31 or noncommercially restricted data.32

Lastly, reproducibility and scientific progress more broadly are accelerated by data transparency and structured documentation.33

These potential benefits have motivated several attempts to analyze large data corpora34 and provide tools for corpus analysis.35 The Hugging Face platform has structured documentation for models and datasets36 while the Data Provenance Initiative37 has closely traced data properties, permissions, and lineage. Model developers recognize the need for a standardized approach to catalogue pretraining and finetuning data. Detailed systematic coverage would enable more informed modeling, deeper analysis, and accessibility, and increase the usage of useful datasets that remain underutilized for lack of documentation. At the same time, data transparency provides model developers with information needed to avoid unintentional leakage of synthetic data into the training set.38

2.3 Society: Reducing Risk and Counteracting Biases

Societal risks from AI run the gamut from privacy violations39 and exposure of personally identifying information, to systemic economic impacts and job displacement,40 to bias and discriminatory behavior.41 These risks are fundamentally tied to the data the models are trained on,42 which, together with the context and affordances of their applications, dictate their behavior.

Perhaps the most pressing concerns, and the ones most tightly coupled to data, relate to social bias and inequitable behavior. There are already prominent examples of AI systems acquiring and perpetuating the biases present in their training data,43 especially in facial recognition systems.44 Social pressure can only be applied to companies’ data choices if they are broadly visible and documented. Post hoc attempts to mitigate or train out biases,45 or to retroactively remove contentious data sources, are inherently reactive. A more proactive approach could be beneficial by providing deeper insights into model training data. Recent work on data humanism46 and data feminism47 illuminates key considerations and frameworks to make such data accessible through visualization and transparent data management practices, crucial steps in giving affected communities agency.

3. Growing Recognition of the Data Provenance Problem

Existing norms for tracing AI data provenance have major and increasingly widely acknowledged deficits.48 Popular AI systems fail to disclose even basic information about their training data (e.g., ChatGPT, Bard, Llama 2). The pace of innovation has prompted community calls for more systematic49 and extensive50 data documentation. However, these calls have resulted in uneven adoption and adherence. Documentation issues remain particularly acute for so-called datasets-of-datasets, massive collections of hundreds of datasets where the original provenance information is often neglected or lost due to the lack of standard structures.

As a result, practitioners have called for greater data transparency,51 for data supply chain and ecosystem monitoring,52 for content authenticity verification,53 for detailed provenance tracing on behalf of reproducible, explainable, and trustworthy AI systems,54 and specifically for a standardized database to document trustworthy data.55

Regulators and lawmakers in many countries have also shown interest. A number of bills recently introduced in the US Congress56 have proposed regulatory regimes for AI that would require data transparency. A voluntary code of conduct organized by Canadian authorities57 calls on model developers to “[p]ublish a description of the types of training data” a model uses. UN bodies have recommended international regulations on “data rights” that “[enshrine] transparency.”58

Most importantly, governments in both the United States and European Union have taken significant steps toward data transparency. Both the EU AI Act, proposed in 2021,59 and President Joe Biden's recent executive order on safe AI60 include provisions related to transparency, provenance, and the need to thoroughly understand the input data to AI models (see Section 4.2). Both texts highlight the importance of conveying the limitations of AI models to consumers, and these limitations are difficult to assess without knowledge of data provenance. The EU Act in particular spells out specific requirements for providers of foundation models related to training data provenance.

This clear interest from both the research community and lawmakers motivates our work on unified frameworks for data provenance and transparency. While such standards do not address AI risks directly, they are an essential prerequisite to assessing risks and help foster more responsible AI development.

4. The Legal and Regulatory Dimension of a Data Provenance Standard

What, then, might such a standard framework look like? Before addressing this question in detail (see Section 5), we explore its interaction with legal and regulatory constraints, particularly intellectual property law and recent AI regulations. We also outline how lawmakers can pave the way for a standard that achieves important regulatory objectives.

4.1 Copyright

There are two general ways in which an AI model may violate copyright interests. First, training a model can infringe on the copyrights of those whose works are in the training data or on the copyrights of those who created the training data corpora.61 Second, specific outputs of an AI model may infringe on the copyrights associated with individual works in the training data. Different types of data are used to train AI models, and these likely give rise to different copyright issues.62 AI models sometimes produce outputs that closely resemble items in the pretraining data and thus infringe on the rights of the creators of these works (who rarely consent to their content being used). It is important to underscore that although the use of pretraining data may be protected by fair use,63 this does not mean that specific outputs will not create copyright violations. Meanwhile, instruction tuning, finetuning, and alignment datasets are frequently used in ways not permitted by their license agreements.64 These datasets contain expressive elements created for the sole purpose of training machine learning models, and thus their use for this purpose is unlikely to be covered by fair use.65

For both pretraining and finetuning, a standard data provenance framework can help mitigate legal risks and aid in the enforcement of copyright interests. Copyright infringement hinges on access to protected works66 and thus knowing what datasets were used to train a model and what works are contained in these datasets is critical to assessing copyright issues. As discussed, training data is often mixed and repackaged, which complicates the task of precisely identifying what data was used to train a particular model. A robust framework not only helps creators assert their rights when generated outputs infringe on their copyright but also helps model developers tune their models to avoid this infringement in the first place. Meanwhile, for finetuning and other curated datasets, a data provenance standard can ensure that model developers have access to accurate license information, making it easier to comply with relevant restrictions.

4.2 AI Regulation

Both the EU AI Act and President Biden's recent executive order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence directly and indirectly highlight the need for transparency for AI systems. Both texts require clear communication of the limitations of AI systems to consumers. The AI Act requires the disclosure of relevant information about training, validation, and testing datasets for high-risk AI systems and a summary of copyright protected training data used in foundation models. The technical specifications in the act include specific data provenance information such as how the data was obtained, labeled, and processed. Meanwhile, the executive order encourages regulatory agencies to emphasize transparency requirements for AI models to protect consumers.

4.3 The Role of Lawmakers in Encouraging Responsible AI Practices

This article is a call to action for dataset creators, researchers, and lawmakers. By understanding the nature of AI ecosystems, lawmakers can create incentives that encourage better documentation of new datasets and audits of existing data. While the term “transparency” is often ill-defined in AI regulation, regulators could leverage transparency obligations to encourage model developers to record information about the datasets that have been used to train models. In addition, policymakers could provide funding for research related to data provenance.

Today, there are perverse incentives that prevent many companies from disclosing information about their datasets as doing so may increase the probability of legal action. Legal authorities could consider providing safe harbor to organizations that provide necessary information about their datasets to regulators and the public.

Table 1

Relevant data properties for facilitating authenticity, consent, and informed use of AI data.

  • Source Authenticity: The authenticity of the data (including digital watermarks and embedded authenticity).a
  • Consent & Use Restrictions: What uses or audiences the creators consent to.b
  • Data Types: The data types (text, images, tabular, etc.) and their digital formats.
  • Source Lineage: Links to data sources from which this was assembled or derived.c
  • Generation Methods: How the data was created, by which people, organizations, and/or models.d
  • Temporal Information: When the data was created, collected, released, or edited.e
  • Private Information: If personally identifiable or private information is present.f
  • Legal Considerations: What legal information is attached to the data.
      • Intellectual Property: Copyright, trademark, or patent information attached.g
      • Terms: Terms of use or terms of service associated with accessing the data.h
      • Regulations: Relevant regulations, though they may be jurisdiction-dependent (e.g., the EU AI Act).
  • Characteristics: An extensible set of properties for the data, relevant to different applications.
      • Dimensions: Size of the data in measurable units.
      • Sensitive Content: If offensive, toxic, or graphic content is present.i
      • Quality Metrics: Metrics associated with quality measures of the data.j
  • Metadata Contributors: A version-controlled log of edits and responsible contributors to this metadata.

Note: Each row provides a citation motivating the significance of that data property. A unified data standard would be able to codify these information pieces.

a Rawte, Sheth, and Das, “A Survey”; Westerlund, “The Emergence.”

b Chayka, “Is A.I. Art Stealing”; Epstein et al., “Art.”

c Buolamwini and Gebru, “Gender Shades.”

d Longpre et al., “The Data Provenance Initiative.”

e Kelvin Luu et al., “Time Waits For No One! Analysis and Challenges of Temporal Misalignment,” preprint submitted November 14, 2021, https://arxiv.org/abs/2111.07408; Longpre et al., “A Pretrainer’s Guide.”

f Carlini et al., “Extracting”; Nasr et al., “Scalable Extraction.”

g E.g., Tremblay v. OpenAI, Inc., No. 3:23-cv-03223 (N.D. Cal.).

h Heath, “ByteDance.”

i David, “AI Image”; Internet Watch Foundation, “How AI Is Being Abused.”

j Longpre et al., “A Pretrainer’s Guide.”
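As a rough sketch of how the properties in Table 1 could be codified in a machine-readable record, consider the following. The field names, types, and example values are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative sketch only: the schema below is hypothetical, not an existing standard.
@dataclass
class ProvenanceRecord:
    dataset_id: str
    source_authenticity: Optional[str] = None      # e.g., pointer to a C2PA manifest or watermark
    consent: Optional[str] = None                   # uses or audiences the creators consent to
    data_types: List[str] = field(default_factory=list)      # e.g., ["text", "image"]
    source_lineage: List[str] = field(default_factory=list)  # ids of datasets this was derived from
    generation_method: Optional[str] = None         # human-created, model-generated, scraped, ...
    created: Optional[str] = None                   # ISO 8601 creation/collection date
    contains_pii: Optional[bool] = None
    license: Optional[str] = None                   # e.g., an SPDX identifier
    terms_of_use: Optional[str] = None
    regulations: List[str] = field(default_factory=list)     # e.g., ["EU AI Act"]
    size_bytes: Optional[int] = None
    sensitive_content: Optional[bool] = None
    quality_metrics: Dict[str, float] = field(default_factory=dict)
    metadata_log: List[str] = field(default_factory=list)    # version-controlled edit history

# Example record for a hypothetical derived corpus.
record = ProvenanceRecord(
    dataset_id="example-corpus-v1",
    data_types=["text"],
    license="CC-BY-4.0",
    source_lineage=["parent-web-crawl-2023"],
    generation_method="scraped",
)
print(record.license, record.source_lineage)
```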

5. Elements of a Data Provenance Standard

Model developers, data creators, and the public all require structured transparency into the data, but for different reasons. A standard data provenance framework could address these diverse needs, but existing solutions tend to address different transparency problems in isolation. Instead of proposing a new standard, we highlight how existing standards can be unified to effectively address the range of relevant challenges.

A unified data provenance framework should be:

  • Modality and source agnostic: An effective standard for tracing wide-ranging data provenance should not be limited to certain modalities (e.g., text, images, video, or audio), or sources.

  • Verifiable: The metadata can be verified, or its reliability assessed. Although metadata will inevitably contain errors, editing systems (like the one used by Wikipedia), or provenance confirmations created by data creators help to provide transparency and consistency.

  • Structured: Structured information should be searchable, filterable, and composable, so that automated tools can navigate the data, and the qualities of combined datasets can easily be inferred from merging their structured properties (e.g., combining license types, as sketched after this list).

  • Extensible and adaptable: An extensible framework is adaptable to new types of metadata that emerge over time as well as to various jurisdiction-specific transparency requirements.

  • Symbolically Attributable: Relevant data sources should be attributed, even as datasets are repackaged and compiled. Codifying the lineage of sources allows the resulting properties to be determined by traversing the web of data “ancestors.”
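To make the “Structured” and “Symbolically Attributable” criteria concrete, the minimal sketch below combines license properties when datasets are compiled and resolves a compilation's lineage by walking its declared ancestors. The license hierarchy, catalog structure, and dataset names are illustrative assumptions.

```python
# Sketch: combining license restrictiveness across merged datasets and
# resolving lineage by traversing declared ancestors. Values are illustrative.
LICENSE_RANK = {"commercial": 0, "non-commercial": 1, "academic-only": 2}

def combined_license(licenses):
    """The most restrictive license governs the combined dataset (illustrative rule)."""
    return max(licenses, key=lambda lic: LICENSE_RANK[lic])

# Each entry declares its own license and its immediate ancestors.
CATALOG = {
    "web-crawl": {"license": "commercial", "ancestors": []},
    "qa-annotations": {"license": "non-commercial", "ancestors": []},
    "instruct-mix": {"license": "commercial", "ancestors": ["web-crawl", "qa-annotations"]},
}

def resolve_lineage(name, catalog):
    """Return all transitive ancestors of a dataset."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in catalog[current]["ancestors"]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

ancestors = resolve_lineage("instruct-mix", CATALOG)
effective = combined_license(
    [CATALOG["instruct-mix"]["license"]] + [CATALOG[a]["license"] for a in ancestors]
)
# e.g., {'web-crawl', 'qa-annotations'} non-commercial (set order may vary)
print(ancestors, effective)
```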

6. Existing Solutions

No complete system for data provenance exists; instead, there is a patchwork of solutions to different elements of the problem (see Figure 2). We identify four broad categories: content authenticity techniques that attach themselves to data as it spreads, opt-in and opt-out tools that allow content creators to register how content should be used, data provenance standards that allow dataset creators to document information about datasets, and data provenance libraries that aggregate information on datasets and their content. These interventions imbue unstructured data with more attribution, authenticity, and navigability in machine-interpretable formats, but none is a complete solution to the challenge of data provenance.

Figure 2

Overview of existing data provenance solutions.

Table 2

Comparison of data provenance interventions.

Columns rated for each intervention: Media; Coverage; Verify; Search; Extend; Attribute; Purpose.

  • Content Authenticity (Media: Limited; Coverage: Growing; Purpose: Authenticity)
  • Robots.txt Consent (Media: Webpages; Coverage: Growing; Purpose: Consent)
  • Consent Registration (Media: All; Coverage: Growing; Purpose: Consent)
  • Data Standards (Media: All; Coverage: Growing, uneven; Purpose: Data Properties)
  • Common Crawl (Media: Webpages; Coverage: Wide; Purpose: Data Properties)
  • Hugging Face (Media: Limited; Coverage: Growing, uneven; Purpose: Data Properties)
  • Data Provenance Initiative (Media: Limited; Coverage: Limited; Purpose: Data Properties)

Note: A summary across interventions of the current scope of data coverage, the ability to verify the origins of data or metadata claims, the ability to search for data using metadata, whether the standard is extensible, and the ability to support symbolic attribution. Content authenticity techniques are usually embedded into or alongside the data itself but have a limited scope of application and cannot easily be extended or searched over. Consent opt-in has not yet converged on a single standard and only addresses the consent problem in isolation. Data provenance standards and their associated libraries can provide significantly more information, extensibility, and structure that is machine readable or searchable. The challenge lies in maintaining them and ensuring the metadata is verifiable and accurate.

6.1 Content Authenticity Techniques

The growing concern for manipulated media, disinformation, and deepfakes has spurred methods that attempt to embed provenance information directly alongside or into data. In this way, a downstream user can ascertain the data’s source and authenticity to avoid copyright issues, ensure academic integrity, or support journalism and fact-finding.

Coalition for Content Provenance and Authenticity (C2PA). A prominent example of content authenticity is the C2PA, a partnership between Adobe, Microsoft, and dozens of other corporations to design a specification that “addresses the prevalence of misleading information online through the development of technical standards for certifying the source and history (or provenance) of media content.”67 To this end, verifiable information may be cryptographically embedded into images, videos, audio, and some types of documents in a way that is difficult to remove and that makes tampering evident. More broadly, the International Organization for Standardization (ISO) is finalizing the International Standard Content Code (ISCC), a universal content identifier that transparently fingerprints content across platforms using hashing.
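As a simplified, hypothetical illustration of content fingerprinting, the sketch below records a cryptographic digest of a file so that later alterations can be detected. It is only a stand-in: C2PA embeds signed manifests into the media itself, and the ISCC derives a multi-part, similarity-aware code rather than a plain hash.

```python
import hashlib
from pathlib import Path

def fingerprint(path: str) -> str:
    """Return a SHA-256 digest of the file contents (toy stand-in for C2PA/ISCC)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, recorded: str) -> bool:
    """Check whether a file still matches the fingerprint recorded at collection time."""
    return fingerprint(path) == recorded

# Example: fingerprint a file when it is collected, then re-check before training.
sample = Path("sample.txt")
sample.write_text("original content")
recorded = fingerprint("sample.txt")
print(verify("sample.txt", recorded))   # True; any edit to the file flips this to False
```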

Digital Watermarks for AI. Digital watermarks have historically been used in visual and audio content. For applications of AI, these watermarks are embedded into machine-generated content68 and more recently into machine-generated text.69 While these technologies hold promise, they are vulnerable to removal,70 especially for text media.71

6.2 Opt-In and Opt-Out

Robots.txt. The Robots Exclusion Standard uses a robots.txt file in a website’s directory to indicate to crawlers (e.g., from search engines) which parts of the website the webmaster would like to include in and exclude from search indexing. Though this protocol lacks enforcement mechanisms, major search engines have historically respected it.72 Recent proposals extend this idea to AI data usage, including learners.txt,73 ai.txt,74 or the “noai” tag adopted by artists.75 In response, Google and OpenAI have instantiated their own crawler user agents, such as OpenAI’s “GPTBot,” giving websites an avenue to implement opt-out standards.76 So far, none of these have been widely adopted. While a robots.txt-type implementation signals a website’s preferences as an opt-in/opt-out binary, it fails to provide a more nuanced spectrum of preferences (e.g., permitting only noncommercial, open source models to train on the data) or to convey other useful metadata.
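For example, a site can publish such preferences in its robots.txt, and a compliant crawler can check them with Python's standard robots.txt parser. The directives below follow OpenAI's documented convention of disallowing the GPTBot user agent; the site URL is a placeholder.

```python
from urllib import robotparser

# Example robots.txt a publisher might serve. GPTBot is OpenAI's documented
# crawler user agent; example.com is a placeholder domain.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant AI crawler would check before fetching:
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False: opted out of AI crawling
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True: ordinary indexing still allowed
```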

Consent Registration. Organizations such as Spawning AI are attempting to build infrastructure for the “consent layer” of AI data.77 This involves sourcing opt-in and opt-out information from data creators and compiling it into searchable databases (e.g., the Do Not Train Registry78). This approach obtains consent directly from creators rather than web hosts, but its granularity increases the burden associated with compiling these consent databases.

6.3 Dataset Provenance Standards

Beyond data authenticity and consent, standards for broader data documentation have been proposed to address many other challenges, including data privacy, sensitive content, licensing, source and temporal metadata, and relevance for training.

Datasheets, Statements, and Cards. Datasheets,79 data statements,80 and data cards81 propose documentation standards for AI datasets. Each of these standardizes documentation of AI dataset creators, annotators, language and content composition, innate biases, collection and curation processes, uses, distribution, and maintenance. Though unevenly adopted, these efforts are widely recognized for improving scientific reproducibility and responsible AI.82

Data Nutrition Labels. Based on the FDA's Nutrition Facts label, the Data Nutrition Label83 automates data documentation using a registration form, totaling fifty-five mostly free-text question responses.

Data & Trust Alliance's Data Provenance Standard. The Data & Trust Alliance Data Provenance Standard84 is the product of joint data documentation efforts from nineteen corporations, including IBM, Nike, Mastercard, Walmart, Pfizer, and UPS. Motivated by the absence of an “industry-wide method to gauge trustworthiness of data based on how it was sourced,” their standard assimilates a wide variety of industry documentation needs into a succinct structured format. The standard provides structured documentation analogous to Nutrition Labels, but also provides a way to trace data lineages.

6.4 Data Provenance Libraries

While content authenticity is embedded in the data, data documentation and provenance standards need to be aggregated in libraries for searching, filtering, and machine navigation.

Prior work has formalized and even operationalized data governance standards,85 but only a few efforts have gained traction in guiding AI development.

Common Crawl. For pretraining text data, models rely predominantly on the Common Crawl.86 For instance, 60 percent of GPT-3 training data is from Common Crawl.87 This resource provides URLs, scrape times, and the raw documents; it follows the robots.txt exclusion protocol and even adheres to requests for removal (e.g., the New York Times requested their data be removed from the repository88). This free library of crawled web data has wide adoption, is accurate, and has comprehensive web coverage, but it provides limited metadata.

Hugging Face Datasets. Hugging Face Datasets has become a widely adopted data library for AI.89 It has integrated data cards and Spawning’s data consent information into datasets to encourage documentation and consent filtering. While its coverage of AI data is vast, the documentation is uneven and often incorrect, as this information is loosely crowdsourced.90 Hugging Face data cards also have limited structure and searchability in the current API.
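For instance, a developer could programmatically inspect a dataset's card metadata with the huggingface_hub client before training on it. The snippet below is a sketch: the repo id is a placeholder, and which fields are present (and how accurate they are) depends on what the dataset's maintainers documented.

```python
from huggingface_hub import DatasetCard

# Sketch: inspect crowd-documented metadata before deciding to train on a dataset.
# Replace the placeholder repo id with a real dataset; fields may be missing or
# inaccurate, since cards are loosely crowdsourced.
card = DatasetCard.load("some-org/some-dataset")
meta = card.data.to_dict()

license_info = meta.get("license")
languages = meta.get("language")

if license_info is None:
    print("No license documented; provenance must be verified elsewhere.")
else:
    print(f"Documented license: {license_info}; languages: {languages}")
```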

The Data Provenance Initiative. Identifying uneven and inaccurate provenance data in Hugging Face, the Data Provenance Initiative91 is a joint effort by AI and legal experts to add more comprehensive and accurate structured information around the most popular textual datasets in AI, including their lineage of sources, licenses, creators, and characteristics.92 This provides a richer and more accurate collection of annotations and provenance for searchable, filterable, and standardized datasets. However, this accuracy comes at the cost of expert human labor, limiting its scale compared to Hugging Face or Common Crawl.

6.5 Discussion and Trade-Offs of Interventions

Each category of interventions above targets different problems and comes with clear trade-offs in their benefits and limitations, as illustrated in Table 2 and Figure 3. For instance, content authenticity techniques offer built-in and verifiable provenance. However, they only authenticate the source or veracity of the data, without covering (or being easily extensible to) other important metadata for AI applications like copyright, privacy, bias considerations, characteristics, intended uses, or complex lineage information. See Table 1 for a more comprehensive list of the different metadata that has become important to AI development in recent years. Content authenticity techniques also apply primarily to atomic units of data, like individual images, recordings, or text files, rather than derivations or compilations, which are increasingly common for multimodal AI training.

Figure 3

Comparison of different data provenance solutions.

Proposals to extend robots.txt or register data opt-in/out are designed to facilitate creator consent, but websites must maintain custom directives for each AI company's scraper, and many AI developers may still ignore these guidelines.

On the other end of the spectrum of data richness are standards like Datasheets, Data Nutrition Labels, or D&TA’s Data Provenance Standard, which encode more fine-grained information, but at the expense of accuracy and adoption incentives. For instance, consider the three data libraries discussed in Section 6.4. These three libraries trade off (a) the coverage of data, (b) the depth of provenance documentation, and (c) the accuracy of collected metadata. Common Crawl is accurate with wide coverage, but not detailed. Hugging Face can be inaccurate with varying levels of detail, but it has extensive coverage. The Data Provenance Initiative is highly accurate and detailed but is currently limited in scope.

Clearly, authenticity techniques, data consent mechanisms, and data provenance standards are complementary, and each conveys distinct and important information to AI developers. Unifying these frameworks into a standardized data infrastructure layer holds tremendous promise and is a precondition to solving the many problems of ethical, legal, and sustainable AI.

7. Key Lessons for Data Provenance

Existing data provenance solutions are piecemeal, and without a robust, well-resourced data provenance framework, developers will struggle to accurately identify and evaluate the safety, copyright implications, and relevance of datasets from a dizzying array of possibilities. Data creators will similarly struggle to identify how and where their content is used. Without dataset provenance standards and documentation, creating such a framework will become increasingly difficult and ultimately untenable. While each existing solution provides important insights into the data ecosystem, a robust framework to attach metadata to datasets is needed to track how datasets are mixed, compiled, and used. We advocate for a set of actions that different stakeholders can take to make data authenticity, consent, and provenance more robust to future challenges.

New work should seek to unify, or make interoperable, data infrastructure for authenticity and consent with other important documentation for privacy, legality, and relevance.

Solutions to these problems are being developed in isolation, but trustworthy and responsible AI requires assessing these factors together. Among others, Pistilli et al.93 underscore the “necessity of joint consideration of the ethical, legal, and technical in AI ethics frameworks to be used on a larger scale to govern AI systems.”

For Policymakers: Regulators play a pivotal role in shaping the future of AI through policy and guidelines. A data-centric approach to AI regulation can help identify and mitigate key risks. Policymakers can provide funding for research related to data provenance and a centralized effort to document and build provenance infrastructure. Currently, perverse legal incentives inhibit companies from disclosing information about their data. Regulators should consider legal or legislative incentives for organizations to provide necessary data transparency, as they have done in the Digital Services Act (DSA) for social media platforms, and should require standardized documentation as part of AI transparency obligations. These types of incentives can foster universal and interoperable standards for data authenticity, consent, and provenance.

For AI Developers: AI developers are at the forefront of creating AI models and thus bear significant responsibility in ensuring ethical AI practices. It is crucial for developers to prioritize documentation responsibilities and to make public the provenance of their training data. When there are compelling business reasons for confidentiality, at the very least, they should publish aggregate statistics about the data provenance. This level of transparency is essential for building trust with users and the wider community and for fostering a responsible AI ecosystem.

For Data Creators and Compilers: Creators of training data play a critical role in the AI development process. It is imperative for these creators to meticulously document not only consent criteria, but the provenance of their data, including the sources and processing. Repositories and databases to register this information are already available. Such detailed documentation will significantly aid AI developers in respecting underlying rights and in understanding the nature of the data they use.

For the Research Community: The research community is uniquely positioned to set norms and standards around provenance disclosure. These could include incorporating provenance disclosure as a requirement for research publications, which would complement efforts like reproducibility checklists94 and ultimately help foster scientific progress.

In summary, each stakeholder group holds a piece of the puzzle in the path toward ethical and transparent AI development. By fulfilling their respective roles and collaborating effectively, these stakeholders can collectively drive the AI field toward a more responsible and trustworthy future.

8. Conclusion

We underscore the urgent need for a standardized data provenance framework to address the complex challenges of AI development. The proliferation of AI models, their diverse training data sources, and the associated ethical, legal, and transparency concerns have culminated in a critical need for a comprehensive approach to data documentation.

Several data provenance solutions exist, such as content authenticity techniques, opt-in/opt-out mechanisms, and data documentation standards. However, each of these addresses only specific aspects of a broader issue and functions in isolation. A unified data provenance framework is crucial to establish an ecosystem where data authenticity, consent, privacy, legality, and relevance are holistically considered and managed.

The successful implementation of such a framework requires concerted efforts from all stakeholders in the AI field. This includes creators, who need to be empowered to tag and license their content; developers, who must adopt standards for data provenance and contribute to dataset libraries; and policymakers, who should establish transparency standards and fund the documentation and construction of data libraries.

The key takeaway is the interdependence of transparency solutions: without robust data provenance libraries, it is challenging for developers to find and evaluate datasets comprehensively. Conversely, without standardized documentation and metadata attachment to data as it travels across the web, tracking and downstream use become unfeasible. This effort requires the active participation and commitment of all involved parties to create a sustainable and trustworthy AI ecosystem.

Acknowledgments

We would like to thank Katy Gero, Kevin Klyman, Yacine Jernite, Hanlin Zhang, Kristina Podnar, and Saira Jesani for their generous advice and feedback.

Bibliography

Autor, David, Caroline Chin, Anna M. Salomons, and Bryan Seegmiller. New Frontiers: The Origins and Content of New Work, 1940–2018. NBER Working Paper No. w30389. Cambridge, MA: National Bureau of Economic Research, 2022.

Bandy, Jack, and Nicholas Vincent. “Addressing ‘Documentation Debt’ in Machine Learning Research: A Retrospective Datasheet for Bookcorpus.” Preprint submitted May 11, 2021. https://arxiv.org/abs/2105.05241.

Barr, Alistair, and Kali Hays. “The New York Times Got Its Content Removed from One of the Biggest AI Training Datasets. Here's How It Did It.” Business Insider, November 8, 2023. https://www.businessinsider.com/new-york-times-content-removed-common-crawl-ai-training-dataset-2023-11

Bender, Emily M., and Batya Friedman. “Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science.” Transactions of the Association for Computational Linguistics 6 (2018): 587–604.

Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein et al. “On the Opportunities and Risks of Foundation Models.” Preprint submitted August 16, 2021. https://arxiv.org/abs/2108.07258

Bommasani, Rishi, Sayash Kapoor, Daniel Zhang, Arvind Narayanan, and Percy Liang. “AI Accountability Policy Request for Comment.” Department of Commerce, National Telecommunications and Information Administration. Docket no. 230407-0093, RIN 0660-XC057, June 12, 2023. https://hai.stanford.edu/sites/default/files/2023-06/Reponse-to-NTIAs-.pdf.

Bommasani, Rishi, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. “The Foundation Model Transparency Index.” Preprint submitted October 19, 2023. https://arxiv.org/abs/2310.12941.

Bommasani, Rishi, Dilara Soylu, Thomas I. Liao, Kathleen A. Creel, and Percy Liang. “Ecosystem Graphs: The Social Footprint of Foundation Models.” Preprint submitted March 28, 2023. https://arxiv.org/abs/2303.15772.

Boyd, Karen L. “Datasheets for Datasets Help ML Engineers Notice and Understand Ethical Issues in Training Data.” Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW2 (2021): 1–27.

Brittain, Blake. “Lawsuits Accuse AI Content Creators of Misusing Copyrighted Work.” Reuters, January 17, 2023. https://www.reuters.com/legal/transactional/lawsuits-accuse-ai-content-creators-misusing-copyrighted-work-2023-01-17/.

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (2020): 1877–901.

Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. “Generative AI at Work.” NBER Working Paper No. 31161. Cambridge, MA: National Bureau of Economic Research, 2023. https://doi.org/10.3386/w31161.

Brynjolfsson, Erik, Daniel Rock, and Chad Syverson. “The Productivity J-Curve: How Intangibles Complement General Purpose Technologies.” NBER Working Paper No. 25148. Cambridge, MA: National Bureau of Economic Research, 2018. https://www.nber.org/system/files/working_papers/w25148/w25148.pdf.

Buolamwini, Joy, and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Conference on Fairness, Accountability And Transparency, edited by Sorelle A. Friedler and Christo Wilson, 77–91. Cambridge, MA: PMLR, 2018.

Carlini, Nicholas, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts et al. “Extracting Training Data from Large Language Models.” In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), edited by Michael D. Bailey and Rachel Greenstadt, 2633–50. Berkeley, CA: USENIX Association, 2021.

Cen, Sarah H., Aspen Hopkins, Andrew Ilyas, Aleksander Madry, Isabella Struckman, and Luis Videgaray. “AI Supply Chains (and Why They Matter).” AI Policy Substack, April 3, 2023. https://aipolicy.substack.com/p/supply-chains-2.

Chayka, Kyle. “Is A.I. Art Stealing from Artists?” New Yorker, February 20, 2023. https://www.newyorker.com/culture/infinite-scroll/is-ai-art-stealing-from-artists.

Cheng, Michelle. “How Should Creators Be Compensated for Their Work Training AI Models?” Quartz, October 20, 2023. https://qz.com/how-should-creators-be-compensated-for-their-work-train-1850932454.

Chu, Eric, Jacob Andreas, Stephen Ansolabehere, and Deb Roy. “Language Models Trained on Media Diets Can Predict Public Opinion.” Preprint submitted March 28, 2023. https://arxiv.org/abs/2303.16779.

Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li et al. “Scaling Instruction-Finetuned Language Models.” Preprint submitted October 20, 2022. https://arxiv.org/abs/2210.11416.

Competition and Markets Authority. AI Foundation Models: Initial Report. GOV.UK. September 18, 2023. https://www.gov.uk/government/publications/ai-foundation-models-initial-report.

Congressional Research Service. Generative Artificial Intelligence and Copyright Law. CRS Report No. LSB10922. September 29, 2023. https://crsreports.congress.gov/product/pdf/LSB/LSB10922.

Cooke, Chris. “Creator Groups Call for EU AI Act to Retain Strong Transparency Obligations.” Complete Music Update, November 24, 2023. https://completemusicupdate.com/creator-groups-ai-act/.

Data Nutrition Team. “The Data Nutrition Project.” Accessed December 31, 2023. https://datanutrition.org/.

David, Emilia. “AI Image Training Dataset Found to Include Child Sexual Abuse Imagery.” Verge, December 20, 2023. https://www.theverge.com/2023/12/20/24009418/generative-ai-image-laion-csam-google-stability-stanford.

———. “News Outlets Demand New Rules for AI Training Data.” Verge, August 10, 2023. https://www.theverge.com/2023/8/10/23827316/news-transparency-copyright-generative-ai.

Dell'Acqua, Fabrizio, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer et al. “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.” Harvard Business School Working Paper 24-013. Cambridge, MA: Harvard Business School, 2023.

Desai, Meera, Abigail Jacobs, and Dallas Card. “An Archival Perspective on Pretraining Data.” In Socially Responsible Language Modelling Research. OpenReview, October 23, 2023. https://openreview.net/forum?id=9xhUufywBX

DeviantArt Team. “UPDATE All Deviations Are Opted Out of AI Datasets.” November 11, 2023. Accessed December 31, 2023. https://www.deviantart.com/team/journal/Tell-AI-Datasets-If-They-Can-t-Use-Your-Content-934500371.

D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. Cambridge, MA: MIT Press, 2023.

Ding, Bosheng, Chengwei Qin, Linlin Liu, Yew Ken Chia, Shafiq Joty, Boyang Li, and Lidong Bing. “Is GPT-3 a Good Data Annotator?” Preprint submitted December 20, 2022.  https://arxiv.org/abs/2212.10450.

Dodge, Jesse, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. “Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.” Preprint submitted April 18, 2021. https://arxiv.org/abs/2104.08758.

Elazar, Yanai, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh et al. “What's In My Big Data?” Preprint submitted October 31, 2023. https://arxiv.org/abs/2310.20707.

Epstein, Ziv, Aaron Hertzmann, Laura Herman, Robert Mahari, Morgan R. Frank, Matthew Groh, and Hope Schroeder et al. “Art and the Science of Generative AI.” Science 380, no. 6650 (2023): 1110–11.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang et al. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” Preprint submitted October 31, 2020. https://arxiv.org/abs/2101.00027.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. “Datasheets for Datasets.” Communications of the ACM 64, no. 12 (2021): 86–92.

Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. “ChatGPT outperforms crowd-workers for text-annotation tasks.” Preprint submitted March 27, 2023. https://arxiv.org/abs/2303.15056.

Government of Canada.“Voluntary Code of Conduct on the Responsible Development and Management of Advanced Generative AI Systems.” September 2023. Accessed December 31, 2023. https://ised-isde.canada.ca/site/ised/en/voluntary-code-conduct-responsible-development-and-management-advanced-generative-ai-systems.

Gruetzemacher, Ross, David Paradice, and Kang Bok Lee. “Forecasting Extreme Labor Displacement: A Survey of AI Practitioners.” Technological Forecasting and Social Change 161 (2020): 120323.

Hao, Karen, and Deepa Seetharaman. “Cleaning Up ChatGPT Takes Heavy Toll on Human Workers.” Wall Street Journal, July 24, 2023. https://www.wsj.com/articles/chatgpt-openai-content-abusive-sexually-explicit-harassment-kenya-workers-on-human-workers-cf191483.

Heath, Alex. “ByteDance Is Secretly Using OpenAI's Tech to Build a Competitor.” Verge, December 15, 2023. https://www.theverge.com/2023/12/15/24003151/bytedance-china-openai-microsoft-competitor-llm.

Henderson, Peter, Mark Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky, and Daniel Ho. “Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset.” Advances in Neural Information Processing Systems 35 (2022): 29217–34.

Henderson, Peter, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang. “Foundation Models and Fair Use.” Preprint submitted March 28, 2023. https://arxiv.org/abs/2303.15715.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas et al. “Training Compute-Optimal Large Language Models.” Preprint submitted March 29, 2022. https://arxiv.org/abs/2203.15556.

Horton, John J. “Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?” NBER Working Paper No. 31122. Cambridge, MA: National Bureau of Economic Research, 2023. https://doi.org/10.3386/w31122.

Hsu, Chiou-Ting, and Ja-Ling Wu. “Hidden Digital Watermarks in Images.” IEEE Transactions on Image Processing 8, no. 1 (1999): 58–68.

Internet Watch Foundation. How AI Is Being Abused to Create Child Sexual Abuse Imagery. October 2023. Accessed December 31, 2023. https://www.iwf.org.uk/about-us/why-we-exist/our-research/how-ai-is-being-abused-to-create-child-sexual-abuse-imagery/.

Ippolito, Daphne, and Yun William Yu. “DONOTTRAIN: A Metadata Standard for Indicating Consent for Machine Learning.” In Proceedings of the 40th International Conference on Machine Learning, edited by A. Krause et al. Cambridge, MA: PMLR 202, 2023. https://genlaw.org/CameraReady/42.pdf.

Jernite, Yacine, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan et al. “Data Governance in the Age of Large-Scale Data-Driven Language Technology.” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, edited by Charles Isbell et al., 2206–22. New York: ACM, 2022. https://doi.org/10.1145/3531146.3534637.

Kale, Amruta, Tin Nguyen, Frederick C. Harris Jr, Chenhao Li, Jiyin Zhang, and Xiaogang Ma. “Provenance Documentation to Enable Explainable and Trustworthy AI: A Literature Review.” Data Intelligence 5, no. 1 (2023): 139–62.

Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling Laws for Neural Language Models.” Preprint submitted January 23, 2020, https://arxiv.org/abs/2001.08361.

Kapoor, Sayash, and Arvind Narayanan. “Leakage and the Reproducibility Crisis in ML-Based Science.” Preprint submitted July 14, 2022. https://arxiv.org/abs/2207.07048.

Keller, Paul, and Zuzanna Warso. “Defining Best Practices for Opting Out of ML Training.” Open Future, September 28, 2023. Accessed December 31, 2023. https://openfuture.eu/publication/defining-best-practices-for-opting-out-of-ml-training/.

Khan, Mehtab. “A New AI Lexicon: Open.” AI Now Institute, July 12, 2021. Accessed December 31, 2023. https://ainowinstitute.org/publication/a-new-ai-lexicon-open.

Kirchenbauer, John, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. “A Watermark for Large Language Models.” Preprint submitted January 24, 2023. https://arxiv.org/abs/2301.10226.

Kurita, Keita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. “Measuring Bias in Contextualized Word Representations.” Preprint submitted June 18, 2019. https://arxiv.org/abs/1906.07337.

Lee, Katherine, A. Feder Cooper, and James Grimmelmann. “Talkin’ Bout AI Generation: Copyright and the Generative-AI Supply Chain.” Preprint submitted September 15, 2023. https://arxiv.org/abs/2309.08133.

Lemley, Mark A., and Bryan Casey. “Fair Learning.” Texas Law Review 99, no. 4 (2020): 743.

Lewkowycz, Aitor, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone et al. “Solving Quantitative Reasoning Problems with Language Models.” Advances in Neural Information Processing Systems 35 (2022): 3843–57.

Lhoest, Quentin, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond et al. “Datasets: A Community Library for Natural Language Processing.” Preprint submitted September 7, 2021. https://arxiv.org/abs/2109.02846.

Li, Yujia, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles et al. “Competition-Level Code Generation with Alphacode.” Science 378, no. 6624 (2022): 1092–97.

Lohr, Steve. “Big Companies Find a Way to Identify A.I. Data They Can Trust.” New York Times, November 30, 2023. https://www.nytimes.com/2023/11/30/business/ai-data-standards.html.

Longpre, Shayne, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff et al. “The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.” Preprint submitted October 25, 2023. https://arxiv.org/abs/2310.16787.

Longpre, Shayne, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou et al. “A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity.” Preprint submitted May 22, 2023. https://arxiv.org/abs/2305.13169.

Lupi, Giorgia. “We've Reached Peak Infographics. Are You Ready for What Comes Next?” Print Mag, January 2017, 161. https://www.printmag.com/article/data-humanism-future-of-data-visualization/.

Luu, Kelvin, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A. Smith. “Time Waits For No One! Analysis and Challenges of Temporal Misalignment.” Preprint submitted November 14, 2021. https://arxiv.org/abs/2111.07408.

Mahari, Robert, Shayne Longpre, Lisette Donewald, Alan Polozov, Ari Lipsitz, and Sandy Pentland. “Request for Comments Regarding Artificial Intelligence and Copyright.” U.S. Copyright Office Docket No. 2023-6; COLC-2023-0006, October 30, 2023. https://www.regulations.gov/comment/COLC-2023-0006-9063.

Maiberg, Emanuel. “404 Media Generative AI Market Analysis: People Love to Cum.” 404 Media, September 19, 2023. https://www.404media.co/404-media-generative-ai-sector-analysis-people-love-to-cum/.

Malik, Aisha. “OpenAI’s ChatGPT Now Has 100 Million Weekly Active Users.” TechCrunch, November 6, 2023. https://techcrunch.com/2023/11/06/openais-chatgpt-now-has-100-million-weekly-active-users/.

McElheran, Kristina, J. Frank Li, Erik Brynjolfsson, Zachary Kroff, Emin Dinlersoz, Lucia S. Foster, and Nikolas Zolas. “AI Adoption in America: Who, What, and Where.” NBER Working Paper No. 31788. Cambridge, MA: National Bureau of Economic Research, October 2023. https://doi.org/10.3386/w31788.

Min, Sewon, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, and Luke Zettlemoyer. “SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore.” Preprint submitted August 8, 2023. https://arxiv.org/abs/2308.04430.

Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model Cards for Model Reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, edited by danah boyd and Jamie Morgenstern, 220–29. New York: ACM, 2019.

Mitchell, Margaret, Alexandra Sasha Luccioni, Nathan Lambert, Marissa Gerchick, Angelina McMillan-Major, Ezinwanne Ozoani, Nazneen Rajani, Tristan Thrush, Yacine Jernite, and Douwe Kiela. “Measuring Data.” Preprint submitted December 9, 2022. https://arxiv.org/abs/2212.05129.

Narayanan, Arvind, and Sayash Kapoor. “Generative AI Companies Must Publish Transparency Reports.” Algorithmic Amplification and Society Blog, June 26, 2023. Knight First Amendment Institute at Columbia University. https://knightcolumbia.org/blog/generative-ai-companies-must-publish-transparency-reports.

Nasr, Milad, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. “Scalable Extraction of Training Data from (Production) Language Models.” Preprint submitted November 28, 2023. https://arxiv.org/abs/2311.17035.

Peng, Sida, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.” Preprint submitted February 13, 2023. https://arxiv.org/abs/2302.06590.

Pistilli, Giada, Carlos Muñoz Ferrandis, Yacine Jernite, and Margaret Mitchell. “Stronger Together: On the Articulation of Ethical Charters, Legal Tools, and Technical Documentation in ML.” In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, edited by Sara Fox et al., 343–54. New York: ACM, 2023.

Porciuncula, Lorrayne, and Bertrand De La Chapelle. “Hello Datasphere — Towards a Systems Approach to Data Governance.” Datasphere, February 28, 2022. https://www.thedatasphere.org/news/hello-datasphere-towards-a-systems-approach-to-data-governance/.

Pushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI.” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, edited by Charles Isbell et al., 1776–826. New York: ACM, 2022.

Raji, Inioluwa Deborah, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. “Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, edited by Annette Markham et al., 145–51. New York: ACM, 2020.

Rawte, Vipula, Amit Sheth, and Amitava Das. “A Survey of Hallucination in Large Foundation Models.” Preprint submitted September 12, 2023. https://arxiv.org/abs/2309.05922.

Rogers, Anna. “Changing the World by Changing the Data.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, 2182–94. Kerrville, TX: Association for Computational Linguistics, 2021.

Rogers, Anna, Tim Baldwin, and Kobi Leins. “‘Just What do You Think You're Doing, Dave?’ A Checklist for Responsible Data Use in NLP.” Preprint submitted September 14, 2021. https://arxiv.org/abs/2109.06598.

Rosenthol, Leonard. “C2PA: The World’s First Industry Standard for Content Provenance.” In Proceedings vol. 12226, Applications of Digital Image Processing XLV, 122260P. Bellingham, WA: SPIE, 2022.

Sadasivan, Vinu Sankar, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. “Can AI-Generated Text Be Reliably Detected?” Preprint submitted March 17, 2023. https://arxiv.org/abs/2303.11156.

Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI.” In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, edited by Yoshifumi Kitamura et al., 39, 1–15. New York: ACM, 2021.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. “The Curse of Recursion: Training on Generated Data Makes Models Forget.” Preprint submitted May 27, 2023. https://arxiv.org/abs/2305.17493.

Singhal, Karan, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark et al. “Towards Expert-Level Medical Question Answering with Large Language Models.” Preprint submitted May 16, 2023. https://arxiv.org/abs/2305.09617.

Small, Zachary. “Sarah Silverman Sues OpenAI and Meta Over Copyright Infringement.” New York Times, July 10, 2023. https://www.nytimes.com/2023/07/10/arts/sarah-silverman-lawsuit-openai-meta.html.

Smee, Sebastian. “AI Is No Threat to Traditional Artists. But It Is Thrilling.” Washington Post, February 15, 2023. https://www.washingtonpost.com/arts-entertainment/2023/02/15/ai-in-art/.

South, Tobin, Robert Mahari, and Alex Pentland. “Transparency by Design for Large Language Models.” Network Law Review, Computational Legal Futures, May 25, 2023. https://www.networklawreview.org/computational-three/.

Steed, Ryan, and Aylin Caliskan. “Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, edited by Madeleine Clare Elish, William Isaac, and Richard Zemel, 701–13. New York: ACM, 2021.

Tremblay v. OpenAI, Inc., No. 3:23-cv-03223 (N.D. Cal.).

United Nations. A Global Digital Compact — An Open, Free and Secure Digital Future for All. May 2023. Our Common Agenda Policy Brief #5. https://indonesia.un.org/sites/default/files/2023-07/our-common-agenda-policy-brief-gobal-digi-compact-en.pdf.

Veselovsky, Veniamin, Manoel Horta Ribeiro, and Robert West. “Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks.” Preprint submitted June 13, 2023. https://arxiv.org/abs/2306.07899.

Villaronga, Eduard Fosch, Peter Kieseberg, and Tiffany Li. “Humans Forget, Machines Remember: Artificial Intelligence and the Right to Be Forgotten.” Computer Law & Security Review 34, no. 2 (2018): 304–13.

Vincent, James. “Getty Images Sues AI Art Generator Stable Diffusion in the US for Copyright Infringement.” Verge, February 6, 2023. https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion.

———. “Getty Images Is Suing the Creators of AI Art Tool Stable Diffusion for Scraping Its Content.” Verge, January 17, 2023. https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit.

———. “The Lawsuit That Could Rewrite the Rules of AI Copyright.” Verge, November 8, 2022. https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data.

Vipra, Jai, and Anton Korinek. “Market Concentration Implications of Foundation Models: The Invisible Hand of ChatGPT.” Brookings, September 7, 2023. https://www.brookings.edu/articles/market-concentration-implications-of-foundation-models-the-invisible-hand-of-chatgpt/.

Werder, Karl, Balasubramaniam Ramesh, and Rongen Zhang. “Establishing Data Provenance for Responsible Artificial Intelligence Systems.” ACM Transactions on Management Information Systems (TMIS) 13, no. 2 (2022): 1–23.

Westerlund, Mika. “The Emergence of Deepfake Technology: A Review.” Technology Innovation Management Review 9, no. 11 (2019): 39–52.

Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac et al. “HuggingFace’s Transformers: State-of-the-art Natural Language Processing.” Preprint submitted October 9, 2019. https://arxiv.org/abs/1910.03771.

Writers Guild of America. “WGA Negotiations—Status as of May 1, 2023.” May 1, 2023. https://www.wga.org/uploadedfiles/members/member_info/contract-2023/WGA_proposals.pdf.

Xi, Zhiheng, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang et al. “The Rise and Potential of Large Language Model Based Agents: A Survey.” Preprint submitted September 14, 2023. https://arxiv.org/abs/2309.07864.

Zhang, Brian Hu, Blake Lemoine, and Margaret Mitchell. “Mitigating Unwanted Biases with Adversarial Learning.” In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, edited by Jason Furman et al., 335–40. New York: ACM, 2018.

Zhang, Hanlin, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. “Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models.” Preprint submitted November 7, 2023. https://arxiv.org/abs/2311.04378.

Ziems, Caleb, Omar Shaikh, Zhehao Zhang, William Held, Jiaao Chen, and Diyi Yang. “Can Large Language Models Transform Computational Social Science?” Computational Linguistics (2023): 1–53.
