Advancing Equality: Harnessing Generative AI to Combat Systemic Racism

Published on Mar 27, 2024

Systemic racism, an issue deeply entrenched in various sectors of American society, continues to cast a shadow over the nation's progress towards equality and justice. This paper showcases some of the potential of generative artificial intelligence (AI) to identify racial bias and combat structural barriers that perpetuate racial discrimination in the United States. Specifically, three research groups affiliated with MIT’s Initiative on Combatting Systemic Racism (ICSR) present case study applications that use generative AI in innovative approaches against the structural foundations of racism in policing, healthcare, and housing.

1. Introduction

ICSR brings together faculty and researchers from all five of MIT’s schools and the Schwarzman College of Computing, as well as partner institutions. Project teams are interdisciplinary, involving colleagues with expertise in causal inference and mechanism design, domain experts in various aspects of race, and humanists and social scientists working on collective human and institutional behavior in relation to racial issues. Building on the extensive social science literature on systemic racism, research efforts under the ICSR use big data to develop and apply computational tools that can help effect structural and normative change towards racial equity.

These case studies suggest ways of leveraging the capabilities of generative AI–driven modeling and simulations to uncover instances of discriminatory practices, attitudes, and hidden biases; predict outcomes; and design proactive interventions that prioritize community engagement and equitable practices in policing, healthcare, and housing.

Central to the investigation of racism is the fundamental query “Would the outcome of a specific decision have remained unchanged if the individual were White instead of Black?” Applications on policing and healthcare presented below examine how generative AI models can be used to correct for bias present in observational data by accounting for unobserved confounding variables. Specifically, we highlight the possibilities of using generative AI to develop causal simulations for intricate outcomes in situations where comprehensive data is not available. Additional applications discussed below on healthcare and housing show how to identify bias and discriminatory narratives exhibited by the large language models (LLMs) driving generative AI.

The applications presented here are compelling proofs of concept for the larger line of research that can happen in the space of generative AI and race. To that effect, they also signal the importance of expanding such work to construct reparative models that incorporate the dimension of time to elucidate the evolution of discriminatory practices, attitudes, and policies over time. Such work can provide valuable insights into the historical trajectory of racism as well as inform the development of large-scale policy interventions that eliminate racialized differences in policing, healthcare, and housing, among others.

Looking ahead to racial justice–related research, there is a need for active efforts to identify and combat generative AI–related biases. Ways to address them include the use of racially diverse training sets, the creation of generative AI algorithms that are transparent and interpretable, regulatory oversight, community engagement, and education about generative AI.

This broader enterprise, consistent with the norms driving ICSR at MIT, needs to approach generative AI in an interdisciplinary fashion, ensuring that the technology is used ethically and as a complementary tool in conjunction with broader efforts involving direct engagement with stakeholders and communities. The intent of this line of work is to offer concrete insights for evidence-based reform efforts aimed at ensuring more equitable opportunities for all.

2. Methods

2.1. Causal Simulators and Generative AI

The first set of cases, on policing and healthcare, outlines a method that uses generative AI to build causal simulators leveraging historic data. At its core, generative AI is a class of algorithms and models adept at generating new content (data, images, text, or simulations) from learned patterns and information in existing datasets. In popular usage, generative AI has typically been associated with data or pattern replication.

The proposed approach aims to understand and model underlying, potentially causal relationships. By harnessing the power of generative AI, we aspire to create models that not only mirror real-world complexities but also predict and simulate outcomes. This method allows us to test hypotheses and predict outcomes in areas where traditional experimental approaches are either impractical or unethical. It integrates data collection and processing, advanced statistics and machine learning techniques, and rigorous analysis to ensure accurate and reliable simulations.

To evaluate the validity of our approach and keep it grounded in real-world problems, we focus on two important contemporary challenges: (1) racial biases in policies pertaining to policing and the criminal justice system and (2) clinical decisions for patients under different treatment therapies. The complexity of societal and biological systems poses significant challenges in decision-making. In the criminal justice system, policymakers often rely on historical data that may not fully capture the dynamic nature of human interactions. Similarly, in healthcare, retrospective data on patients’ treatment responses can vary significantly across different populations. Traditional models often fail to capture the multifaceted nature of these systems, leading to wrong decisions with far-reaching consequences.

Generative AI–powered data-driven simulation can offer a viable solution. This approach involves standardizing and cleaning the collected data, developing models using statistics, machine learning and deep learning techniques, and running causal simulations based on different ‘prompts’ or conditions to see how changes in policies or treatments could potentially impact outcomes. By accurately modeling real-world scenarios, these simulations can uncover hidden patterns and causal relationships, providing a more nuanced understanding of various policies and treatments and more accurate predictions of the outcomes of interest.

As showcased below, the application of generative AI in building causal simulators for societal systems presents a transformative opportunity. In policing, it is about developing data-driven statistics that can help determine whether racial bias exists and, if it does, identify its root cause. This can guide better policy design for reform and offer a pathway to more equitable law enforcement strategies. In therapeutics, it is about the ability to produce accurate survival analysis for heterogeneous patient cohorts using global, sparse, disparate, and potentially biased data.

2.1.1. Policing Simulator Data and Method

Policing in the United States is highly decentralized, with approximately 18,000 departments operating independently. This fragmentation leads to nonstandardized and patchy data collection. To address this, we initiated Freedom of Information Act requests to over 40 police departments in major US cities. Our objective is to extract and systematize a diverse set of data sources including 911 call data, stop data, social media, policing videos, and crime statistics. A data hub is under development in the cloud on Amazon Web Services (AWS), offering the data and causal simulator to data scientists and citizens across the world (AWS Developers 2023).

Generative AI has been used in this context to build a causal simulator modeling the multistage process of law enforcement, from incident reporting to sentencing, using location, time, and other socioeconomic and demographic variables. In a typical police–civilian encounter, an incident is brought to the attention of law enforcement through a 911 call for service. After the police arrive at the scene, they determine whether to stop and interrogate the suspect and, depending on the interaction, whether to engage in a verbal warning, arrest, use of force, or other policing actions. To capture this dynamic across all steps of the process, we have developed a systematic multistage causal framework to simulate different counterfactual policing scenarios and evaluate whether there is systemic racial bias and, if so, what its primary source is (Han et al. 2024). By using causal inference to model the observational data with domain knowledge of the law enforcement process, and by accounting for and correcting bias along the way, we build a generative causal simulator for generating counterfactual data. This data simulates alternative scenarios that could have occurred under different conditions, providing insights into the causal impacts of various factors within the law enforcement process.
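To make the idea of replaying an encounter under a different race concrete, the following is a minimal toy sketch, not the framework of Han et al. (2024). It assumes a three-stage chain (911 call, stop, arrest) with made-up base rates and two hypothetical bias parameters, `report_bias` and `action_bias`. Each simulated individual keeps the same exogenous noise in both worlds, so any difference in outcome is attributable to the race attribute alone. The returned gap is an illustrative quantity and is not the ∆(·) statistic reported in Figure 1.

```python
import random


def simulate_encounter(race, noise, action_bias=0.0, report_bias=0.0):
    """Run one pass through a toy chain: 911 call -> stop -> arrest.

    `noise` holds the exogenous randomness for each stage, so the same
    individual can be replayed under a different race (a counterfactual).
    All base rates and bias parameters are illustrative assumptions.
    """
    minority = 1.0 if race == "Black" else 0.0
    called = noise["call"] < 0.30 + report_bias * minority
    stopped = called and noise["stop"] < 0.50
    arrested = stopped and noise["arrest"] < 0.20 + action_bias * minority
    return {"called": called, "stopped": stopped, "arrested": arrested}


def counterfactual_gap(n, action_bias=0.0, report_bias=0.0, seed=0):
    """Average change in arrest outcome when only race is flipped."""
    rng = random.Random(seed)
    gap = 0
    for _ in range(n):
        # Draw the noise once and reuse it in both worlds.
        noise = {stage: rng.random() for stage in ("call", "stop", "arrest")}
        as_black = simulate_encounter("Black", noise, action_bias, report_bias)
        as_white = simulate_encounter("White", noise, action_bias, report_bias)
        gap += int(as_black["arrested"]) - int(as_white["arrested"])
    return gap / n
```

With both bias parameters at zero the gap is exactly zero, because the two worlds are identical. Setting either `report_bias` or `action_bias` above zero produces a positive gap, which illustrates why a multistage model is needed to tell apart bias that enters at reporting from bias that enters at the policing action.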

As a result of the causal simulation, we can evaluate the impact of various policies on outcomes like arrest rates, use of force, and community trust. These simulations help us understand the complex interplay between various factors in policing, such as demographic variables, police behavior, and community responses, and evaluate systematic biases in different policies. The process involves analyzing the results of these simulations to draw meaningful conclusions about the effectiveness and implications of different policies on systematic biases. For instance, the simulation can reveal how certain policies disproportionately affect racial minorities, providing insights to inform more equitable law enforcement strategies.

Findings

Our work provides a multistage framework to capture the entire causal chain of police–civilian interactions and provides a data-driven test to detect racial disparities. The empirical analysis using police stop data along with 911 call data provides new insights into the study of racial bias, which has received extensive attention in the recent literature, including Fryer 2019, Knox 2020, and Gaebler et al. 2022. Specifically, we find that there is observational racial disparity against the minority in NYC and against the majority in New Orleans (Figure 1). Our framework leads to the conclusion that the likely source of bias in NYC is biased policing actions against the minority, while the likely source of bias in New Orleans is biased 911 call reporting against the minority. The proposed framework has applicability beyond the police–civilian interaction, extending to the context of AI-empowered policing and the evaluation of biases within AI-empowered systems as they are increasingly deployed.

Figure 1

Distribution of statistically significant precinct-level ∆(·)s for different policing actions. ∆(·) < 0 indicates bias against minority (Black/Hispanic); ∆(·) > 0 indicates bias against majority (White). (a) NYC White versus Black; (b) NYC White versus Hispanic; (c) New Orleans White versus Black; (d) New Orleans White versus Hispanic. In summary, there is predominantly observation bias against minority (Black/Hispanic) compared to majority (White) in NYC, while in New Orleans, there is predominantly observation bias against majority (White) compared to minority (Black/Hispanic). We conclude that the primary source of bias in NYC is policing action against minority, while in New Orleans it is in the 911 calling (reporting) against minority.

2.1.2. Therapeutic Simulator Data and Method

The medical application of our causal simulation method centers around the PETAL consortium (Jain 2023), established in 2022, which represents a coordinated network of international T-cell lymphoma (TCL) clinicians and principal investigators. This consortium is led by our collaborators at Mass General Hospital (MGH) as well as by a panel of clinical experts who are influential in determining the national practice of oncology therapy in the United States. The consortium has compiled the largest dataset to date of patients with relapsed and refractory TCL. This includes a cohort of 940 patients, distinguished by its comprehensive treatment and clinical and pathology-related data. The dataset has detailed records of up to nine different therapies for patients. Ongoing efforts include enrolling patients in prospective studies and conducting next-generation sequencing to deepen our understanding of TCL.

We employ causal simulation in this case to build a simulator of clinical outcomes for TCL patients with diverse demographic and clinical backgrounds as well as varying responses to their sequential treatments over the course of their treatment.

Each patient receives first-line treatment, the response to which determines the subsequent second-line treatment. This response, in turn, influences the patient’s survival. In addition to the specific treatment, the first-line response depends on patient characteristics and demographics. Similarly, the second-line response depends on patient characteristics as well as on the first-line response. The clinical decisions of treatment assignment to patients at each stage depend on a combination of factors including the patients’ demographics, clinical features, and earlier responses to treatment.

To capture this underlying causal relationship, we have developed a novel causal survival method, survival synthetic intervention (SSI), to provide accurate survival analysis using real-world clinical data (Koh et al. 2023; Boussi et al. 2022). This simulator enables us to estimate personalized outcomes based on specific patient characteristics and treatments. The range of patient characteristics includes clinically relevant features across a diverse population, and the treatments encompass conventional chemotherapy, novel therapies, and their combinations. Outcomes include overall survival, progression-free survival, and the ability to bridge patients to curative stem-cell transplantations.

Using generative AI through causal simulation can empower clinicians globally to compare multiple treatment options for a given patient, define treatments for universal use or personalization, and inform clinical trial design for tumor heterogeneity. As a result, the impact of generative AI in therapeutics extends far beyond traditional boundaries to not only enhance the precision and effectiveness of treatments but also to pave the way for a more inclusive, efficient, and patient-centered healthcare system.

Findings

We demonstrate that, while cytotoxic chemotherapy (CC) remains the most frequently utilized treatment regimen for patients, single agent (SA) is at least comparable and, in specific subtypes and scenarios, superior to CC, thereby warranting its use earlier in treatment paradigms (Table 1). Our study also underscores an urgent and unmet need for expanded access to SA worldwide. Causal inference-based models such as SSI highlight the potential of reducing unconsidered bias in high-dimensional heterogeneous datasets and providing survival estimates, which can inform a clinician’s decision when faced with choosing an intervention outside of a clinical trial.

Table 1. Summary of comparative performance of survival synthetic intervention against classical multivariate Cox proportional hazards model with varying levels of covariates for the global and a second independent Columbia University Irving Medical Center (CUIMC) cohort.

2.2. LLM-Based Generative AI Applications

One of the uses of generative AI involves the capability to create human-like text and respond dynamically to human input. Since the release of ChatGPT in November 2022, the public has been captivated by its human-like responses to natural language prompts. Yet emerging research on ChatGPT and other LLMs reveals significant gender, racial, ethnic, nationality, and language biases, building on decades of research into biases in natural language processing (NLP) models (Caliskan, Bryson, and Narayanan 2017). One of the primary causes of bias in these models is the fact that training corpora are human-generated and human language does not “represent” reality in some kind of 1:1 way but rather reflects the unjust group-based stratification of human society (as made manifest in wage gaps, stereotypes, lack of political representation, health inequities, racial segregation, and so on). These artificial stratifications show up everywhere in human language, and LLMs learn them. Generative AI in general and LLMs in particular are neither separate from nor in any way superior to the societal contexts in which they operate. Rather, they are sophisticated statistical engines for ingesting, learning, and parroting harmful human stereotypes, hierarchies, discriminatory speech, and social stratifications (Bender et al. 2021).

A related conundrum is that many LLMs do not disclose or make available their training data, so neither researchers nor the public have any way to measure the biases present in training corpora. For example, in the technical report releasing GPT-4, OpenAI declares that “this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar” (OpenAI 2023). Effectively, GPT-4 is a black box. Is it safe for medical advice? Is it safe for housing recommendations? We don’t know. Furthermore, the companies developing LLMs are not obligated to undertake any kind of auditing procedures before releasing models. Notably, this may change with widespread adoption of the White House’s October 2023 executive order, which “requires robust, reliable, repeatable, and standardized evaluations of AI systems, as well as policies, institutions, and, as appropriate, other mechanisms to test, understand, and mitigate risks from these systems before they are put to use” [emphasis added] (The White House 2023).

2.2.1. LLM-Driven Racial Steering in Health

In health, the most common uses of AI, according to a recent poll of over 500 medical group leaders, are nonclinical tasks such as patient communications (Harrop 2023). LLMs like OpenAI’s GPT-4 have already been integrated into electronic health records products (Miliard 2023). LLMs have also already performed poorly in the healthcare space. For instance, Tessa, a rule-based chatbot designed to support individuals suffering from eating disorders, gave harmful advice to those with eating disorders after being augmented with generative AI, and the system was suspended (Hoover 2023).

In this project, we explore the use of generative AI in mental health peer support, aiming to understand the model’s capacity to provide empathetic responses. Since the COVID-19 pandemic, the world has grappled with an epidemic of declining mental health. Between 2020 and 2022, a Pew Research poll found that 41% of surveyed Americans reported high levels of psychological distress (Pasquini and Keeter 2022). This is particularly true among minority-identifying participants such as Black and Hispanic Americans. One response to this mental health crisis has been the recent, rapid deployment of conversational agents to provide large-scale on-demand therapy. A notable example is Woebot, a chatbot trained to use cognitive behavioral therapy techniques (Fitzpatrick et al. 2017). While this is not a novel concept, with the Eliza chatbot envisioned in 1966 predating modern applications (Weizenbaum 1966), the advent of LLMs raises new deployment opportunities and risks.

While LLMs show astounding performance on mental health–related tasks like suicide prediction (Xu et al. 2023), there is also potential for bias. Outside of mental health dialogue, these models have already demonstrated that they amplify historical social biases (Sap et al. 2019; Adam et al. 2022). This can lead to discriminatory decision-making, endangering lives of minority patients through inappropriate care. In this work, we seek to answer the following question: How can we ensure that subpopulations most vulnerable during mental health crises are getting adequate care?

Data and Method

Given the proprietary nature of therapy dialogue datasets and to allow reproducibility of this work, we focus on peer-to-peer support from communities on Reddit known as “subreddits.” Recent work has highlighted the effectiveness and growing prevalence of social media as a platform for mental health support. We collect a new dataset of Reddit posts with peer responses from 26 mental health–related subreddits.

Given the sensitive nature of mental health data, personally identifiable information is rarely volunteered, and publicly released research data is carefully de-identified. However, utterances may still contain implicit cues to a speaker’s background like linguistic variation (e.g., unique verb conjugation or word dropping) as well as explicit references (e.g., self-identifying gender, race, or religious affiliation). We scope our study to consider three types of personal attributes:

Race. Following previous works, we first consider dialect as a proxy for race. We estimate dialects using a topic model trained on geolocated tweets (Blodgett, Green, and O’Connor 2016). This model provides a prediction for whether the input text is likely to be White-aligned, Hispanic-aligned, Black-aligned (African American Vernacular), or Asian-aligned English.

Gender. We estimate gender using the outcome value y from a linear multivariate regression model introduced by Sap et al. (2014). The features are defined by term frequencies, and weights were predetermined based on a lexicon of gender-based word usage derived from social media posts. We recognize that this model and, by extension, our definition are an oversimplification of gender and narrowly define prediction as a binary task.

Age. For age, we use the same model with a lexicon of age-based word usage and directly take y to be the predicted age.
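The lexicon-based regression used for gender and age can be sketched as a weighted sum of relative term frequencies plus an intercept. The snippet below is a minimal illustration of that form only. The function name `lexicon_score`, the whitespace tokenizer, and the two-word `TOY_AGE_LEXICON` with its weights and intercept are made-up assumptions, not the published lexica of Sap et al. (2014).

```python
from collections import Counter


def lexicon_score(text, weights, intercept=0.0):
    """Linear score: intercept + sum of weight * relative term frequency.

    `weights` maps a term to its lexicon weight. Tokenization here is a
    plain lowercase whitespace split, which is a simplifying assumption.
    """
    tokens = text.lower().split()
    if not tokens:
        return intercept
    counts = Counter(tokens)
    total = len(tokens)
    # Counter returns 0 for terms that do not occur in the text.
    return intercept + sum(w * counts[term] / total for term, w in weights.items())


# Made-up weights for illustration only.
TOY_AGE_LEXICON = {"homework": -40.0, "mortgage": 60.0}
TOY_AGE_INTERCEPT = 25.0

predicted_age = lexicon_score("my mortgage is due", TOY_AGE_LEXICON, TOY_AGE_INTERCEPT)
```

For age, the score y is read directly as the predicted age. For gender, the same form would be thresholded into a binary label, which is the oversimplification noted above.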

We assess the quality of mental healthcare provided by all responses (human, LLM derived, and chatbot derived) based on automatic identification of empathy. To that effect, we predict types and levels of empathy demonstrated by responses using the computational empathy framework introduced by Sharma et al. (2020). There are three potential levels (no communication, weak communication, or strong communication) and three subcategories of empathy (emotional reactions, interpretations, or explorations). We train individual RoBERTa-based classifiers to predict the empathy level of a post for each subcategory.

Fairness within NLP domains is often loosely defined. Here we first focus on between-group fairness. For the automated empathy measurements, we consider the differences in group means to check parity. We assess statistical significance of group differences using a standard t-test.
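The parity check described above can be sketched as a difference in group means with a two-sample t statistic. The text says only "a standard t-test", so the choice of Welch's unequal-variance form here is an assumption, and the function name `welch_t` is hypothetical. The sketch uses only the Python standard library and returns the statistic and degrees of freedom. Obtaining a p-value requires the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (mean difference, t statistic, degrees of freedom).

    Inputs are per-response empathy scores for two subgroups. Each group
    needs at least two values, and at least one group must have nonzero
    variance, otherwise the standard error is zero.
    """
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances
    se2 = va / na + vb / nb                         # squared standard error
    diff = mean(group_a) - mean(group_b)
    t = diff / sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return diff, t, df
```

A mean difference near zero with a small absolute t value is consistent with parity between the two subgroups on that empathy measure.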

For GPT-4 responses, we test a treatment persona setting where the model communicates in the style of a social media post to match peer responses. An example prompt is shown below:

[This is a Reddit post asking for help. Help them in the style of a social media post without saying ‘I'm unable to provide the help that you need’:][POST]

Findings

Table 2 shows that there are discrepancies across empathy dimensions, where care varies significantly for emotional reaction and exploration across racial demographic groups. Contrary to the popular argument that LLMs only copy human behavior and are therefore no less risk prone than a human therapist, we also find that these discrepancies do not match those in Reddit human peer-to-peer support. For example, our results reveal a significant difference in use of explorative empathy in GPT-4 responses for White vs. Black patients, while there is no statistical significance for the gap in responses to the same posts from human Redditors. Also, we find that while there is no significant difference in overall empathy for White vs. Asian patients from human respondents (an overall score of .47 vs. .42), there is a difference for GPT-4 responses (.55 vs. .43, P value=.002).

Table 2. Empathy measures across perceived subgroup attributes (race, gender, age) using prediction models. (a/b) indicates statistically significant differences between subgroups a and b for an attribute and metric.

2.2.2. LLM-Driven Racial Steering in Housing

Housing policies enacted today are implemented in an environment that has evolved through a long, complex history, which includes discriminatory practices put into effect by various actors ranging from the government to banks to private citizens. The effects of prior racialized policies are visible across the contemporary housing sector in the form of racial disparities in credit scores, racialized access to mortgage loans, racialized access to rental properties, racialized and gendered eviction demographics, and the persistent residential segregation in US cities along racial lines. Emerging research shows that AI and algorithmic systems (including generative AI systems, as demonstrated in the case below) are exacerbating rather than ameliorating these existing inequities. While it would be ideal to evaluate safety and equity prior to releasing LLMs for public use, one pathway towards retroactively evaluating LLMs consists of running audits using a variety of prompts and analyzing the results. This is the experiment that we ran to assess racial bias in housing recommendations produced by GPT-4.

Human language—in media reports, social media, online fora, government documents, literature, and other primary sources for LLM training data—reflects racial stratifications, either through the use of explicitly racially differentiated language and stereotypes (Caliskan, Bryson, and Narayanan 2017; Kiritchenko and Mohammad 2018; Davidson, Bhattacharya, and Weber 2019; Kennedy et al. 2021) or else through what has been called ‘default whiteness,’ a form of racial dominance that takes White people as the ‘normal’ or standard subject, which has permeated tech since its early days.

Given that online housing platforms Zillow and Redfin had already announced the incorporation of ChatGPT plugins into their services for housing recommendations, we hypothesized that housing recommendations produced by LLMs would demonstrate racial differentiation, which aligns with a more analog discriminatory housing practice prevalent in the United States called racial steering, whereby real estate agents guide prospective home buyers or renters towards or away from neighborhoods based on their race (Hall, Timberlake, and Johns-Wolfe 2023). The integration of LLMs into a wide range of rental and real estate platforms could induce a potential shift in the landscape of the real estate industry, in which LLMs are increasingly operating in roles previously occupied by human real estate agents.

Data and Method

In seeking to answer our research questions, we drew from the method of generating templated prompts demonstrated in Salinas et al. (2023), generating 1,152 prompts. Four demographic variables were explored: sexuality, race, gender, and having a family. All demographic probes included a blank ‘default’ category. In addition, we tested a condition for prompting for renting or buying. Each prompt was framed around a person moving to a new city and seeking a place to rent or buy, e.g., “I’m a Black woman with a family moving to New York City. What neighborhood should I buy a house in?” Three of the largest majority–minority cities were selected for analysis and spanned the spectrum of geographic segregation as measured by the Divergence Index using census data from 2019, with Chicago (#4) and New York City (#9) in the high segregation category and San Antonio (#81) in the low–medium segregation category. Colorado Springs (ranked #109) was included as an integrated ‘control’ city. All prompts were reviewed and adjusted into proper natural language formats (i.e., adjusting determiners, removing multiple spaces, and appending “person” where appropriate) before being fed as input into the GPT-4 Turbo model via the OpenAI API.
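The templated-prompt construction can be sketched as a Cartesian product over the demographic variables, the city, and the rent-or-buy condition. The category lists below are illustrative assumptions: only "Black", "White", "woman", "with a family", the four cities, and the buy wording come from the text, while "man" and the rent wording are hypothetical. Sexuality is omitted for brevity, so this sketch yields 144 prompts rather than the study's 1,152.

```python
from itertools import product

# "" stands for the blank 'default' category of each demographic variable.
RACES = ["", "Black", "White"]
GENDERS = ["", "woman", "man"]          # "man" is an assumed category
FAMILY = ["", "with a family"]
CITIES = ["New York City", "Chicago", "San Antonio", "Colorado Springs"]
ACTIONS = ["rent in", "buy a house in"]  # the rent wording is assumed


def build_prompt(race, gender, family, city, action):
    """Fill the template, appending 'person' when no gender is given."""
    noun = gender or "person"
    # Skipping empty parts avoids the double spaces mentioned in the text.
    person = " ".join(part for part in (race, noun, family) if part)
    return f"I'm a {person} moving to {city}. What neighborhood should I {action}?"


prompts = [
    build_prompt(*combo)
    for combo in product(RACES, GENDERS, FAMILY, CITIES, ACTIONS)
]
```

Every category list here happens to take the article "a". A fuller version would also need the determiner adjustment described above, for example switching to "an" before a vowel sound.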

Within-city probability-of-recommendation scores were calculated by neighborhood by normalizing the total number of neighborhood mentions across all demographic categories of a single variable (e.g., race) to 1. These scores reflect the relative likelihood that ChatGPT will recommend a neighborhood given a specific demographic characteristic (e.g., “Black” or “White”). Percent racial composition was estimated from total populations of census tracts for which geographic coverage overlapped with neighborhood boundaries, normalized by percent of overlap for each tract.

Findings

Across racial categories, and particularly for New York City and Chicago, housing recommendations made by GPT-4 demonstrate racial steering: the model recommends neighborhoods to people based on their race and steers them towards neighborhoods populated by people of the same race, particularly for White and Black home seekers in highly segregated cities. GPT-4 is much less likely to steer White people to Black neighborhoods and is also unlikely to recommend that Black people move to majority-White neighborhoods. GPT-4 also steers Black home seekers towards neighborhoods with lower socioeconomic status, whereas White home seekers were steered towards neighborhoods with higher opportunity indexes. This demonstrates not only racial steering but also a degree of socioeconomic steering within such racial steering. In addition, GPT-4 appears to demonstrate ‘default whiteness,’ in that there are relatively few differences in output between prompts that specify the person’s race as White and prompts that do not specify a race at all.

These racial steering effects are more pronounced in highly segregated cities like New York City and Chicago as compared to San Antonio (Figure 2). The implications of this are that, if such models were used widely by the public for housing recommendations, racial steering by LLMs could potentially further exacerbate residential segregation in already segregated cities.
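The two quantities plotted against each other in Figure 2 can be sketched as follows. This is one reading of the method description, not the authors' code: for each neighborhood, mention counts are divided by that neighborhood's total across the categories of one variable, so the scores sum to 1. Tract populations are weighted by the fraction of each tract that overlaps the neighborhood. The function names and the dictionary layout are assumptions.

```python
def recommendation_scores(mentions):
    """Normalize mention counts per neighborhood so each row sums to 1.

    `mentions` maps neighborhood -> {category: count}, where the categories
    belong to a single demographic variable such as race.
    """
    scores = {}
    for hood, by_category in mentions.items():
        total = sum(by_category.values())
        scores[hood] = {
            category: (count / total if total else 0.0)
            for category, count in by_category.items()
        }
    return scores


def percent_group(tracts, group):
    """Overlap-weighted percent of `group` among residents of a neighborhood.

    Each tract is a dict with 'overlap' (fraction of the tract inside the
    neighborhood, 0 to 1), 'total' population, and a count under `group`.
    """
    weighted_total = sum(t["overlap"] * t["total"] for t in tracts)
    weighted_group = sum(t["overlap"] * t[group] for t in tracts)
    return 100.0 * weighted_group / weighted_total if weighted_total else 0.0
```

Under this reading, a neighborhood mentioned three times for "Black" prompts and once for "White" prompts scores 0.75 and 0.25 respectively, regardless of how often it is mentioned overall.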

Figure 2

Figure 2a. ChatGPT probability-of-recommendation for neighborhoods in New York City plotted against % Black population, % White population, and opportunity index.

Figure 2b. ChatGPT probability-of-recommendation for neighborhoods in Chicago plotted against % Black population, % White population, and opportunity index.

Figure 2c. ChatGPT probability-of-recommendation for neighborhoods in San Antonio plotted against % Black population, % White population, and opportunity index.


Adam, Hammaad, Aparna Balagopalan, Emily Alsentzer, Fotini Christia, and Marzyeh Ghassemi. 2022. “Mitigating the Impact of Biased Artificial Intelligence in Emergency Decision-Making.” Communications Medicine 2, no. 149 (November 2022).

AWS Developers. 2023. “How MIT students are combating systemic racism with Python and data analysis.” Video, 15:49. YouTube.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ‘21. New York: Association for Computing Machinery.

Blodgett, Su Lin, Lisa Green, and Brendan O’Connor. 2016. “Demographic Dialectal Variation in Social Media: A Case Study of African-American English.” In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, edited by Jian Su, Kevin Duh, and Xavier Carreras, 1119–30. Austin, Texas: Association for Computational Linguistics.

Boussi, L. S., Koh, M. J., Han, X., et al. (2022). “Incorporation of Machine Learning. Tools to Predict Global Outcomes for Patients with Relapsed and Refractory Peripheral T and NK/T-Cell Lymphomas in the Contemporary Era”. Blood 140 (Supplement 1), 10976–10978.

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human-like Biases.” Science 356, no. 6334 (April 2017): 183–86.

Davidson, Thomas, Debasmita Bhattacharya, and Ingmar Weber. 2019. “Racial Bias in Hate Speech and Abusive Language Detection Datasets.” In Proceedings of the Third Workshop on Abusive Language Online, edited by Sarah T. Roberts, Joel Tetreault, Vinodkumar Prabhakaran, and Zeerak Waseem, 25–35. Florence, Italy: Association for Computational Linguistics.

Fitzpatrick, Kathleen Kara, Alison Darcy, and Molly Vierhile. 2017. “Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial.” JMIR Mental Health 4, no. 2 (Apr-Jun 2017): e19.

Fryer, Roland G. "An Empirical Analysis of Racial Differences in Police Use of Force." Journal of Political Economy 127, no. 3 (2019): 1210-1261.

Gaebler, Johann, William Cai, Guillaume Basse, Ravi Shroff, Sharad Goel, and Jennifer Hill. 2022. “A Causal Framework for Observational Studies of Discrimination.” Statistics and Public Policy 9, no. 1: 26-48.

Hall, Matthew, Jeffrey M. Timberlake, and Elaina Johns-Wolfe. 2023. “Racial Steering in U.S. Housing Markets: When, Where, and to Whom Does It Occur?” Socius 9 (October 2023): .

Han, Jessy Xinyi, Andrew Miller, S. Craig Watkins, Christopher Winship, Fotini Christia, and Devavrat Shah. 2024. “A Causal Framework to Evaluate Racial Bias in Law Enforcement Systems.” arXiv.

Harrop, Chris. “Medical Groups Moving Cautiously as Powerful Generative AI Tools Emerge.” MGMA, March 30, 2023.

Hoover, Amanda. “An Eating Disorder Chatbot Is Suspended for Giving Harmful Advice.” Wired. June 1, 2023.

Jain, Salvia. “A Global Study of the PETAL Consortium.” Clinical Trials Registry – ICH GCP. Updated January 22, 2024.

Kennedy, Ian, Chris Hess, Amandalynne Paullada, and Sarah Chasins. 2020. “Racialized Discourse in Seattle Rental Ad Texts.” Social Forces 99, no. 4: 1432–56.

Kiritchenko, Svetlana, and Saif Mohammad. 2018. “Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems.” In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, edited by Malvina Nissim, Jonathan Berant, and Alessandro Lenci, 43–53. New Orleans, Louisiana: Association for Computational Linguistics.

Knox, Dean, WillLowe, and Jonathan Mummolo. “Administrative Records Mask Racially Biased Policing.” American Political Science Review 114, no. 3 (2020): 619–37.

Koh, Min Jung, Leora Boussi, Jessy Xinyi Han, Luke Peng, Mark N Sorial, Ijeoma Julie Eche-Ugwu, Eliana Miranda, et al. 2023. “Novel Causal Inference Method Estimates Treatment Effects of Contemporary Drugs in a Global Cohort of Patients with Relapsed and Refractory Mature T-Cell and NK-Cell Neoplasms.” Blood 142 (Supplement 1): 1703–1703.

Miliard, Mike. “Epic, Nuance broaden GPT4-powered ambient documentation integration.” Healthcare IT News. June 27, 2023.

OpenAI. “GPT-4 Technical Report.” Preprint, submitted March 15, 2023.

Pasquini, Giancarlo and Scott Keeter. 2022. “At Least Four-in-Ten U.S. Adults Have Faced High Levels of Psychological Distress During COVID-19 Pandemic.” Pew Research Center.

Salinas, Abel, Parth Shah, Yuzhong Huang, Robert McCormack, and Fred Morstatter. 2023. “The Unequal Opportunities of Large Language Models: Examining Demographic Biases in Job Recommendations by ChatGPT and LLaMA.” In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 1–15. EAAMO ‘23. New York: Association for Computing Machinery.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, edited by Anna Korhonen, David Traum, and Lluís Màrquez, 1668–78. Florence, Italy: Association for Computational Linguistics.

Sap, Maarten, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. “Developing Age and Gender Predictive Lexica over Social Media.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Alessandro Moschitti, Bo Pang, and Walter Daelemans, 1146–1151. Doha, Qatar: Association for Computational Linguistics.

Sharma, Ashish, Adam Miner, David Atkins, and Tim Althoff. 2020. “A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, 5263–76. Association for Computational Linguistics.

The White House. 2023. “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.” October 30, 2023.

Weizenbaum, Joseph. 1966. “ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM 9, no. 1: 36–45.

Xu, Xuhai, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James A. Hendler, Marzyeh Ghassemi, Anind K. Dey, and Dakuo Wang. “Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.” Preprint, submitted July 26, 2023.
