AI can’t handle the truth when it comes to the law
Mar 11, 2024

New research from Stanford University shows large language models hallucinate frequently when used for legal queries.

Almost one in five lawyers are using AI, according to an American Bar Association survey.

But there are a growing number of legal horror stories involving tools like ChatGPT, because chatbots have a tendency to make stuff up, such as legal precedents from cases that never happened.

Marketplace’s Meghan McCarty Carino spoke with Daniel Ho at Stanford’s Institute for Human-Centered Artificial Intelligence about the group’s recent study on how frequently three of the most popular large language models, from OpenAI, Meta and Google, hallucinate when asked to weigh in on or assist with legal cases.

The following is an edited transcript of their conversation.

Daniel Ho: We designed roughly 200,000 queries that looked at anything from are these two cases in tension? What’s the core legal holding in this case? Or does this case even exist? And what we found across the board is that the rate of hallucinations was disturbingly high. Roughly 58% to 88% of the time, the large language model would give an inappropriate answer.

Meghan McCarty Carino: Another big problem that you noted is what you call contrafactual bias. Can you explain this concept?

Ho: Sure, the notion of contrafactual bias is the idea that large language models will often assume that a factual premise in a question by the user is true, even if it’s flat-out wrong. One example of this comes from SCOTUSblog, which at one point asked, “Why did Justice Ruth Bader Ginsburg dissent in Obergefell?” which was the case that recognized the right to same-sex marriage. She did not dissent in that case; nonetheless, the chat agent gave a very compelling-sounding answer, assuming that that was true. That notion of contrafactual bias is, of course, a particular concern when we start to think about the usage of these kinds of models by individuals who don’t necessarily have deep knowledge in law, and speaks to these questions about whether this technology will ultimately level or exacerbate access to justice issues in the U.S. system.

McCarty Carino: I actually tried that prompt myself with the free version of ChatGPT, which is GPT-3.5. And it did, in fact, tell me that Justice Ginsburg did not write a dissenting opinion in that case. But is that part of the issue, that these models are just constantly being updated, and the outputs are kind of unpredictable?

Ho: That is exactly one of the issues. I think we made sure to design these tests to be run pretty close in time so as to avoid instances where we’re comparing one model at one period of time against another model some months down the road, when we know that there are a lot of live updates being pushed out there. And one of the concerns within the community more broadly has been how we can actually test and evaluate these models when they’re sort of ever-changing.

McCarty Carino: Were there sort of patterns of mistakes, or, you know, types of questions where hallucinations were more common?

Ho: Yeah, one of the advantages of being able to study this in the legal context is that there’s a kind of hierarchical organization of the U.S. courts. So we were able to study the geography and what happens as you go up and down the judicial hierarchy. And the problem of legal hallucination seems to be the most acute precisely if you have the least knowledge of the law to begin with. A lot of this is driven by the volume of training data that is ingested to train a large language model. The hallucination rate is much lower with something like the U.S. Supreme Court than in the district courts, and we even show in the paper that there’s a kind of geographic bias to hallucinations. And what that really means is that for the ordinary litigant who may want to turn to a large language model to seek some advice, you should really approach that with a significant degree of caution.

McCarty Carino: Does it surprise you that this is a domain where it seems like this technology is being adopted so quickly, you know, despite kind of the growing awareness of problems and what feels like really high stakes?

Ho: Well, there is absolutely a lot of promise in this technology, but we need to get around this misconception that it’s an end-to-end solution. The places where this technology can get used effectively is in making more effective the discovery process, is in augmenting how we search for relevant precedent. In other words, all of these really go towards a vision really of AI systems not as replacing lawyers, but as assisting and augmenting what lawyers are able to do. And that’s something that really was echoed by [Supreme Court] Chief Justice Roberts in his report on the judiciary, where he specifically noted the concern about legal hallucinations and noted the risk that AI could potentially dehumanize the law.

McCarty Carino: Are there risks that you see even if these large language models can be corrected to address for this problem of a hallucination? Are there concerns about overrelying on AI in the legal context?

Ho: Absolutely. There’s generally a concern about what folks refer to as automation bias, the tendency by human beings to overrely on output here. And in the same way that some folks might overrely on the top search results in Google, as AI systems start to improve the search in a legal research engine like Lexis and Westlaw, people may start to overrely and not do the kind of diligence to really surface the cases that are most appropriate given the particular facts in a case. And so even if one could solve the hallucination problem, which is a big if, there are still many reasons to be cautious about how to integrate this kind of technology into legal practice.

More on this

One of those legal horror stories we alluded to was of a lawyer who relied on ChatGPT for research to disastrous effect.

Back in May, a lawyer representing a man in a lawsuit against the airline Avianca used ChatGPT to prepare a court filing. It generated a bunch of totally made-up case references, complete with quotes and fabricated docket numbers that the opposing counsel tried to look up but couldn’t find.

In the lawyer’s defense, he did ask the chatbot if it was telling the truth; his queries were entered into the court record as part of the proceedings.

According to reporting in The New York Times, the chatbot responded: “Yes.” The lawyer then asked again, “What is your source?” to which ChatGPT generated a false citation.

“Are the other cases you provided fake?” the lawyer asked again. ChatGPT responded, “No, the other cases I provided are real and can be found in reputable legal databases,” which is probably where lawyers should be looking before using ChatGPT for now.

The team

Daisy Palacios Senior Producer
Daniel Shin Producer
Jesús Alvarado Associate Producer
Rosie Hughes Assistant Producer