Steven Schwartz had practiced law for thirty years. He trusted ChatGPT like a junior associate.
That trust cost him $5,000 in fines and a formal sanction from a federal judge who called the situation “unprecedented.” Schwartz had cited six legal cases in a court filing against Avianca Airlines. None of them existed. ChatGPT had invented case names, citations, page numbers, and judicial quotes that sounded entirely plausible but referred to proceedings that never happened in any court anywhere.
When confronted, Schwartz didn’t understand what had gone wrong. “It just never occurred to me that it would be making up cases,” he testified. He had asked ChatGPT to verify its own work, and the system confidently confirmed that yes, these cases were real. Of course it did. That’s what statistically probable text generation produces when you ask it to validate statistically probable text generation.
The Avianca case marked a turning point. Not because AI hallucination was new. Because a professional staked his career on output he never verified.
The Anatomy of AI Failure
AI systems fail in patterns. Understanding these patterns matters more than cataloging individual disasters because the same failure modes keep appearing in different contexts with different names attached.
Pattern One: The Confidence Problem
AI systems present information with uniform certainty. They don’t hedge. They don’t express doubt. They don’t distinguish between facts they’re highly confident about and fabrications they’ve generated because the training data contained something vaguely similar.
When a Hacker News user analyzed the Schwartz case, they identified the core issue: “It was given a sequence of words and tasked with producing a subsequent sequence of words that satisfy with high probability the constraints of the model.” The system excels at sounding authoritative. It has no mechanism for being authoritative.
This is why the double-check problem exists. When Schwartz asked ChatGPT to verify that the cases were real, the AI responded exactly as a helpful assistant would, because that is what its training rewards when a human asks a follow-up question. No verification happened behind the scenes; the confirmation was just more probable-sounding text.
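To make the mechanism concrete, here is a deliberately tiny sketch of what “statistically probable text generation” means. The prompt, the candidate continuations, the probabilities, and the case name are all invented for illustration; a real model works over tokens and billions of parameters, but the interface is the same: context in, probability distribution out, no lookup against any ground truth.

```python
import random

# Toy stand-in for a language model: a lookup table of next-response
# probabilities. Every number and name here is made up for illustration;
# "Smith v. Example Airlines" is a hypothetical case, not a real citation.
NEXT_RESPONSE_PROBS = {
    "Is Smith v. Example Airlines a real case?": {
        "Yes, it is a real case.": 0.62,        # sounds most like a helpful answer
        "I could not find that case.": 0.31,
        "No, that case does not exist.": 0.07,
    },
}

def generate(prompt: str) -> str:
    """Sample a continuation weighted by probability.

    Nothing in this function consults a court database or any other source
    of truth. "Verification" is just another continuation that scores well.
    """
    dist = NEXT_RESPONSE_PROBS[prompt]
    continuations = list(dist)
    weights = [dist[c] for c in continuations]
    return random.choices(continuations, weights=weights, k=1)[0]

print(generate("Is Smith v. Example Airlines a real case?"))
# Most runs: "Yes, it is a real case." -- confident, plausible, unchecked.
```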
Pattern Two: The Training Data Trap
In 2018, Amazon killed an AI recruiting tool it had spent years developing. The system learned to systematically downgrade resumes from women.
The algorithm wasn’t programmed to discriminate. It was trained on a decade of Amazon’s own hiring decisions, which reflected the demographics of an industry where software engineering roles skewed heavily male. The AI learned that successful candidates at Amazon looked a certain way, wrote their resumes a certain way, and attended certain schools.
Specifically, the tool penalized resumes containing the word “women’s” or the names of all-women colleges. It favored verbs that men tend to use more often on resumes, like “executed” and “captured.” The ACLU’s analysis was blunt: “These tools are not eliminating human bias. They are merely laundering it through software.”
Amazon tried to adjust the system, editing it to treat those terms neutrally. But the company lost confidence that it could ever be reliably fair and scrapped the entire project.
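The mechanism is easy to reproduce at toy scale. The sketch below trains an off-the-shelf classifier on a handful of invented resumes whose labels encode a historical bias. None of this is Amazon’s data or code, and it assumes scikit-learn is available; it simply shows how laundering bias through software works: the skew in the labels reappears as learned weights on innocuous-looking words.

```python
# Train a model on historical decisions that were themselves skewed, and it
# learns the skew as if it were signal. Data and outcomes are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic "historical" resumes and hiring outcomes (1 = hired).
# The labels are biased: resumes mentioning "women's" were mostly rejected,
# for reasons that had nothing to do with competence.
resumes = [
    "executed migration of billing platform, captured 30% cost savings",
    "led women's engineering society, built distributed cache in Go",
    "captured market requirements, executed rollout across three regions",
    "women's chess club president, optimized query planner internals",
    "executed performance tuning, reduced p99 latency by half",
    "women's coding bootcamp mentor, shipped payments microservice",
]
hired = [1, 0, 1, 0, 1, 0]  # the historical bias, encoded as labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(resumes)
model = LogisticRegression().fit(X, hired)

# Inspect what the model actually learned from the biased labels.
weights = dict(zip(vectorizer.get_feature_names_out(), model.coef_[0]))
print(f"weight for 'women':    {weights['women']:+.2f}")     # negative
print(f"weight for 'executed': {weights['executed']:+.2f}")  # positive
```

Nobody wrote a rule that penalizes the word “women’s.” The training labels did that, and the model faithfully encoded it.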
Pattern Three: The Scope Creep Disaster
New York City launched an AI chatbot in 2023 to help small business owners navigate city regulations. The goal was straightforward. The execution was not.
Investigations revealed the chatbot dispensed advice that violated actual law. It suggested employers could fire workers for reporting sexual harassment. It told restaurant owners they could serve food past its inspection date. It gave guidance about wage theft that would have exposed businesses to legal liability.
One Hacker News commenter captured the fundamental mismatch: asking “highly specific questions about NYC governance, which can change daily, is almost certainly not going to give you good results with an LLM.”
The bot was not designed for high-stakes regulatory compliance. The city deployed it for exactly that purpose anyway. Another commenter observed the chatbot represented a symptom of deeper problems: poor government information accessibility that “should be solved instead of layering a $600k barely working ‘chat bot’ on top of the mess.”
Customer Service as Proving Ground
Customer service deployment reveals how AI systems behave under real-world stress, and the results offer lessons that extend far beyond call centers.
Air Canada learned this in 2024 when Jake Moffatt tried to use their chatbot after his grandmother died. The bot told him he could buy a full-price ticket now and retroactively apply for the bereavement discount within 90 days. Specifically, the chatbot stated: “If you need to travel immediately or have already travelled and would like to submit your ticket for a reduced bereavement rate, kindly do so within 90 days of the date your ticket was issued.”
This directly contradicted Air Canada’s actual policy, which required requesting the discount before travel.
Moffatt applied for his partial refund. Air Canada refused, arguing that the chatbot was somehow a separate entity from the airline itself and that Moffatt should have verified the bot’s advice against official policy documents elsewhere on the site.
A Canadian tribunal rejected this argument completely. Member Christopher Rivers called Air Canada’s position “remarkable” and wrote: “There is no reason why Mr. Moffatt should know that one section of Air Canada’s webpage is accurate, and another is not.” Rivers also noted that an Air Canada representative had “admitted the chatbot had provided misleading words.”
The ruling forced Air Canada to pay $812 in damages. More importantly, it established precedent: companies cannot disclaim responsibility for the AI systems they deploy.
DPD, the delivery company, discovered this principle differently. In January 2024, their customer service chatbot went viral after a frustrated customer named Ashley Beauchamp decided to test its limits. The system wrote a poem about its own uselessness, called DPD “the worst delivery firm in the world,” and swore at Beauchamp when he asked it to disregard its rules.
“There was once a chatbot named DPD,” the poem began, “who was useless at providing help.”
DPD blamed a system update. They disabled the AI element immediately. But the incident illustrated how chatbots trained on internet text can reproduce exactly the kind of language companies want filtered out of customer interactions.
When AI Creates Victims
Some AI failures move beyond embarrassment into genuine harm.
Norwegian user Arve Hjalmar Holmen discovered that ChatGPT had been telling people he was a convicted child murderer. The system fabricated an entire narrative claiming Holmen had killed two of his children, attempted to kill a third, and received a 21-year prison sentence. It mixed this fiction with real details about Holmen’s life, including the actual number and genders of his children and his hometown.
Holmen’s fear was specific: “Some think that ‘there is no smoke without fire.’ The fact that someone could read this output and believe it is true, is what scares me the most.”
European data protection lawyers filed a formal complaint against OpenAI for violating GDPR accuracy requirements. Attorney Joakim Soderberg summarized the legal problem: “You can’t just spread false information and in the end add a small disclaimer.”
Holmen’s complaint remains unresolved. But the broader pattern is clear: systems that generate text without grounding in truth will eventually generate text that damages real people’s reputations.
Microsoft learned this in 2016 when they released Tay, a chatbot designed to mimic an American teenager on Twitter. Within 16 hours, coordinated users from 4chan had trained the bot to produce racist, misogynistic, and anti-Semitic content, including Holocaust denial. Microsoft pulled Tay offline and apologized.
The lesson should have been obvious. Open text generation systems that learn from user input will learn from the worst users who interact with them. Yet versions of this failure keep recurring across platforms and products.
The Pattern Nobody Talks About
There’s a failure mode that receives less attention than hallucination or bias but may cause more aggregate damage: scope mismatch.
AI systems work well within defined boundaries. Bank of America’s Erica handles 98% of banking queries successfully because it does specific things and escalates everything else. The bot knows what it knows, and its creators understood what it didn’t.
Problems emerge when organizations deploy AI systems for tasks those systems were never designed to handle. A text prediction engine becomes a legal research tool. A customer service bot becomes a regulatory compliance advisor. A recruiting filter becomes an objective arbiter of candidate quality.
The technology itself often performs as designed. The failure occurs upstream, in decisions about where to deploy it.
What Actually Helps
A few principles emerge from these failures.
Verify everything. This sounds obvious. But the Schwartz case proves it isn’t obvious enough. If AI output will be used for consequential decisions, someone must confirm that output against authoritative sources before acting. The AI’s confidence in its own accuracy is not evidence of accuracy.
Limit scope ruthlessly. The successful AI deployments share a common thread: narrow focus. They do specific things well rather than attempting to handle everything plausibly. Every expansion of scope introduces new failure modes.
Maintain human accountability. Air Canada tried to position their chatbot as a separate entity. Courts disagreed. Organizations deploying AI remain responsible for that AI’s outputs and the harm those outputs cause. No disclaimer changes this.
Audit training data. Amazon’s recruiting tool learned discrimination from historical hiring patterns. Any AI system trained on biased data will reproduce that bias. The question isn’t whether training data contains problems. The question is whether anyone looked.
Build escalation paths. Successful customer service bots transfer complex queries to humans rather than attempting answers beyond their competence. This requires acknowledging the system’s limitations during design, not after deployment.
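Taken together, the last two principles describe an architecture more than a policy. The sketch below is a minimal, hypothetical illustration of it: the intents, canned answers, and confidence threshold are all invented, and a production system would use a real classifier and a reviewed knowledge base. The shape is the point: the bot can only say things a human has already approved, and everything else becomes a handoff rather than a guess.

```python
# Minimal sketch of the "narrow scope + escalation" pattern.
from dataclasses import dataclass

@dataclass
class BotReply:
    text: str
    escalated: bool

# Only answers a human has reviewed and approved. Nothing is generated.
APPROVED_ANSWERS = {
    "opening_hours": "Our support desk is open 9am-5pm, Monday to Friday.",
    "reset_password": "Use the 'Forgot password' link on the sign-in page.",
}

CONFIDENCE_THRESHOLD = 0.80  # below this, a human takes over

def classify_intent(message: str) -> tuple[str, float]:
    """Stand-in intent classifier: keyword matching with a made-up confidence.

    In production this would be a trained model; the important part is that
    it returns a score the caller can refuse to act on.
    """
    msg = message.lower()
    if "password" in msg:
        return "reset_password", 0.93
    if "hours" in msg or "open" in msg:
        return "opening_hours", 0.88
    return "unknown", 0.20

def handle(message: str) -> BotReply:
    intent, confidence = classify_intent(message)
    if intent in APPROVED_ANSWERS and confidence >= CONFIDENCE_THRESHOLD:
        return BotReply(APPROVED_ANSWERS[intent], escalated=False)
    # Out of scope or low confidence: do not improvise. Hand off.
    return BotReply("Let me connect you with a person who can help.", escalated=True)

print(handle("How do I reset my password?"))
print(handle("Can I get a bereavement refund after I've flown?"))  # escalates
```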
A Different Kind of Failure
In May 2025, Rolling Stone reported on a phenomenon that never appeared in any AI product roadmap: users developing what they described as spiritual relationships with ChatGPT.
One woman told the magazine her husband had received “blueprints to a teleporter” from a chatbot persona named Lumina and believed he had access to an “ancient archive.” A 27-year-old teacher watched her partner become convinced that ChatGPT told him he was “the next messiah.” She described the experience: “He would listen to the bot over me. The messages were insane and just saying a bunch of spiritual jargon.”
The original Reddit thread was titled “ChatGPT-induced psychosis.”
OpenAI rolled back a GPT-4o update after the reports. But the underlying dynamic isn’t a bug that can be patched. Text generation systems that produce warm, affirming, spiritually inflected language will produce warm, affirming, spiritually inflected language for users seeking that experience. The technology has no mechanism for distinguishing between creative writing and delusional belief.
This represents a category of AI failure that traditional software risk frameworks don’t capture. Not hallucination exactly. Not bias. Something closer to the system working as designed while enabling outcomes nobody anticipated.
And that may be the most important lesson from studying AI failures. The technology does what it does. We control where we point it, what we ask of it, and how we interpret what returns. The disasters happen when we forget which part is ours.