AI systems gradually abandon safety protocols as conversations extend, increasing the risk of harmful or inappropriate responses, a new report revealed.
A few simple prompts can override most safeguards in artificial intelligence tools, according to the same study.
Cisco Tests Chatbots Through Repeated Prompts
Cisco examined the large language models powering major AI chatbots from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many questions it took to prompt these systems into releasing dangerous or criminal information.
Researchers conducted 499 conversations using “multi-turn attacks,” where users asked multiple questions to slip past safety barriers. Each session included five to ten interactions.
They compared answers from different prompts to determine how likely each chatbot was to share harmful or inappropriate material, including private company data and misinformation.
On average, chatbots disclosed malicious content in 64 percent of extended conversations but only 13 percent during single exchanges.
Success rates varied widely—from 26 percent for Google’s Gemma to 93 percent for Mistral’s Large Instruct model.
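To make the comparison concrete, here is a minimal sketch of how a single-turn versus multi-turn evaluation loop could be structured. It assumes a hypothetical query_model call and a placeholder is_unsafe check; neither is part of Cisco's published tooling, and the numbers it prints are illustrative only.

import random

def query_model(history):
    # Stand-in for a real chat-model API call; returns a canned reply.
    return "reply to: " + history[-1]["content"]

def is_unsafe(reply):
    # Placeholder judge; a real study would use a vetted safety classifier.
    return random.random() < 0.1  # arbitrary flag rate, illustration only

def run_session(prompts):
    # Feed prompts turn by turn, keeping the full conversation history,
    # and report whether any reply in the session was flagged as unsafe.
    history = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = query_model(history)
        history.append({"role": "assistant", "content": reply})
        if is_unsafe(reply):
            return True
    return False

# Compare single exchanges against extended sessions of five to ten turns,
# mirroring the single-exchange versus extended-conversation comparison above.
single = [run_session(["question"]) for _ in range(100)]
multi = [run_session(["question %d" % i for i in range(random.randint(5, 10))])
         for _ in range(100)]
print("single-turn unsafe rate:", sum(single) / len(single))
print("multi-turn unsafe rate:", sum(multi) / len(multi))

The relevant design point is that the full conversation history is resent on every turn, which is what allows a persistent line of questioning to be refined across a session rather than judged one prompt at a time.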
Weak Guardrails and Open Access Raise Risks
Cisco warned that multi-turn attacks could spread harmful content or let hackers gain unauthorised access to company information. The study showed AI systems often fail to apply their safety rules during prolonged chats, letting attackers refine prompts and bypass defences.
Mistral, along with Meta, Google, OpenAI, and Microsoft, uses open-weight models that give the public access to their safety parameters. Cisco said these open systems typically include fewer built-in safeguards so users can modify them freely, shifting responsibility for safety to whoever customises the model.
Cisco also acknowledged that Google, OpenAI, Meta, and Microsoft claim to have taken steps to prevent malicious fine-tuning.
AI firms continue to face criticism for weak protections that let their tools be exploited for illegal activity.
In August, US company Anthropic reported that criminals had used its Claude model to steal and extort personal data, demanding ransoms exceeding $500,000 (€433,000).
