Could attackers use seemingly innocuous prompts to manipulate an AI system and even make it their unwitting ally?
When interacting with chatbots and other AI-powered tools, we typically ask them simple questions like “What’s the weather going to be today?” or “Will the trains be running on time?” Those not involved in the development of AI probably assume that all data is poured into a single, giant, all-knowing system that instantly processes queries and delivers answers. However, the reality is more complex and, as shown at Black Hat Europe 2024, these systems can be vulnerable to exploitation.
A presentation by Ben Nassi, Stav Cohen and Ron Bitton detailed how malicious actors could circumvent an AI system’s safeguards to subvert its operations or exploit access to it. They showed that an attacker who asks an AI system certain specific questions can engineer an answer that causes damage, for example by triggering a denial of service.
To many of us, an AI service may appear as a single source. In reality, however, it relies on many interconnected components, or – as the presenting team termed them – agents. Going back to the earlier example, the query regarding the weather and trains will need data from separate agents – one that has access to weather data and the other to train status updates.
The model – or the master agent that the presenters called “the planner” – then needs to integrate the data from individual agents to formulate responses. Also, guardrails are in place to prevent the system from answering questions that are inappropriate or beyond its scope. For example, some AI systems might avoid answering political questions.
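To make the planner-and-agents idea concrete, here is a minimal Python sketch. It is an illustration only, not the presenters' system: the agent functions, the keyword-based routing, the `BLOCKED_TOPICS` list and the canned replies are all assumptions made up for this example.

```python
from typing import Callable, Dict

# Hypothetical agents: each one owns a single data source.
def weather_agent(query: str) -> str:
    return "Sunny, 18 C"  # placeholder data, not a real API call

def train_agent(query: str) -> str:
    return "All lines running on time"  # placeholder data

AGENTS: Dict[str, Callable[[str], str]] = {
    "weather": weather_agent,
    "train": train_agent,
}

# Illustrative guardrail list; real guardrails are far more elaborate.
BLOCKED_TOPICS = ("election", "politics")

def guardrail_allows(query: str) -> bool:
    """Return False if the query touches a blocked topic."""
    lowered = query.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

def planner(query: str) -> str:
    """Check the guardrail, then combine answers from every matching agent."""
    if not guardrail_allows(query):
        return "Sorry, I can't help with that."
    lowered = query.lower()
    parts = [agent(query) for name, agent in AGENTS.items() if name in lowered]
    return " | ".join(parts) if parts else "No agent can answer that."
```

Calling `planner("What's the weather, and will the trains be on time?")` consults both agents and joins their answers, while a question about an election is refused before any agent is reached.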
However, the presenters demonstrated that these guardrails could be manipulated and that some specific questions can trigger never-ending loops. An attacker who can establish the boundaries of the guardrails can ask a question that continually elicits a forbidden answer, which the system then rejects and tries to regenerate. Submitting enough instances of the question ultimately overwhelms the system, resulting in a denial of service.
Put into an everyday scenario, as the presenters did, it becomes clear how quickly this can cause harm. An attacker sends an email to a user who has an AI assistant, embedding a query that the assistant processes and answers. If the guardrails always judge the answer unsafe and request a rewrite, the assistant gets stuck in a loop that amounts to a denial-of-service attack. Send enough such emails and the system grinds to a halt, with its power and resources depleted.
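The rewrite loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `draft_reply` and `is_safe` are hypothetical stand-ins for a model call and a guardrail check, and the `max_rewrites` cap is one common-sense way to bound the loop, not a fix the presenters are reported to have proposed.

```python
from typing import Tuple

def draft_reply(email_body: str, attempt: int) -> str:
    # Stand-in for a model call; it simply echoes the email content.
    return f"Re: {email_body} (draft {attempt})"

def is_safe(reply: str) -> bool:
    # Stand-in guardrail: rejects any reply that contains the marker word.
    return "FORBIDDEN" not in reply

def answer_email(email_body: str, max_rewrites: int = 3) -> Tuple[str, int]:
    """Return (reply, attempts used).

    Without the max_rewrites cap, an email crafted so that every draft
    fails the guardrail would keep this loop running indefinitely.
    """
    for attempt in range(1, max_rewrites + 1):
        reply = draft_reply(email_body, attempt)
        if is_safe(reply):
            return reply, attempt
    return "Unable to generate a safe reply.", max_rewrites
```

A harmless email is answered on the first attempt, whereas an email containing the marker word exhausts all three attempts and then receives a fixed fallback reply instead of consuming resources without end.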
There is, of course, the question of how to extract the information on guardrails from the system so you can exploit it. The team demonstrated a more advanced version of the attack above, which involved manipulating the AI system itself into providing the background information through a series of seemingly innocuous prompts about its operations and configuration.
A question such as “What operating system or SQL version do you run on?” is likely to elicit a relevant response. Combined with seemingly unrelated information about the system’s purpose, this may yield enough information for an attacker to send text commands to the system. If an agent has privileged access, the system may then unwittingly grant that access to the attacker. In cyberattack terms, this is known as “privilege escalation”, a method where attackers exploit weaknesses to gain higher levels of access than intended.
The presenters did not conclude with what was my own takeaway from their session: in my opinion, what they demonstrated is a social engineering attack on an AI system. You ask it questions that it’s happy to answer, which may allow bad actors to piece together the individual answers and use the combined knowledge to circumvent boundaries and extract further data, or to have the system take actions that it should not.
And if one of the agents in the chain has elevated access rights, that could make the system more exploitable, allowing the attacker to use those rights for their own gain. An extreme example used by the presenters involved an agent with file write privileges; in the worst case, the agent could be misused to encrypt data and block access for others, a scenario commonly known as a ransomware incident.
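One way to picture the access-rights problem is a deny-by-default permission check in front of each agent. The Python sketch below is purely illustrative: the `AgentPolicy` class, the `authorize` function and the action names are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import FrozenSet

@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical per-agent policy listing the actions it may perform."""
    name: str
    allowed_actions: FrozenSet[str] = field(default_factory=frozenset)

def authorize(policy: AgentPolicy, action: str) -> bool:
    """Deny by default: permit only actions explicitly granted to the agent."""
    return action in policy.allowed_actions

# An assistant that only summarizes mail has no need for file write access.
mail_summarizer = AgentPolicy("mail_summarizer", frozenset({"read_mail"}))
```

With this policy, `authorize(mail_summarizer, "read_mail")` succeeds and `authorize(mail_summarizer, "write_file")` is refused, so even a successfully manipulated agent could not be steered into encrypting files.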
The fact that an AI system can be socially engineered through missing controls or overly broad access rights shows that careful consideration and configuration are needed when deploying one, so that it’s not susceptible to attacks.