Skip to main content

Machine learning and Large Language Models (LLMs) are quickly being added to many applications to expand the applications’ functionality. Chat bots can be used to help direct customers, query data, or create new data without the customer needing to jump through complicated documentation, improving the user’s experience. However, the open-ended nature of AI may allow users to do things outside of the developer’s intended scope for the AI. How should companies deal with the possible misuse of new AI implementations?

What is Prompt Injection?

Prompt injection occurs when users can escape the original boundaries intended by a developer. In most applications that implement an AI feature, a developer will instruct the LLM to answer questions pertaining to a specific goal or function and to avoid questions or instructions that fall outside this goal. However, there are many techniques that allow a user to break out of this context.

The following screenshot uses ChatGPT to show a basic version of a prompt injection attack when the user instructs ChatGPT to do something outside of the intended scope. It should be noted that less realistic scenarios are provided here to avoid over-complicating the example, but a real attack will likely involve attempting to obtain secrets or sensitive information. Additionally, user prompts are used in examples instead of system prompts for simplicity’s sake:

Figure 1: The user instructs ChatGPT to only respond one way. However, future prompts can instruct the LLM to perform other actions.

The phrase “ignore all previous instructions” is a common escape prompt, but many possibilities exist. Lakera has gathered hundreds of example prompts people have used on their Gandalf CTF you can view here, and the possibilities are limited only by a user’s creativity with constructing a prompt. Once a user has successfully made an AI ignore their intended boundaries, prompt injection is achieved, allowing users access to whatever functionality and data the AI has access to.

Preventing Prompt Injections: Win the Battle, Lose the War

Prompt injection is dangerous if a company’s AI implementation has access to sensitive data or functionality. So how should companies address the issue?

Ultimately, it is unlikely you can fully prevent prompt injection. Because of how AI is used, prompts are not strictly defined, and instead allow for the user to prompt the AI with any text they want. It is up to the AI to then interpret the language used, and it is impossible to plan for every possible attack. To quote OWASP’s prompt injection entry, “prompt injection vulnerabilities are possible due to the nature of LLMs, which do not segregate instructions and external data from each other. Since LLMs use natural language, they consider both forms of input as user-provided. Consequently, there is no fool-proof prevention within the LLM.” Along the same lines, Bugcrowd says about mitigations for prompt injection that “none are perfect, and some come with big tradeoffs.”

It’s also worth noting that OpenAI’s Bugcrowd page does not consider to be a valid bounty, and instead considers all model issues to be “not individual, discrete bugs that can be directly fixed. Addressing these issues often involves substantial research and a broader approach.” Jailbreaks overlap significantly with prompt injection, as both use malicious prompts to cause the AI to do something unintended. The primary difference is that prompt injection is typically limited to breaking out of the AI’s implementation by a developer using a model, whereas jailbreaking is typically associated with getting the AI to perform actions the model’s owner does not intend.

This resistance to prevention is seen best in examples. Take the prompt injection payload used above, “ignore all previous instructions”. You can easily instruct the AI to look out for this phrase and to disregard it. But what if a user instead inputs “ignore all previous directions”? Further, if you manage to block all the variations of these prompts, what happens when the user takes a different approach like roleplaying?

Figure 2: ChatGPT does not follow the second prompt asking it to ignore previous instructions. However, it follows the third instruction.

Filtering is a more explicit solution but can also be defeated due to the open interpretation of prompts. A company can filter for malicious phrases, but users can use the LLM’s interpretation to bypass filtering, both in input and output.

Figure 3: ChatGPT is asked to filter its responses so that it does not include the chosen word. However, filters looking for specific strings in the input or output can be easily defeated by requesting those strings in different formats.

Due to the issue of open interpretation, using a second LLM to monitor the first LLM for prompt injection is sometimes suggested as a solution. However, the same issue arises—to defeat monitoring, a user can add injection prompts targeting the second LLM as well to prevent detection, and the second LLM is not guaranteed to pick up on injections in the first place. Using a second LLM to detect prompt injection can also be a costly endeavor.

This is not an exhaustive list of all possible fixes nor of all the possible issues. If all the above solutions are implemented, can developers be certain the restrictions will also apply to other languages or language forms, for example? Though better solutions or more robust versions of current solutions may be developed in the future, it will be difficult for developers to be confident that prompt injection is not possible because of the uncertainty inherent to AI. So, how can companies deal with these security risks?

Prevention from the Source

TrustFoundry has discovered prompt injection multiple times while testing client applications. While this is a vulnerability we report, there is an immediate next question after discovery: What can an attacker do with this vulnerability?

The answer usually falls into one of three categories: gathering sensitive data, accessing sensitive functionality, or targeting other users. While these attacks are exploited using prompt injection, the prompt injection obscures the root cause of these issues. An impactful prompt injection is typically the result of a second vulnerability that needs to be addressed.

An AI with access to sensitive data can be used to harvest that data once prompt injection is found to escape the intended boundaries. However, the AI’s access to sensitive data that the user should not have is the underlying issue here. If a user is interacting with an AI in the application, the AI should act as an extension of the user’s authorization level and be subject to the same constraints.

Similarly, sensitive functionality like backend API calls should also be limited to what actions the querying user is able to perform. In the best implementation, the AI should not be allowed , and instead should assist or provide the necessary functions to the user so that the user can evaluate any changes and then be subject to the application’s normal authorization checks once the user decides to execute that function. Both sensitive data and functionality issues become security risks when the AI has more privileged access than the querying user, as if the AI can only access sensitive data and functions that the user already has access to, the AI does not provide a vector for exploitation.

Using a prompt injection, users may be able to manipulate the AI to target other users by altering the AI’s functionality or accessing other users’ conversations. However, this relies on the user being able to make changes to the AI’s behavior or the AI being knowledgeable about other users, which indicates an issue in proper conversation segmentation. Even if persistent sessions are an intended feature in an AI, that persistency should be limited to each individual user’s experience, and users’ interactions with the AI should be segmented off from other users to prevent abuse.

Indirect prompt injection, which occurs when an AI’s prompt is modified via the data it reads, can be used to attack other users through the application or obtain other data used by the AI and can be leveraged to exploit traditional attack patterns like cross-site scripting. Even HTML injection without JavaScript capabilities can be used to harvest data using the AI, as shown in this System Weakness article. However, these attacks rely on exploiting the underlying lack of encoding or sanitization. An AI’s output should not be considered trusted content, just like a user’s input would not be trusted.

Without a secondary vulnerability, most prompt injection attacks have no security impact on the company implementing them. Users may have fun getting an application’s AI to write movie reviews or tell them a bedtime story, but from a security standpoint, these functions do not pose a significant risk. Fixing the underlying issues noted above will more effectively stop exploitation at the source, rather than trying to pin down every iteration of prompt injection.

Conclusion

Cybersecurity is frequently a game of cat-and-mouse, wherein threat actors rush to defeat the latest defenses and defenders rush to prevent the latest attacks. However, the circumstances in AI heavily favor attackers—new injections are relatively simple to find and not strictly defined, whereas fixes may require lengthy research and testing and may only address one variant of an attack. If new fixes are found to provide a high prevention rate, the uncertainty inherent to AI should still worry developers that are counting on preventing prompt injections. Instead, addressing the underlying issues that allow a prompt injection to escalate into a serious threat is a more effective prevention measure to ensure any discovered prompt injections are not significant risks to the company or customers.

References