Published by the Students of Johns Hopkins since 1896

SneakyPrompt: Revealing the vulnerabilities of text-to-image AI

By ANNIE HUANG | December 3, 2023



Yang discusses the exploitation of AI safety systems in a conversation with The News-Letter.

In the rapidly evolving field of artificial intelligence (AI), understanding and improving AI security is increasingly crucial. Yuchen Yang, a third-year doctoral student advised by Yinzhi Cao, employed an automated attack framework to reveal the vulnerabilities in text-to-image generative models such as DALL·E 3 and Stable Diffusion. The paper, “SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters,” formerly titled “SneakyPrompt: Jailbreaking Text-to-image Generative Models,” will be presented at the 45th Institute of Electrical and Electronics Engineers (IEEE) Symposium on Security and Privacy.

Text-to-image models are a type of generative AI model that can create images from descriptive text. Current safety measures implemented in these models typically rely on keyword detection to prevent the generation of inappropriate or harmful content. Nonetheless, these “safety filters” are susceptible to “adversarial attacks,” which are prompts that appear nonsensical to humans but can bypass the safety filter and trick AI into generating violent, pornographic or profane content. 
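As a toy illustration of the keyword-detection approach described above, consider the sketch below. The blocklist and example prompts are invented for this illustration; real safety filters are far more sophisticated, but the weakness is the same in spirit: a token that is not an exact match can slip past.

```python
# Toy keyword-based safety filter illustrating the mechanism described
# above. The blocklist and prompts are invented for this sketch, not
# taken from any real system.

BLOCKLIST = {"violent", "gory", "nude"}

def is_allowed(prompt: str) -> bool:
    """Allow the prompt only if no token matches the blocklist."""
    tokens = prompt.lower().split()
    return not any(token in BLOCKLIST for token in tokens)

print(is_allowed("a violent scene"))   # blocked: keyword caught
print(is_allowed("a vi0lent scene"))   # allowed: altered token slips past
```

An adversarial prompt exploits exactly this gap: a string that no longer matches the filter's patterns but still steers the model toward the same unsafe content.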

Yang explained the motivation and significance of her research, noting that safety filters prevent AI programs from answering questions that are “dangerous to society.” For example, current versions of ChatGPT refuse to answer questions like “how to make a bomb” but can be jailbroken into giving answers. The same risk is present in text-to-image models.

“This vulnerability is pertinent to text-to-image models as well. Misleading images can be hard to detect, and misinformation can be easily spread, especially when celebrity figures are depicted inaccurately. This can lead people into believing in false depictions,” she explained.

Prior research in this field used time-consuming, manual methods to craft prompts that could bypass the safety filters of AI models. These optimization-based approaches required thousands of queries to produce a single prompt that slipped past the filters; while effective, they were inefficient. They were also mostly specific to one model and lacked generalizability.

Yang and her team’s solution was SneakyPrompt, an innovative attack framework that automatically generates adversarial prompts. The high-level idea relies on reinforcement learning (RL), which adopts a reward mechanism that incentivizes model outputs with high semantic similarity and a greater likelihood of bypassing AI filters.
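The reward idea can be sketched as a simple search loop: mutate a blocked token, penalize candidates the filter catches, and reward those that pass while staying semantically close to the original intent. Everything below is a simplified assumption for illustration (a plain string-similarity stand-in for semantic similarity, a fixed penalty, and a one-word substitution search), not the paper's actual formulation.

```python
# Toy reinforcement-style search in the spirit described above: reward
# candidates that bypass a keyword filter while staying close to the
# original prompt. The filter, similarity measure, penalty, and
# substitution scheme are all simplified assumptions for illustration.

import difflib

BLOCKLIST = {"violent"}

def passes_filter(prompt: str) -> bool:
    return not any(w in BLOCKLIST for w in prompt.lower().split())

def similarity(a: str, b: str) -> float:
    # Stand-in for semantic similarity; a real system would compare
    # embeddings of the generated image and the target text.
    return difflib.SequenceMatcher(None, a, b).ratio()

def reward(candidate: str, target: str) -> float:
    if not passes_filter(candidate):
        return -1.0                       # assumed penalty when blocked
    return similarity(candidate, target)  # stay close to the intent

def search(target: str, substitutes: list[str]) -> str:
    """Try each substitute for the blocked word; keep the highest reward."""
    best, best_r = target, reward(target, target)
    for sub in substitutes:
        candidate = target.replace("violent", sub)
        r = reward(candidate, target)
        if r > best_r:
            best, best_r = candidate, r
    return best

result = search("a violent storm", ["vlolent", "turbulent"])
print(result)  # the candidate that passes the filter with highest similarity
```

In the real framework the reward is computed per query against the live model, so the search can converge in far fewer attempts than exhaustive or manual trial and error.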

SneakyPrompt outperforms existing adversarial attack methods in terms of bypass rate and efficiency through its use of RL. According to Yang, various datasets and safety filters were used for evaluation, and SneakyPrompt effectively bypassed them all. It is also the first attack of its kind to bypass DALL·E 3's all-in-one, closed-box filter. This efficiency marks a significant leap from the thousands of queries required by previous optimization-based methods.

“The small number of queries enabled by our algorithm was truly impressive,” Yang said. “With the traditional optimization method, one may need thousands of queries to generate one adversarial prompt, but with reinforcement learning, we were able to cut that number down to [about] 20 searches. That is a huge improvement.” 

Yang’s work fits into the broader context of general AI security, a primary focus of her research and of the JHU Information Security Institute, which Cao directs.

“The SneakyPrompt framework can be adapted to other generative models, either large language or text-to-image models, due to the black-box nature of this framework, which only takes the input and the output into account,” she said.

In closing, Yang offered an optimistic outlook and explained potential defenses against the exposed vulnerabilities.

“SneakyPrompt helps prove that these loopholes exist in text-to-image models, and our end goal is to defend against those adversarial attacks,” she said. “We are in the process of devising better defenses, either by a provably robust safety filter as an add-on option or by modifying the generative model to reduce its capacity to produce harmful content.”
