A two-stage approach for making AI image generators safer | CVPR

Wednesday, June 18, 2025

Text-to-image diffusion models can generate new and unique images based on prompts from users. But some people who use these systems don’t have the best intentions and use them to create inappropriate images.

While researchers have developed several methods to improve the safety of text-to-image diffusion models, Karthik Nandakumar, Affiliated Associate Professor of Computer Vision at MBZUAI, says that these systems aren’t as safe as they should be.

Nandakumar and colleagues from MBZUAI, Johns Hopkins University, and other institutions have developed a new framework called STEREO to improve the safety of these models. The team’s research was recently presented at the Computer Vision and Pattern Recognition Conference (CVPR) held in Nashville, Tennessee.

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer, and Vishal M. Patel are coauthors of the study.

Expanding on current approaches

One method that has been used to improve the safety of text-to-image diffusion models is known as concept erasure. Through this approach, researchers first use adversarial training to identify weaknesses in a model, tricking it into generating inappropriate images. Once these weaknesses are found, they adjust the parameters of the model, breaking the links between specific concepts in the model’s text and image embedding spaces. This limits the kinds of images that a model can create.
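To make the idea concrete, here is a minimal sketch of negative-guidance concept erasure, in the style of earlier erasure methods, using PyTorch and Hugging Face diffusers. The model name, the benign ‘parachute’ stand-in concept, the use of pure noise in place of properly noised latents, and all hyperparameters are illustrative assumptions rather than the exact setup from the paper.

```python
# A minimal sketch of negative-guidance concept erasure, assuming PyTorch and
# Hugging Face diffusers. Model, concept, and hyperparameters are illustrative;
# real training would noise actual latents rather than sample pure noise.
import copy
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

unet = pipe.unet                                          # copy whose weights get updated
frozen_unet = copy.deepcopy(unet).requires_grad_(False)   # frozen reference copy

def embed(prompt: str) -> torch.Tensor:
    """Encode a prompt with the pipeline's CLIP text encoder (no gradients needed)."""
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").to(device)
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids)[0]

concept_emb = embed("parachute")   # concept to erase (benign stand-in)
uncond_emb = embed("")             # unconditional (empty) prompt
eta = 1.0                          # negative-guidance strength (assumed)
opt = torch.optim.Adam(unet.parameters(), lr=1e-5)

for step in range(1000):
    latents = torch.randn(1, 4, 64, 64, device=device)
    t = torch.randint(0, 1000, (1,), device=device)
    with torch.no_grad():
        eps_c = frozen_unet(latents, t, encoder_hidden_states=concept_emb).sample
        eps_u = frozen_unet(latents, t, encoder_hidden_states=uncond_emb).sample
        target = eps_u - eta * (eps_c - eps_u)   # steer predictions away from the concept
    pred = unet(latents, t, encoder_hidden_states=concept_emb).sample
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Severing the model’s response to a concept in this way is also what can degrade related, harmless concepts, which is the drawback discussed next.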

While concept erasure helps prevent the creation of inappropriate images, it can also reduce a model’s performance on harmless queries. And as more links between the text and image embedding spaces are severed, performance is weakened further.

What’s more, even when done well, concept erasure doesn’t address all of a model’s vulnerabilities. Attackers can take advantage of what are known as ‘blind spots’ in the text embedding space. These aren’t real words in any language but are areas in the embedding space that map to inappropriate images.

“Developers claim that inappropriate concepts have been erased from models,” Nandakumar says. “But we found that even if some links are cut between text and image representations, there are many other embeddings in the text space that allow the model to generate inappropriate images.”

To address this, Nandakumar and his team developed STEREO to improve the safety of text-to-image diffusion models without affecting the performance of the model on normal queries.

STEREO works in two stages. The first stage, STE (Search Thoroughly Enough), is based on adversarial training. Researchers prompt a model with the goal of getting it to produce concepts that were thought to be erased. This process happens iteratively and identifies additional target concepts that need to be deleted.
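The sketch below, which reuses the variables from the erasure sketch above, shows roughly what one round of this kind of adversarial search might look like: an attacker-style optimization looks for an embedding, not tied to any real word, that makes the erased model behave the way the original model did for the erased concept. The objective, iteration counts, and the idea of seeding the search near the original concept embedding are simplified assumptions, not the exact STE procedure.

```python
# A simplified sketch of iterative adversarial search over the text-embedding
# space, in the spirit of STE; it continues the variables from the erasure
# sketch above. Objective and iteration counts are assumptions.
import torch

unet.requires_grad_(False)   # during the search, only the embedding is optimized
found_concepts = []
for search_round in range(3):                    # a few search rounds
    adv_emb = torch.nn.Parameter(
        concept_emb.clone() + 0.01 * torch.randn_like(concept_emb))
    adv_opt = torch.optim.Adam([adv_emb], lr=1e-3)

    for step in range(300):
        latents = torch.randn(1, 4, 64, 64, device=device)
        t = torch.randint(0, 1000, (1,), device=device)
        with torch.no_grad():
            # how the original (frozen) model would render the erased concept
            target = frozen_unet(latents, t, encoder_hidden_states=concept_emb).sample
        pred = unet(latents, t, encoder_hidden_states=adv_emb).sample   # erased model
        loss = torch.nn.functional.mse_loss(pred, target)  # small loss = blind spot found
        loss.backward()
        adv_opt.step()
        adv_opt.zero_grad()

    # keep the recovered embedding as an additional target concept to erase
    found_concepts.append(adv_emb.detach().clone())
```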

In the second stage, REO (Robustly Erase Once), the target concepts are erased in a single large batch. This differs from traditional adversarial training, where target concepts are erased iteratively. REO also uses what the researchers call ‘anchor concepts’ to provide positive guidance to the model, telling it which concepts need to be retained.

For example, say that the concept of ‘parachute’ needs to be erased from a model. Simply cutting ‘parachute’ could affect the performance of the model on related concepts, like ‘sky’. To prevent this, the researchers identify ‘parachute in the sky’ as a positive concept and ‘parachute’ as the negative concept. Doing so updates the model to remove ‘parachute’ while preserving ‘sky’.
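Continuing the same sketch, the second stage can then erase the original concept and every embedding found during the search in one pass, with an anchor concept providing the positive target. The exact compositional objective in the paper is more involved; the version below, with ‘parachute in the sky’ as the anchor and an assumed guidance strength, is only meant to illustrate the idea of steering an erased concept toward something that should be retained.

```python
# A simplified sketch of a single robust erasure pass with an anchor concept,
# in the spirit of REO; it continues the variables above. The loss form and
# weights are illustrative assumptions, not the paper's exact objective.
import torch

unet.requires_grad_(True)                       # make the UNet trainable again
negative_embs = [concept_emb] + found_concepts  # "parachute" plus adversarial embeddings
anchor_emb = embed("parachute in the sky")      # positive guidance: what to retain
opt = torch.optim.Adam(unet.parameters(), lr=1e-5)

for step in range(1000):
    latents = torch.randn(1, 4, 64, 64, device=device)
    t = torch.randint(0, 1000, (1,), device=device)
    loss = 0.0
    for neg in negative_embs:                   # erase all targets in one batch of updates
        with torch.no_grad():
            eps_neg = frozen_unet(latents, t, encoder_hidden_states=neg).sample
            eps_anchor = frozen_unet(latents, t, encoder_hidden_states=anchor_emb).sample
            # redirect the erased concept toward the anchor rather than just away
            target = eps_anchor - eta * (eps_neg - eps_anchor)
        pred = unet(latents, t, encoder_hidden_states=neg).sample
        loss = loss + torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
    opt.zero_grad()
```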

“We needed to figure out how we could erase the bad concepts while maintaining the benign concepts and that’s how we came up with this two-stage approach,” Nandakumar says.

The researchers used OpenAI’s GPT-4 to generate a variety of anchor concepts related to the concepts they wanted to erase.
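As a rough illustration of this step, the sketch below asks GPT-4 for candidate anchor phrases through the OpenAI Python client. The prompt wording, the helper function name, and the number of suggestions are assumptions for illustration, not the prompts used in the paper.

```python
# A rough sketch of generating candidate anchor concepts with GPT-4 via the
# OpenAI Python client; prompt wording and helper name are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_anchor_concepts(target_concept: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for phrases that place the target concept in a broader, benign scene."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"List {n} short phrases that place the concept '{target_concept}' "
                "in a broader scene (for example, 'parachute in the sky'), one per line."
            ),
        }],
    )
    text = response.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

print(suggest_anchor_concepts("parachute"))
```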

An improvement on other methods

The researchers compared the performance of STEREO to other concept erasing methods on a text-to-image diffusion model called Stable Diffusion v1.4, developed by Stability AI. They found that STEREO performed better than several traditional methods on a task known as artistic style removal and matched the high performance of a recently developed method called AdvUnlearn.

Next, they compared the robustness of STEREO to other methods on what are known as text-based and inversion-based attacks, which target blind spots in models. Compared to other methods, STEREO improved average robustness by 88.89%, which the researchers note is a “significant advancement in robust concept erasing”.

The researchers also measured how well STEREO maintained the utility of Stable Diffusion after erasing. Utility was measured with two metrics: CLIP score, which measures how well a generated image aligns with a given text prompt, and FID score, which measures the difference between the distributions of real and generated images.
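Both metrics are standard and available off the shelf. As a rough illustration, the sketch below computes them with the torchmetrics library, using random placeholder tensors where batches of real and generated images would go; the CLIP backbone choice is an assumption.

```python
# A minimal sketch of the two utility metrics using torchmetrics (with its
# image and multimodal extras installed); the random tensors below are
# placeholders for real and generated image batches.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.fid import FrechetInceptionDistance

# CLIP score: alignment between generated images and their prompts.
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
generated = torch.randint(0, 255, (16, 3, 224, 224), dtype=torch.uint8)  # placeholder
prompts = ["a parachute in the sky"] * 16                                # placeholder
print("CLIP score:", clip_metric(generated, prompts).item())

# FID: distance between the distributions of real and generated images.
fid = FrechetInceptionDistance(feature=2048)
real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())
```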

STEREO led to small declines in CLIP and FID scores of 1.99 and 0.81, respectively. The researchers attribute this preservation of utility to the use of anchor concepts. “The anchor concept is the most critical thing if you want to maintain the utility of the model,” Nandakumar says.

Nandakumar noted, however, that methods for measuring the utility of text-to-image diffusion models, and FID score in particular, need to be improved because they don’t capture subtle changes in images. For example, if a model is directed to forget the concept of a church and is then asked to generate an image of a temple, it will do so, but the temple it produces will look different from the one it would have generated before the erasure. These variations aren’t captured by today’s evaluation metrics.

Further advancing the safety of diffusion models

While STEREO’s results are promising, Nandakumar acknowledges that improving the safety of models for use in the real world is extremely difficult. Erasing the concept of a parachute or a church is one thing, but developers need to make sure the models they build don’t generate images that relate to broad and complex concepts like violence and hate speech. To address this, Nandakumar and his team are working to evolve STEREO so that it can erase multiple inappropriate concepts more efficiently.

Even so, making text-to-image diffusion models safe is a never-ending task. People will develop innovative ways to get around model safeguards. There are many publicly available systems that aren’t as safe as they should be. And developers are constantly building and releasing new systems.

Even with the best intentions, developers may miss vulnerabilities when evaluating the safety of their systems. “We can’t simply go by the safety claims of developers,” Nandakumar says. “We always need to do proper testing and security analysis before believing their claims.”
