Anubhav Jain
@anubhavj480.bsky.social
97 followers 50 following 25 posts
PhD Candidate @ NYU
Reposted by Anubhav Jain
New results from @anubhavj480.bsky.social, one of my co-advised students (on the job market, hint hint): a new way of forging or removing watermarks in images generated with diffusion models. This is a simple and effective adversarial attack that requires only one example!
Think your latent-noise diffusion watermarking method is robust? Think again!

We show that these schemes are susceptible to simple adversarial attacks that require only one watermarked example and an off-the-shelf encoder. The attack can forge or remove the watermark with very high accuracy.
We show that the same attack can also be used to remove the watermark from an already watermarked generated image.
This introduces negligible noise to the original image and does not alter its semantic content at all.
We show results against the Tree-Ring, RingID, WIND, and Gaussian Shading watermarking schemes, forging them with 90%+ success using a single watermarked example and a simple adversarial attack.
Our attack simply perturbs the original image so that its latent is pushed into this vulnerable region for forgery, or away from it for removal.
We show that, since DDIM inversion is performed with an empty prompt, there is an entire region of the clean latent space that gets mapped back to the secret-key-embedded latent. In fact, we show that this region is linearly separable, which can itself be used for forgery or removal (this served as motivation for the attack).
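A minimal sketch of this kind of single-example attack against a generic off-the-shelf encoder (the PGD-style loop, step sizes, and function names are illustrative assumptions, not the exact method from the paper): perturb the image so the encoder maps it close to the latent of one watermarked example for forgery, or far from it for removal.

```python
import torch

def perturb_image(image, wm_latent, encoder, forge=True,
                  steps=200, eps=4 / 255, alpha=1 / 255):
    """PGD-style sketch: push encoder(image) toward the latent of a single
    watermarked example (forgery) or away from it (removal).
    `wm_latent` would be encoder(watermarked_example).detach()."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        z = encoder(image + delta)                 # off-the-shelf image encoder
        dist = torch.norm(z - wm_latent)           # distance to the watermarked example's latent
        loss = dist if forge else -dist            # minimize to forge, maximize to remove
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()     # signed gradient step on the pixels
            delta.clamp_(-eps, eps)                # keep the perturbation imperceptible
        delta.grad = None
    return (image + delta).detach()
```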
Many thanks to all my amazing collaborators at @sonyai.bsky.social and @nyutandon.bsky.social - Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, @togelius.bsky.social and Yuki Mitsufuji.
This loss is specifically designed as a Gaussian so that unrelated concepts that are far away are not affected.

Our approach, TraSCE, achieves SOTA results on various jailbreaking benchmarks aimed at generating NSFW content. (5/n)
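A rough sketch of what such a localized, Gaussian-shaped guidance could look like in noise-prediction space (the choice of distance, sigma, and scaling here are illustrative assumptions, not necessarily TraSCE's exact formulation):

```python
import torch

def localized_gaussian_guidance(eps_cond, eps_erase, sigma=1.0, scale=1.0):
    """Sketch of a localized, Gaussian-shaped guidance term.
    eps_cond  : noise prediction for the user's prompt
    eps_erase : noise prediction conditioned on the concept to erase
    The Gaussian weight is ~1 when the prompt's prediction is close to the
    erased concept and decays to ~0 far away, leaving unrelated concepts
    essentially untouched."""
    d2 = torch.sum((eps_cond - eps_erase) ** 2)
    weight = torch.exp(-d2 / (2 * sigma ** 2))        # Gaussian of the distance
    # steer the trajectory away from the erased concept, scaled by the weight
    return eps_cond + scale * weight * (eps_cond - eps_erase)
```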
We modify the expression so that, when this is the case, the trajectory is guided by the unconditional score instead.

We further propose a localized loss-based guidance to steer the diffusion trajectory away from the space pertaining to the concept we wish to erase. (4/n)
As we show, this is because conventional negative prompting has a very obvious corner case: when a user prompts the model with the same prompt as the negative prompt set by the model owner, the denoising process is guided toward the negative prompt, i.e., the very concept we want to erase. (3/n)
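A minimal sketch of the corner-case handling described in this thread (the cosine-similarity test and threshold are illustrative assumptions, not necessarily TraSCE's exact criterion):

```python
import torch.nn.functional as F

def guided_noise(eps_uncond, eps_prompt, eps_neg, prompt_emb, neg_emb,
                 guidance_scale=7.5, sim_threshold=0.95):
    """Sketch of negative-prompt guidance with a fallback for the corner case
    where the user's prompt matches the owner's negative prompt."""
    sim = F.cosine_similarity(prompt_emb.flatten(), neg_emb.flatten(), dim=0)
    if sim > sim_threshold:
        # Corner case: guiding against the negative prompt would pull the
        # trajectory toward the erased concept, so use the unconditional score.
        return eps_uncond
    # Conventional negative prompting (negative prompt in place of unconditional).
    return eps_neg + guidance_scale * (eps_prompt - eps_neg)
```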
Our new method, TraSCE, is highly effective and requires no changes to the network weights and no new examples (images or prompts). It is based on negative prompting (NP), which is widely used for generating higher-quality samples but hasn't been successful at concept erasure. (2/n)
Diffusion models are amazing at generating high-quality images of what you ask them for, but can also generate things you didn't ask for. How do you stop a diffusion model from generating unwanted content such as nudity, violence, or the style of a particular artist? We introduce TraSCE (1/n)
Looks interesting, thanks for sharing!
Long answer short - we don't know.
During memorization, all initializations for the same prompt share a single attractor or a set of attractors (closely resembling training examples). Thus, you are unlikely to fall into the corresponding attraction basin without the memorized prompt. But quantitatively, the number of memorized images can vary.
That's a good question; here is a slightly longish answer. All outputs can be thought of as attractors that a (prompt, initialization) pair leads to. However, with the same prompt and a different initialization, the attractor changes.
We showcase that this simple approach can be applied to various models and memorization scenarios to mitigate memorization successfully.
We found that the ideal transition point corresponds to the point just after the local minimum in the magnitude of the conditional guidance. Applying standard classifier-free guidance from that point on leads to high-quality, non-memorized outputs.
We apply either no guidance or opposite guidance until an ideal transition point is reached, after which switching to standard classifier-free guidance is unlikely to generate a memorized image.
Successfully steering away from the attraction basin by applying either no guidance or opposite guidance in the initial time steps leads to regions in the denoising trajectory where the steering force is no longer higher than expected.
When this happens, the conditional guidance becomes uncharacteristically high and steers the diffusion trajectory away from an unconditionally denoised one. We show that this high steering force is only present when the trajectory is inside the attraction basin.
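Putting this thread together, a minimal sketch of the mitigation loop, assuming a diffusers-style UNet and scheduler (the local-minimum detection and the form of opposite guidance are simplified assumptions, not the paper's exact procedure):

```python
import torch

@torch.no_grad()
def sample_with_mitigation(unet, scheduler, latents, cond_emb, uncond_emb,
                           cfg_scale=7.5, opposite=True):
    """Sketch: withhold (or negate) conditional guidance until the guidance
    magnitude passes its local minimum, then switch to standard CFG.
    Assumes scheduler.set_timesteps(...) has already been called."""
    prev_mag, switched = None, False
    for t in scheduler.timesteps:
        eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
        guidance = eps_cond - eps_uncond                  # conditional "steering force"
        mag = guidance.norm().item()
        if not switched and prev_mag is not None and mag > prev_mag:
            switched = True                               # just past the local minimum: transition point
        if switched:
            eps = eps_uncond + cfg_scale * guidance       # standard classifier-free guidance
        else:
            eps = eps_uncond - cfg_scale * guidance if opposite else eps_uncond
        latents = scheduler.step(eps, t, latents).prev_sample
        prev_mag = mag
    return latents
```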