Anubhav Jain
@anubhavj480.bsky.social
97 followers 50 following 25 posts
PhD Candidate @ NYU
Reposted by Anubhav Jain
New results from @anubhavj480.bsky.social, one of my co-advised students (on the job market, hint hint): a new way of forging or removing watermarks in images generated with diffusion models. This is a simple and effective adversarial attack that requires only one example!
Think your latent-noise diffusion watermarking method is robust? Think again!

We show that these schemes are susceptible to simple adversarial attacks that require only one watermarked example and an off-the-shelf encoder. The attack can forge or remove the watermark with very high accuracy.
We show that the same attack can also be used to remove the watermark from an already watermarked generated image.
This introduces negligible noise to the original image and does not alter its semantic content at all.
We show results against the Tree-Ring, RingID, WIND, and Gaussian Shading watermarking schemes, forging them with 90%+ success using a single watermarked example and a simple adversarial attack.
Our attack simply perturbs the original image so that its latent is pushed into this vulnerable region for forgery, or away from it for removal.
We show that, since DDIM inversion is performed with an empty prompt, there is an entire region of the clean latent space that gets mapped back to the secret-key-embedded latent. In fact, we show that this region is linearly separable, which can itself be used for forgery or removal (this served as motivation for the attack).
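A minimal sketch of this kind of single-example attack against a generic off-the-shelf encoder (the PGD-style loop, step sizes, and function names are illustrative assumptions, not the exact method from the paper): perturb the image so the encoder maps it close to the latent of one watermarked example for forgery, or far from it for removal.

```python
import torch

def perturb_image(image, wm_latent, encoder, forge=True,
                  steps=200, eps=4 / 255, alpha=1 / 255):
    """PGD-style sketch: push encoder(image) toward the latent of a single
    watermarked example (forgery) or away from it (removal).
    `wm_latent` would be encoder(watermarked_example).detach()."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        z = encoder(image + delta)                 # off-the-shelf image encoder
        dist = torch.norm(z - wm_latent)           # distance to the watermarked example's latent
        loss = dist if forge else -dist            # minimize to forge, maximize to remove
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()     # signed gradient step on the pixels
            delta.clamp_(-eps, eps)                # keep the perturbation imperceptible
        delta.grad = None
    return (image + delta).detach()
```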
Many thanks to all my amazing collaborators at @sonyai.bsky.social and @nyutandon.bsky.social - Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, @togelius.bsky.social and Yuki Mitsufuji.
This loss is specifically designed as a Gaussian so that unrelated concepts that are far away are not affected.

Our approach, TraSCE, achieves SOTA results on various jailbreaking benchmarks aimed at generating NSFW content. (5/n)
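A rough sketch of what such a localized, Gaussian-shaped guidance could look like in noise-prediction space (the choice of distance, sigma, and scaling here are illustrative assumptions, not necessarily TraSCE's exact formulation):

```python
import torch

def localized_gaussian_guidance(eps_cond, eps_erase, sigma=1.0, scale=1.0):
    """Sketch of a localized, Gaussian-shaped guidance term.
    eps_cond  : noise prediction for the user's prompt
    eps_erase : noise prediction conditioned on the concept to erase
    The Gaussian weight is ~1 when the prompt's prediction is close to the
    erased concept and decays to ~0 far away, leaving unrelated concepts
    essentially untouched."""
    d2 = torch.sum((eps_cond - eps_erase) ** 2)
    weight = torch.exp(-d2 / (2 * sigma ** 2))        # Gaussian of the distance
    # steer the trajectory away from the erased concept, scaled by the weight
    return eps_cond + scale * weight * (eps_cond - eps_erase)
```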
We modify the expression so that, when this is the case, the trajectory is guided by the unconditional score instead.

We further propose a localized loss-based guidance to steer the diffusion trajectory away from the space pertaining to the concept we wish to erase. (4/n)
As we show, this is because conventional negative prompting has a very obvious corner case: when a user prompts the model with the same prompt as the negative prompt set by the model owner, the denoising process is guided toward the negative prompt, i.e., the very concept we want to erase. (3/n)
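A minimal sketch of the corner-case handling described in this thread (the cosine-similarity test and threshold are illustrative assumptions, not necessarily TraSCE's exact criterion):

```python
import torch.nn.functional as F

def guided_noise(eps_uncond, eps_prompt, eps_neg, prompt_emb, neg_emb,
                 guidance_scale=7.5, sim_threshold=0.95):
    """Sketch of negative-prompt guidance with a fallback for the corner case
    where the user's prompt matches the owner's negative prompt."""
    sim = F.cosine_similarity(prompt_emb.flatten(), neg_emb.flatten(), dim=0)
    if sim > sim_threshold:
        # Corner case: guiding against the negative prompt would pull the
        # trajectory toward the erased concept, so use the unconditional score.
        return eps_uncond
    # Conventional negative prompting (negative prompt in place of unconditional).
    return eps_neg + guidance_scale * (eps_prompt - eps_neg)
```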
Our new method, TraSCE, is highly effective and requires no changes to the network weights and no new examples (images or prompts). It is based on negative prompting (NP), which is widely used for generating higher-quality samples but hasn't been successful at concept erasure. (2/n)
Diffusion models are amazing at generating high-quality images of what you ask them for, but can also generate things you didn't ask for. How do you stop a diffusion model from generating unwanted content such as nudity, violence, or the style of a particular artist? We introduce TraSCE (1/n)
Looks interesting, thanks for sharing!
Long answer short - we don't know.
During memorization, all initializations for the same prompt share a single attractor or a set of attractors (closely resembling training examples). Thus, you are unlikely to fall into the corresponding attraction basin without the memorized prompt. But quantitatively, the number of memorized images can vary.
That's a good question; here is a slightly longish answer. All outputs can be thought of as attractors that a (prompt, initialization) pair leads to. However, with the same prompt and a different initialization, the attractor changes.
We showcase that this simple approach can be applied to various models and memorization scenarios to mitigate memorization successfully.
We found that the ideal transition point corresponds to the point just after the local minimum in the magnitude of the conditional guidance. Applying standard classifier-free guidance from that point on leads to high-quality, non-memorized outputs.
We apply either no guidance or opposite guidance until an ideal transition point is reached, after which switching to standard classifier-free guidance is unlikely to generate a memorized image.
Successfully steering away from the attraction basin by applying either no guidance or opposite guidance in the initial time steps leads to regions in the denoising trajectory where the steering force is no longer higher than expected.
When this happens, the conditional guidance becomes uncharacteristically high and steers the diffusion trajectory away from an unconditionally denoised one. We show that this high steering force is only present when the trajectory is inside the attraction basin.
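Putting this thread together, a minimal sketch of the mitigation loop, assuming a diffusers-style UNet and scheduler (the local-minimum detection and the form of opposite guidance are simplified assumptions, not the paper's exact procedure):

```python
import torch

@torch.no_grad()
def sample_with_mitigation(unet, scheduler, latents, cond_emb, uncond_emb,
                           cfg_scale=7.5, opposite=True):
    """Sketch: withhold (or negate) conditional guidance until the guidance
    magnitude passes its local minimum, then switch to standard CFG.
    Assumes scheduler.set_timesteps(...) has already been called."""
    prev_mag, switched = None, False
    for t in scheduler.timesteps:
        eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
        guidance = eps_cond - eps_uncond                  # conditional "steering force"
        mag = guidance.norm().item()
        if not switched and prev_mag is not None and mag > prev_mag:
            switched = True                               # just past the local minimum: transition point
        if switched:
            eps = eps_uncond + cfg_scale * guidance       # standard classifier-free guidance
        else:
            eps = eps_uncond - cfg_scale * guidance if opposite else eps_uncond
        latents = scheduler.step(eps, t, latents).prev_sample
        prev_mag = mag
    return latents
```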