Donato Crisostomi ✈️ NeurIPS
@crisostomi.bsky.social
62 followers 150 following 15 posts
🧑‍🎓 PhD stud. @ Sapienza, Rome 🥷stealth-mode CEO 🔬prev visiting @ Cambridge | RS intern @ Amazon Search | RS intern @ Alexa. 🆓 time 🎭improv theater, 🤿scuba diving, ⛰️hiking
crisostomi.bsky.social
Will present this at #CVPR ✈️ See you in Nashville 🇺🇸!

Kudos to the team 👏
Antonio A. Gargiulo, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà.
crisostomi.bsky.social
📢Prepend “Singular” to “Task Vectors” and get +15% average accuracy for free!

1. Perform a low-rank approximation of layer-wise task vectors.

2. Minimize task interference by orthogonalizing inter-task singular vectors.

🧵(1/6)
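(Quick illustrative sketch of the starting point, assuming PyTorch state dicts; names are mine, not the paper's code. A task vector is just the fine-tuned weights minus the pretrained ones, taken layer by layer.)

```python
def layerwise_task_vectors(pretrained_sd: dict, finetuned_sd: dict) -> dict:
    """Layer-wise task vectors: fine-tuned weights minus pretrained weights.

    Values are torch tensors; only 2D (matrix-structured) parameters are kept,
    since those are the layers whose singular vectors the method works with.
    """
    return {
        name: finetuned_sd[name] - pretrained_sd[name]
        for name in pretrained_sd
        if pretrained_sd[name].ndim == 2
    }
```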
crisostomi.bsky.social
By orthogonalizing the (low-rank approximated) task singular vectors, we effectively eliminate interference.

This leads to an impressive +15% gain without any test-time adaptation!

The improvement is consistent across all datasets.

🧵(5/6)
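(One way this decorrelation step can look in code, as an illustration rather than the paper's exact procedure: stack each layer's per-task singular vectors and replace them with the nearest set of orthonormal columns.)

```python
import torch

def orthogonalize_across_tasks(Us: list[torch.Tensor]) -> list[torch.Tensor]:
    """Illustrative decorrelation of one layer's task singular vectors.

    Us: list of (d, k) matrices, one per task, holding that task's kept
    singular vectors. Assumes the total number of kept vectors <= d.
    Returns per-task blocks whose columns are mutually orthonormal,
    so vectors from different tasks no longer overlap.
    """
    stacked = torch.cat(Us, dim=1)                    # (d, num_tasks * k)
    P, _, Qh = torch.linalg.svd(stacked, full_matrices=False)
    ortho = P @ Qh                                    # nearest matrix with orthonormal columns
    ks = [U.shape[1] for U in Us]
    return list(torch.split(ortho, ks, dim=1))        # back to per-task blocks
```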
crisostomi.bsky.social
We believe this happens due to inter-task interference, which we measure as the interplay between singular vectors from different tasks (A).

(B) This interference is higher in shallower (more general) layers and lower in deeper (more task-specific) layers!

🧵(4/6)
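(A rough proxy for what (A) measures, as an illustration rather than the paper's exact metric: how much the top singular subspaces of two tasks' task vectors overlap at a given layer.)

```python
import torch

def subspace_interference(U_a: torch.Tensor, U_b: torch.Tensor) -> torch.Tensor:
    """Overlap between two tasks' top singular subspaces at one layer.

    U_a, U_b: (d, k) matrices whose columns are orthonormal left singular
    vectors of each task's task vector. Returns a value in [0, 1]:
    0 = orthogonal subspaces (no interference), 1 = identical subspaces.
    """
    overlaps = U_a.T @ U_b                   # (k, k) cosines between basis vectors
    return (overlaps ** 2).sum() / U_a.shape[1]
```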
crisostomi.bsky.social
But is low-rank approximation enough to achieve effective multi-task merging?

The answer is NO! In fact, it can even be detrimental.

Why is that?

🧵(3/6)
crisostomi.bsky.social
Let’s start with the low-rank structure:

By keeping only a small fraction (e.g., 10%) of task singular vectors for each model, average accuracy is preserved!

🧵(2/6)
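(In code, this step is just a truncated SVD per layer; a minimal sketch, with `keep_frac` illustrative:)

```python
import torch

def low_rank_task_vector(delta: torch.Tensor, keep_frac: float = 0.1):
    """Truncated SVD of one layer's task vector, keeping only the top
    `keep_frac` fraction of singular components."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    k = max(1, int(keep_frac * S.numel()))
    return U[:, :k], S[:k], Vh[:k, :]
    # rank-k reconstruction: U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
```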
crisostomi.bsky.social
📣 Come check it out this Friday at #NeurIPS!
crisostomi.bsky.social
I know you're probably thinking, "Yeah, these neuron-permutation-based model merging methods are cool... but are they cycle-consistent (CC)?"

Say no more!
It just so happens that our new #NeurIPS24 paper covers exactly this!

Huh? No idea what I am talking about? Read on
(1/6)
crisostomi.bsky.social
Yes, and it works best when applying REPAIR (Jordan et al., 2022) to fix the activation statistics!

The approach is designed to merge 3+ models (as CC doesn't make much sense otherwise), but if you are curious about applying Frank-Wolfe for n=2, please check the paper!

(5/6)
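(For context, a simplified sketch of the kind of activation-statistics fix REPAIR performs; heavily abridged, see Jordan et al., 2022 for the real procedure.)

```python
import torch

@torch.no_grad()
def repair_like_rescale(acts_merged, acts_a, acts_b):
    """Rescale the merged model's per-neuron activations so their mean/std
    match the average statistics of the two endpoint models.

    acts_*: (num_samples, num_neurons) activations collected on the same batch.
    Returns (scale, shift) to apply to the merged layer's outputs.
    """
    mu_t = 0.5 * (acts_a.mean(0) + acts_b.mean(0))     # target mean
    sd_t = 0.5 * (acts_a.std(0) + acts_b.std(0))       # target std
    mu_m, sd_m = acts_merged.mean(0), acts_merged.std(0)
    scale = sd_t / (sd_m + 1e-8)
    shift = mu_t - scale * mu_m
    return scale, shift
```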
crisostomi.bsky.social
Yeah but why bother?

1) Optimizing globally, we no longer have variance from the random layer order

2) The models in the universe are much more linearly connected than before

3) The models in the universe are much more similar

Does this result in better merging?
(4/6)
crisostomi.bsky.social
Ok but how?

1) Start from the weight-matching equation introduced by Git Re-Basin

2) Consider perms between all possible pairs

3) Replace each permutation A->B with one mapping to the universe A -> U and one mapping back U -> B

4) Optimize with Frank-Wolfe

(3/6)
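(A generic sketch of step 4, not the paper's full objective: Frank-Wolfe over permutations works because the linear minimization oracle is a linear assignment problem.)

```python
import torch
from scipy.optimize import linear_sum_assignment

def frank_wolfe_step(P: torch.Tensor, grad: torch.Tensor, step: float) -> torch.Tensor:
    """One Frank-Wolfe step over the convex hull of permutation matrices.

    The linear minimization oracle (argmin_S <grad, S> over permutations) is a
    linear assignment problem, solved exactly by the Hungarian algorithm; the
    iterate then moves a little toward that vertex. Iterates stay doubly
    stochastic and are rounded to a permutation at the end of optimization.
    """
    rows, cols = linear_sum_assignment(grad.detach().cpu().numpy())
    S = torch.zeros_like(P)
    S[torch.as_tensor(rows), torch.as_tensor(cols)] = 1.0
    return (1.0 - step) * P + step * S
```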
crisostomi.bsky.social
If you have 3+ models A, B, and C, permuting neurons from A to B to C and back to A gets you to a different model, as the composition of the perms is not the identity!

What then? Introduce a Universe space 🌌and use it as a midpoint 🔀

This way, CC is guaranteed!

(2/6)
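(A tiny numerical illustration of the claim; the conventions here are mine, not the paper's notation. Composing arbitrary pairwise permutations around a cycle is generally not the identity, but factoring every map through a shared universe makes the cycle close by construction.)

```python
import torch

def rand_perm(n: int) -> torch.Tensor:
    """Random n x n permutation matrix."""
    return torch.eye(n)[torch.randperm(n)]

n = 4
# Independent pairwise permutations between models A, B and C:
P_ab, P_bc, P_ca = rand_perm(n), rand_perm(n), rand_perm(n)
print(torch.allclose(P_ca @ P_bc @ P_ab, torch.eye(n)))   # usually False: no cycle consistency

# Factor every map through a shared universe space U instead:
P_au, P_bu, P_cu = rand_perm(n), rand_perm(n), rand_perm(n)   # model -> universe
P_ab = P_bu.T @ P_au   # A -> U -> B
P_bc = P_cu.T @ P_bu   # B -> U -> C
P_ca = P_au.T @ P_cu   # C -> U -> A
print(torch.allclose(P_ca @ P_bc @ P_ab, torch.eye(n)))    # True by construction
```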
crisostomi.bsky.social
First blue post (still have to figure out what tweets are called here)

💡idea: we consider task vectors at the layer level and reduce task interference by decorrelating the task-specific singular vectors of any matrix-structured layer

🔬results: large-margin improvements across all vision benchmarks
sscardapane.bsky.social
*Task Singular Vectors: Reducing Task Interference in Model Merging*

We show that task vectors are inherently low-rank, and we propose a merging method that significantly improves SOTA.

arxiv.org/abs/2412.00081