Lightnews — Scholar-powered news

Sam Blouir

@samblouir.bsky.social

General benchmark scores remain intact across 21 tasks on the EleutherAI LM Eval harness, and greatly improve on our new infilling task.

💡 With smarter training, we maintain SSMs’ efficiencies while dramatically enhancing their capabilities.

Table of results for the Story Infilling Task, in which the model is given a causal story with 3-7 entries. One entry is masked out, and the model is then asked to choose the most likely option.

Hawk with Birdie gets 42.5% accuracy,
Hawk with a causal version of Birdie gets 41.5% accuracy.
Hawk with Next Token Prediction gets only 33.1%.
That is an enormous performance boost for Hawk trained with Birdie - 42.5% vs 33.1% accuracy.

A Transformer trained with Birdie gets 42.2% accuracy, and one trained with Next Token Prediction gets 41.9%. The performance difference is more muted for the Transformer on this task, in contrast to the generative SQuAD V2 results, where the Transformer with Birdie pulled ahead strongly.
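To put the gaps above in perspective, here is a small sketch of the arithmetic. It uses only the accuracies quoted in this post; the helper name `gain` is made up for illustration.

```python
# Accuracies (%) on the story infilling task, as quoted in the post above.
scores = {
    "hawk": {"birdie": 42.5, "ntp": 33.1},
    "transformer": {"birdie": 42.2, "ntp": 41.9},
}

def gain(birdie: float, ntp: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = birdie - ntp
    relative = 100.0 * absolute / ntp
    return round(absolute, 1), round(relative, 1)

for model, s in scores.items():
    abs_pts, rel_pct = gain(s["birdie"], s["ntp"])
    print(f"{model}: +{abs_pts} points ({rel_pct}% relative)")
```

For Hawk this works out to a gain of 9.4 points (about 28% relative), versus 0.3 points (under 1% relative) for the Transformer.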

November 18, 2024 at 5:28 PM

Sam Blouir

@samblouir.bsky.social

🌟 Stellar Results:

• Multi-Phone Number Retrieval: Birdie SSMs achieve 100% accuracy on single lookups; outperform standard SSMs even more as tasks become more complex.

• SQuAD V2: We match a Transformer's performance curve across sequence lengths, while standard SSMs fall behind.

Graph of the SQuAD V2 question-answering task. The X-axis shows the context length, i.e. the length of the tokenized Wikipedia articles used as context, and the Y-axis shows "Response Contains Labels", the percentage of generated model responses that contained an acceptable answer to the question.
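A minimal sketch of how a "Response Contains Labels" style metric could be computed. The case-insensitive substring matching, the function names, and the toy data are all assumptions for illustration; the paper's exact normalization may differ.

```python
def contains_label(response: str, labels: list[str]) -> bool:
    """True if any non-empty acceptable answer appears in the response.

    Assumption: case-insensitive substring match.
    """
    text = response.lower()
    return any(label.lower() in text for label in labels if label)

def response_contains_labels(responses: list[str], label_sets: list[list[str]]) -> float:
    """Percentage of responses containing at least one acceptable answer."""
    if not responses:
        return 0.0
    hits = sum(contains_label(r, ls) for r, ls in zip(responses, label_sets))
    return 100.0 * hits / len(responses)

# Toy data, made up for illustration only.
responses = ["The capital is Paris.", "I am not sure.", "It was built in 1889"]
label_sets = [["Paris"], ["Berlin"], ["1889", "the year 1889"]]
print(response_contains_labels(responses, label_sets))
```

With this toy data, two of the three responses contain an acceptable answer, so the score is about 66.7%.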

The SQuAD V2 question-answering task entails the model reading a Wikipedia article, then being immediately asked a question about what it just read. The information is always found in the article.

Training Hawk using Birdie strongly outperforms training it using Next Token Prediction. With Next Token Prediction, performance declines sharply once the Wikipedia article length increases to about 500 tokens.
In this 500-token scenario, Hawk trained using Next Token Prediction retrieves the exact label less than 10% of the time, while the Birdie procedure results in over 55% accuracy.

When the article is only 100 tokens long, Birdie retrieves the correct answer more than 40% of the time, while the Next Token Prediction model does this less than 30% of the time.
With Birdie, Hawk matches the "context length vs performance" curves of the Transformer trained with Next Token Prediction, but has slightly worse performance.

The Transformer trained with Birdie outperforms all models, with an average of about 75% accuracy, compared to the Next Token Prediction Transformer at 60%.
Hawk trained with Birdie gets around 50%.
Hawk trained with Next Token Prediction gets around 15%.

November 18, 2024 at 5:28 PM

Sam Blouir

@samblouir.bsky.social

Meet Birdie 🐤!

Our EMNLP 2024 paper boosts SSMs like Mamba and Hawk on long-range, context-heavy tasks, closing the gap with Transformers.

Proud to work with @jimmysmith1919.bsky.social, @antonisa.bsky.social, & Amarda Shehu.

📄 Paper: arxiv.org/abs/2411.01030
💻 Code: github.com/samblouir/bi...

The multi-number phonebook retrieval task entails retrieving several phone numbers from a phonebook at once, given names.

Hawk trained using Birdie strongly outperforms Hawk trained using Next Token Prediction on the multi-number phonebook retrieval task.

Hawk trained using Next Token Prediction performs just above random guessing when retrieving 1 and 4 phone numbers, and falls to random performance when retrieving more than 4 phone numbers.

In contrast, Hawk trained using Birdie gets 100% accuracy when retrieving 1 phone number. That 100% score slowly decays to about 80% accuracy when retrieving up to 32 phone numbers simultaneously.

Two Transformers are included, one trained using Birdie, and the other trained using Next Token Prediction. They both always achieve about 100% accuracy, even when retrieving 32 phone numbers.
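The multi-number phonebook retrieval task described above can be sketched roughly as follows. The prompt wording, the placeholder names, and per-number scoring are assumptions for illustration, not the paper's actual setup.

```python
import random

def make_phonebook(n_entries: int, rng: random.Random) -> dict[str, str]:
    """Build a toy phonebook: unique placeholder names -> random 7-digit numbers."""
    return {f"Person{i}": f"{rng.randrange(10**7):07d}" for i in range(n_entries)}

def make_example(book: dict[str, str], n_lookups: int, rng: random.Random):
    """Return (prompt, expected numbers) asking for n_lookups numbers at once."""
    names = rng.sample(sorted(book), n_lookups)
    listing = "\n".join(f"{name}: {number}" for name, number in book.items())
    prompt = f"{listing}\n\nFind the phone numbers for: {', '.join(names)}"
    return prompt, [book[name] for name in names]

def score(predicted: list[str], expected: list[str]) -> float:
    """Fraction of requested numbers returned correctly, in order (assumed scoring)."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

rng = random.Random(0)
book = make_phonebook(50, rng)
prompt, expected = make_example(book, 4, rng)

print(score(expected, expected))                 # a perfect answer
print(score(expected[:2] + ["x", "x"], expected))  # two of four correct
```

Raising `n_lookups` (for example from 1 up to 32) makes the task harder in the way the post describes, since more numbers have to be recalled from the same context at once.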

November 18, 2024 at 5:28 PM

Sam Blouir

@samblouir.bsky.social

🌟 Stellar Results:

• Multi-Phone Number Retrieval: Birdie SSMs achieve 100% accuracy on single lookups; outperform standard SSMs even more as tasks become more complex.

• SQuAD V2: We match a Transformer's performance curve across sequence lengths, while standard SSMs fall behind.

Hawk (SSM) trained using Birdie strongly outperforms Hawk trained using Next Token Prediction on the SQuAD V2 question-answering task - which entails the model reading a Wikipedia article, then being immediately asked a question about what it just read. Hawk trained using Next Token Prediction strongly declines in performance when the Wikipedia article length increases to about 500 tokens. In this scenario, it retrieves the exact label less than 10% of the time. When the article was only 96 tokens long, it was correct about 25% of the time. Hawk trained using Birdie matches the performance curves of the Transformer trained with Next Token Prediction, but has slightly worse performance. The Transformer trained with Birdie outperforms all models, with an average of about 75% accuracy, compared to the Next Token Prediction Transformer at 60%. Hawk trained with Birdie gets around 50%. Hawk trained with Next Token Prediction gets around 15%.

November 18, 2024 at 4:48 PM
