We will be advertising for a postdoc position soon, to work on #generative #models #structure #induction and #uncertainty with Michael Gutmann as part of @genaihub.bsky.social!
Keep an eye out, and get in touch! (#ML #AI #ICML2025)
Where this really shines is in the low-resource setting, where embeddings still play a critical role but scale just isn't available. That's what we evaluate next, and this time we compare to LLMs in the 100M–7B parameter range as well as supervised embedding models 6/🧵
Banyan turns out to be a pretty efficient learner! Its embeddings outperform our prior recursive net, as well as a RoBERTa medium (a few-million-parameter encoder) and several word-embedding baselines trained on 10x more data 5/🧵
2) We change our parameterization to a diagonal mechanism inspired by SSMs, which lets us reduce parameters by 10x while massively increasing performance 💪
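For the curious, here's roughly what "diagonal" means here (a toy sketch with made-up names, not our actual code): each child is scaled by a learned per-dimension vector instead of being multiplied through a dense matrix, so composing costs 2d parameters instead of 2d².
```python
# Toy contrast between a dense and a diagonal (SSM-style) composition function.
# Illustrative only: module and parameter names are invented for this sketch.
import torch
import torch.nn as nn

dim = 256

class DenseCompose(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim, bias=False)     # 2 * dim * dim params

    def forward(self, left, right):
        return torch.tanh(self.proj(torch.cat([left, right], dim=-1)))

class DiagonalCompose(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.ones(dim))              # per-dimension gate for left child
        self.b = nn.Parameter(torch.ones(dim))              # per-dimension gate for right child

    def forward(self, left, right):
        return torch.tanh(self.a * left + self.b * right)   # elementwise, only 2 * dim params

dense, diag = DenseCompose(dim), DiagonalCompose(dim)
print(sum(p.numel() for p in dense.parameters()))  # 131072
print(sum(p.numel() for p in diag.parameters()))   # 512
```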
For our initial benchmarks we pre-train Banyan on 10M tokens of English and evaluate on STS, retrieval and classification... 4/🧵
We can make this setup much more powerful with two changes:
1) Entangling: whenever any instance of the encoder merges the same span, we reconstruct it from every context it occurs in, learning the global connective structure of our pre-training corpus 3/🧵
Banyan is a special type of autoencoder called a Self-StrAE (see fig). Given a sequence, it needs to learn which elements to merge with each other, and in what order, to get the best compression. This means its representations model compositional semantics 2/🧵
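A toy picture of the entangling idea (illustrative only, names made up): nodes for identical spans are shared across sentences, so each shared span collects reconstruction signal from every context it appears in.
```python
# Toy sketch of "entangling": nodes for identical spans are shared across
# sentences, so one shared node gets reconstructed from every context.
# Names and the span enumeration are illustrative, not the real implementation.
from collections import defaultdict

corpus = [
    ["the", "black", "cat", "sat"],
    ["a", "black", "cat", "slept"],
]

node_contexts = defaultdict(list)   # span -> list of (sentence_id, position)

for sid, sent in enumerate(corpus):
    # enumerate every contiguous span; in practice spans come from induced merges
    for i in range(len(sent)):
        for j in range(i + 1, len(sent) + 1):
            span = tuple(sent[i:j])
            node_contexts[span].append((sid, i))

# the span ("black", "cat") has one shared node with two contexts, so its
# embedding receives a reconstruction signal from both sentences
print(node_contexts[("black", "cat")])   # [(0, 1), (1, 1)]
```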
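If you like code, here's a rough toy sketch of the merge-then-reconstruct loop (illustrative only, not the actual Banyan/Self-StrAE implementation): greedily merge the most similar adjacent pair, remember the merge order, then reverse it to rebuild the leaves.
```python
# Toy sketch of the Self-StrAE idea: merge adjacent embeddings bottom-up,
# then decode back down the induced tree and score the reconstruction.
# Illustrative only; all names here are made up for this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelfStrAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)      # merge two children -> parent
        self.decompose = nn.Linear(dim, 2 * dim)    # parent -> two children

    def encode(self, leaves):                       # leaves: (seq_len, dim)
        nodes, merges = list(leaves), []
        while len(nodes) > 1:
            # pick the adjacent pair whose embeddings are most similar
            sims = [F.cosine_similarity(nodes[i], nodes[i + 1], dim=0)
                    for i in range(len(nodes) - 1)]
            i = int(torch.stack(sims).argmax())
            parent = torch.tanh(self.compose(torch.cat([nodes[i], nodes[i + 1]])))
            merges.append(i)
            nodes[i:i + 2] = [parent]               # replace the pair with its parent
        return nodes[0], merges                     # root embedding + induced tree

    def decode(self, root, merges):
        nodes = [root]
        for i in reversed(merges):                  # undo merges in reverse order
            left, right = self.decompose(nodes[i]).chunk(2)
            nodes[i:i + 1] = [left, right]
        return torch.stack(nodes)                   # reconstructed leaf embeddings

model = ToySelfStrAE(dim=64)
leaves = torch.randn(5, 64)                         # stand-in for token embeddings
root, merges = model.encode(leaves)
recon = model.decode(root, merges)
loss = F.mse_loss(recon, leaves)                    # reconstruction = "compression" objective
```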