@yuluqin.bsky.social
yuluqin.bsky.social
[8/9] Finally, a preliminary exploration of why vision helps: visual similarity among the members of a concept predicts the VLM's performance better than it predicts the LM's. The effect varies by concept (which is intuitive) - concepts with higher visual cohesion seem to show larger effects!
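A minimal sketch of one way such an analysis could be run: score each concept's visual cohesion as the mean pairwise cosine similarity among image embeddings of its members, then correlate that with the VLM-LM accuracy gap. The embeddings, accuracies, and all names below are illustrative placeholders on toy random data, not the paper's actual pipeline.

```python
# Hypothetical sketch: relate a concept's visual cohesion to the VLM-LM accuracy gap.
# Assumes precomputed image embeddings per concept member (e.g., from a vision encoder)
# and per-concept accuracies; variable names are illustrative, not the paper's code.
import numpy as np
from scipy.stats import spearmanr

def visual_cohesion(image_embs: np.ndarray) -> float:
    """Mean pairwise cosine similarity among images of a concept's members."""
    normed = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(normed), k=1)   # upper triangle, excluding the diagonal
    return float(sims[iu].mean())

# concept -> (member image embeddings, VLM accuracy, LM accuracy); toy random data here
rng = np.random.default_rng(0)
concepts = {f"concept_{i}": (rng.normal(size=(20, 512)),
                             rng.uniform(0.5, 1.0), rng.uniform(0.4, 0.9))
            for i in range(30)}

cohesion = [visual_cohesion(embs) for embs, _, _ in concepts.values()]
gap = [vlm_acc - lm_acc for _, vlm_acc, lm_acc in concepts.values()]
rho, p = spearmanr(cohesion, gap)
print(f"Spearman rho between visual cohesion and VLM-LM gap: {rho:.3f} (p={p:.3f})")
```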
yuluqin.bsky.social
[7/9] Next, questions containing hypernym vs. non-hypernym substitutions were more linearly separable in the VLM than in the LM, even when restricted to questions where the gold answer was the same (“No”).

Both findings support H2!
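A minimal sketch of how such a linear-separability comparison could look, assuming hidden states for the substituted questions have already been extracted: fit a linear probe to distinguish hypernym from non-hypernym substitutions and compare held-out accuracy across the two models' representations. The data here is synthetic and the setup is illustrative, not the authors' exact protocol.

```python
# Hypothetical sketch of the linear-separability comparison: fit a linear probe on
# hidden states of hypernym- vs. non-hypernym-substituted questions and compare
# held-out probe accuracy for the VLM's vs. the LM's representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_separability(hidden_states: np.ndarray, is_hypernym: np.ndarray) -> float:
    """5-fold CV accuracy of a linear probe separating the two substitution types."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, hidden_states, is_hypernym, cv=5).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)                              # 1 = hypernym substitution
vlm_states = rng.normal(size=(400, 768)) + 0.5 * labels[:, None]   # toy: more separable
lm_states = rng.normal(size=(400, 768)) + 0.1 * labels[:, None]    # toy: less separable

print("VLM probe accuracy:", probe_separability(vlm_states, labels))
print("LM probe accuracy: ", probe_separability(lm_states, labels))
```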
yuluqin.bsky.social
[6/9] To test H2, we analyze contextual embeddings of TaxonomiGQA instances for the Qwen2.5 pair.
First, the difference in how close the hyponym was to the hypernym vs. the non-hypernym in the intermediate layers was a stronger predictor of model correctness in the VLM than in the LM.
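One plausible way to operationalize this, sketched on toy data: at a chosen intermediate layer, take the difference between cos(hyponym, hypernym) and cos(hyponym, non-hypernym), then fit a logistic regression predicting per-instance correctness from that difference and compare the fitted coefficients across the VLM and the LM. Embeddings, correctness labels, and function names below are assumptions for illustration.

```python
# Hypothetical sketch of the H2 analysis: the similarity difference at an intermediate
# layer is used to predict whether the model answered the instance correctly; a larger
# fitted coefficient corresponds to a larger effect. All inputs are toy stand-ins.
import numpy as np
import statsmodels.api as sm

def cos(a, b):
    """Row-wise cosine similarity between two matrices of embeddings."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def similarity_effect(hypo, hyper, non_hyper, correct):
    """Logistic-regression coefficient of the similarity difference on correctness."""
    sim_diff = cos(hypo, hyper) - cos(hypo, non_hyper)
    model = sm.Logit(correct, sm.add_constant(sim_diff)).fit(disp=0)
    return model.params[1]   # slope on the similarity difference

rng = np.random.default_rng(0)
n, d = 500, 1024
hypo, hyper, non_hyper = (rng.normal(size=(n, d)) for _ in range(3))
correct_vlm = rng.integers(0, 2, size=n)   # toy per-instance correctness labels
correct_lm = rng.integers(0, 2, size=n)

print("VLM effect:", similarity_effect(hypo, hyper, non_hyper, correct_vlm))
print("LM effect: ", similarity_effect(hypo, hyper, non_hyper, correct_lm))
```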
yuluqin.bsky.social
[5/9] To test H1, we perform a series of behavioral and representational evaluations, including direct elicitation of taxonomic knowledge, embedding similarity analysis, and the analysis of @kihopark.bsky.social et al. of the geometric organization of concepts. We find ~no evidence supporting H1.
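For a concrete sense of what direct elicitation could look like, here is a hedged sketch that asks a causal LM whether a taxonomic relation holds and compares the next-token scores for "Yes" vs. "No". The model name, prompt template, and scoring heuristic are illustrative assumptions, not the paper's exact evaluation.

```python
# Hypothetical sketch of "direct elicitation": ask the model whether a taxonomic
# relation holds and compare next-token logits for "Yes" vs. "No".
# Model name and prompt are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prefers_yes(model, tokenizer, hyponym: str, hypernym: str) -> bool:
    prompt = f"Question: Is a {hyponym} a kind of {hypernym}? Answer Yes or No.\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token distribution
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (logits[yes_id] > logits[no_id]).item()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print(prefers_yes(model, tokenizer, "dalmatian", "dog"))     # expect True
print(prefers_yes(model, tokenizer, "dalmatian", "vehicle")) # expect False
```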
yuluqin.bsky.social
[4/9] Why are VLMs performing better❓Two possible hypotheses are: (H1) VL training improves the underlying taxonomic knowledge in LMs; (H2) VL training improves the deployment of the taxonomic knowledge in specific task contexts. Our results support H2 but not H1.
yuluqin.bsky.social
[3/9] We test 7 VLM-LM pairs that share the same base LM (i.e., each pair differs only in the additional V+L training), and find that most VLMs outperform their LM counterparts despite TaxonomiGQA being a purely text-based task!
yuluqin.bsky.social
[2/9] We tackle this question by evaluating both VLMs & LMs on a new QA dataset, TaxonomiGQA, that requires sensitivity to taxonomic relations. TaxonomiGQA is a *TEXT-ONLY* dataset derived from GQA by substituting entities with their hypernyms and non-hypernyms.
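A rough sketch of the substitution idea behind such a dataset: given a GQA-style question mentioning an entity, generate variants where the entity is replaced by one of its hypernyms or by a sampled non-hypernym. The toy taxonomy, question, and helper below are hypothetical stand-ins, not the released TaxonomiGQA construction code.

```python
# Hypothetical sketch of the substitution step: swap an entity in a GQA-style question
# for one of its hypernyms or for a sampled non-hypernym distractor.
# The taxonomy dict and question are toy stand-ins, not the actual pipeline or data.
import random

TAXONOMY = {"dog": ["animal", "mammal"], "car": ["vehicle"]}   # hyponym -> hypernyms

def substituted_instances(question: str, entity: str, n_negatives: int = 2):
    instances = []
    for hypernym in TAXONOMY[entity]:
        instances.append((question.replace(entity, hypernym), "hypernym"))
    non_hypernyms = [h for k, hs in TAXONOMY.items() if k != entity for h in hs]
    for distractor in random.sample(non_hypernyms, k=min(n_negatives, len(non_hypernyms))):
        instances.append((question.replace(entity, distractor), "non-hypernym"))
    return instances

for text, kind in substituted_instances("Is there a dog to the left of the tree?", "dog"):
    print(kind, "->", text)
```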
yuluqin.bsky.social
Does vision training change how language is represented and used in meaningful ways?🤔The answer is a nuanced yes! Comparing VLM-LM minimal pairs, we find that while the taxonomic organization of the lexicon is similar, VLMs are better at _deploying_ this knowledge. [1/9]