Nanne van Noord
@nanne.bsky.social
290 followers 150 following 43 posts
Assistant Professor of Visual Culture and Multimedia at University of Amsterdam. http://nanne.github.io
Reposted by Nanne van Noord
euripsconf.bsky.social
And lastly, if @neuripsconf.bsky.social would choose to reverse the decisions on the papers affected by space constraints, we would be happy and able to accommodate their presentation
nanne.bsky.social
You're arguing in bad faith, so this will be my last reply.

But yes, if you actually want to learn about multimodality then you shouldn't read about MLLMs.
nanne.bsky.social
I'm not sure what the point here is, but if you're going to believe Gemini over actual research done by AI researchers there isn't much more to discuss.

If you're willing to actually learn about this then you can start here: arxiv.org/abs/2505.19614, or even here: academic.oup.com/dsh/article/...
nanne.bsky.social
That's a bit sealion-y, but I'll bite - *artificial* neural networks are a poor analogy.

Those different details also matter a lot; especially because the brain isn't just floating in a jar, it's part of an embodied system.
nanne.bsky.social
This is where your misunderstanding is happening, as they are not elementary pieces. For the visual tokens a lot of the semantics have already been determined, and hence the interpretations it can arrive at are limited.

Brain analogy really doesn't hold here. NN != Brains.
nanne.bsky.social
It's clearly not; neural nets are a poor analogy for the brain, and clearly don't work the same way.
nanne.bsky.social
This, plus the (initial) interpretation of the modalities should not be independent - even at the pixel/word-level we may want to interpret differently depending on the other modalities (e.g., sense disambiguation)

Partial Information Decomposition has been used to formalise some of this
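For concreteness, the kind of cross-modal interaction PID formalises shows up already in a toy XOR example (a sketch, not any specific PID estimator - all names here are mine): neither input alone is informative about the output, but jointly they determine it completely, i.e. pure synergy.

```python
from itertools import product
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Uniform binary inputs, output y = x1 XOR x2.
triples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
p = 1.0 / len(triples)

joint_x1_y, joint_both_y = {}, {}
for x1, x2, y in triples:
    joint_x1_y[(x1, y)] = joint_x1_y.get((x1, y), 0.0) + p
    joint_both_y[((x1, x2), y)] = joint_both_y.get(((x1, x2), y), 0.0) + p

# Each input alone tells you nothing about y; together they tell you everything.
i_single = mutual_information(joint_x1_y)   # ~0 bits
i_joint = mutual_information(joint_both_y)  # ~1 bit
```

Interpreting each modality independently would leave you with the `i_single` picture; the synergistic bit only exists at the joint level.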
nanne.bsky.social
No.. that's not how any of that works 😵‍💫
nanne.bsky.social
It means I said 'mix' to explain the process, but I obviously know this involves attention - so the Gemini explanation is not meaningfully different.

Potential is limited: if key visual info is missing, then attention won't recover it. So a lot of 'decisions' about the visual input are made before fusion
nanne.bsky.social
Ah, I see how you and Gemini misunderstood. I was talking about extracting visual tokens, and mix referred to attention.

That doesn't make it meaningfully multimodal; the potential of the visual tokens is still limited by the visual encoder.

Anyway, if I wanted to talk to an LLM I would do that directly
nanne.bsky.social
Please do explain then how whatever you're referring to is different and actually meaningfully multimodal.
nanne.bsky.social
*all semantic information* is quite the claim; in our experiments they miss a lot of the semantics from the visual modality

'text space' in that after the image encoder the visual information is fixed, and then mixed with text tokens for seq2text - which is not how multimodality works...
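The pipeline being described can be sketched roughly like this (toy Python; `visual_encoder`, `project_to_text_space`, and `fuse` are hypothetical stand-ins, not any real model's API). The point is that the visual tokens are fixed before fusion, so attention can only recombine them:

```python
def visual_encoder(image):
    """Stand-in for a frozen image encoder: maps an image to a fixed
    sequence of visual tokens. Once emitted, these tokens are the only
    visual information the language model ever sees."""
    # Toy encoding: one token per pixel-intensity bucket.
    return [f"vtok_{px // 64}" for px in image]

def project_to_text_space(visual_tokens):
    """Stand-in for the projection that maps visual tokens into the
    text embedding space (the 'mapped to text space' step)."""
    return [f"txt({t})" for t in visual_tokens]

def fuse(projected_visual, text_tokens):
    """Stand-in for attention-based fusion: it can only recombine the
    tokens it is given; detail lost in the encoder cannot be recovered."""
    return projected_visual + text_tokens

image = [10, 130, 250]          # toy "pixels"
vtoks = visual_encoder(image)   # visual semantics decided here, pre-fusion
sequence = fuse(project_to_text_space(vtoks), ["what", "is", "this"])
```

If the encoder dropped something (here, everything below the bucket resolution), no amount of downstream attention over `sequence` gets it back.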
nanne.bsky.social
'Natively' is a bit of an exaggeration, as it's mostly just other modalities mapped to text space as input - but this makes their 'understanding' rather shallow
nanne.bsky.social
This paper on identifying prompted artist names from generated images is such a fun and creative take on data attribution arxiv.org/abs/2507.18633

Wonder if it would do something meaningful for analysing artistic influence for human-made art 🤔
Identifying Prompted Artist Names from Generated Images
A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as "in the style of Greg Rutkowski". We introduce a benchmark for prompted-artist reco...
arxiv.org
nanne.bsky.social
This paper is 💯

Generally, I have the impression NLP does better at this than CV - but clearly both fields should push studying culture beyond just looking at national identities
naitian.org
naitian @naitian.org · Jul 23
I'm thrilled to be doing an oral presentation on "Culture is not Trivia" at #ACL2025 next Wednesday 7/30, as well as participating in the human-centered NLP panel afterwards!

(thanks also @lauraknelson.bsky.social for the shoutout in her #ic2s2 keynote today!)

aclanthology.org/2025.acl-lon...
A poster for "Culture is not Trivia: sociocultural theory for cultural NLP" which takes the form of a flow-chart. The central question, and the starting point of the flow chart, is "What is culture in cultural NLP?"

An arrow is labeled "wait, so what's cultural NLP?" This leads to a block explaining that the goals of cultural NLP are described in section 2 of the paper. They include inclusivity, depth, discerning, and adaptiveness.

That leads to an arrow that says "that sounds great!". But there are recurring challenges in this kind of work! Section 3 surveys some of these: a discomfort around the proxies being chosen, a lack of coverage, and a lack of dynamicity.

That in turn leads to an arrow labeled "Hm, sounds like we need to figure out..." and it leads back to the main question: "What is culture in cultural NLP?"

A final arrow extends below this block: "Well, who's to say, really?"

This points to sociocultural linguistics. Section 4 explores how other disciplines, like sociolinguistics, linguistic anthropology, and discourse analysis have faced similar challenges in the past. Section 4.2 gives an overview of sociocultural linguistics, which is a set of principles tying together some convergent themes: emergence, positionality, indexicality, relationality, and partialness.

One arrow extends from this asking, "what's that have to do with cultural NLP?" Section 5 gives a case study of how indexicality clarifies how to think about stereotypes in the context of mining cultural knowledge from the web.

Another arrow says "How can I build safe NLP systems?" Section 6.2 explores how localization can serve as a useful model for building culturally aware technologies because it forces developers to define culture explicitly and tractably.

Finally, an arrow asks "how can I study culture with NLP methods?" Section 6.1 lays out theoretically motivated directions for future empirical and theoretical work in computationally modeling culture.
nanne.bsky.social
If the priority is to dunk on people that know less about AI, instead of being accurate, that could be a conclusion I guess.
nanne.bsky.social
It would be weird to describe this 2012 system, which does search, as an SVM classifier doing search: www.robots.ox.ac.uk/%7Evgg/publi...

Similarly, I wouldn't describe an LLM that translates a query to a destination for a Waymo as an 'LLM driving a car'
nanne.bsky.social
I'm not questioning your definition of searching, I'm questioning your use of "LLMs".

I don't think defining an LLM as a transformer-based NN is inaccurate, in which case it isn't doing search by itself, and then it would be fine to argue that it can only hallucinate.
nanne.bsky.social
That statement mostly seems to apply to hosted commercial systems. It takes more than just downloading an LLM from huggingface to have a system that does this.

Sure, an LLM can be trained to formulate queries and process results, but the system doing the searching is more than 'just' an LLM.
nanne.bsky.social
Fair, but still meaningful to make the distinction between LLMs and reasoning models, as not all LLMs are reasoning models. Especially if the point is to communicate across silos.
nanne.bsky.social
Do LLMs do search? Afaik there have been systems built around LLMs that do search, and then send these results back to them (i.e., RAG-like) - but that isn't the same as an LLM doing search.
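The RAG-like division of labour can be sketched in a few lines (hedged toy code; `llm` and `search_index` are hypothetical stand-ins, not a real library): the LLM writes the query and reads the results, but the retrieval is an external component.

```python
def llm(prompt):
    """Stand-in for an LLM call: here it just extracts keywords to use
    as a query. In a real system this would be a model generation."""
    return [w for w in prompt.lower().split() if len(w) > 3]

def search_index(query_terms, documents):
    """The retrieval component - not the LLM. A real system would call
    a search engine or vector store here."""
    return [d for d in documents if any(t in d.lower() for t in query_terms)]

docs = ["Attention is all you need", "A survey of multimodal learning"]
query = llm("find papers about multimodal fusion")
results = search_index(query, docs)
# The LLM then conditions on `results`; the searching itself was done
# by the system around it, not by the LLM.
```

Drop `search_index` from this loop and the model is back to generating from its weights alone, which is the distinction being drawn.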
nanne.bsky.social
I couldn't find EurIPS registration costs; hopefully they can address this by lowering costs for authors

But yes - this has been absurd; especially for those with visa issues - and I do think for that group this is a (minor) improvement
nanne.bsky.social
Not my intention to defend the requirement for a full registration, but this has been common practice for a while across multiple conferences.

The main change with the new locations seems primarily that those with US visa issues will be able to present somewhere. But it doesn't really change costs
nanne.bsky.social
This considers registration only, no? One could register for in person, but not go - folks with visa issues have had to do this
nanne.bsky.social
This distinction is also useful because it makes it harder to avoid responsibility: it's easy to avoid directly working on surveillance - yet harder to avoid doing CV work that is surveillance-enabling.

Unless your position is that these are the same?